SHF: Small: Lightweight Virtualization Driven Elastic Memory Management and Cluster Scheduling (NSF SHF-1816850, 7/2018-6/2021)
Project Description and Goals
Data centers are evolving to host heterogeneous workloads on shared clusters to reduce operational cost and achieve high resource utilization. However, it is challenging to schedule heterogeneous workloads with diverse resource requirements and performance constraints on heterogeneous hardware. Data-parallel processing often suffers from interference and significant memory pressure, resulting in excessive garbage collection and out-of-memory errors that harm application performance and reliability. Cluster memory management and scheduling remain inefficient, leading to low utilization and poor multi-service support. Existing approaches focus on either application awareness or operating-system awareness, and thus are not well positioned to address the semantic gap between application runtimes and the operating system. This project aims to improve application performance and cluster efficiency via lightweight virtualization-enabled elastic memory management and cluster scheduling. It combines system experimentation with rigorous design and analysis to improve performance and efficiency and to tackle the memory pressure of data-parallel processing. The developed system software will be open-sourced, providing opportunities to foster a large ecosystem that spans system software providers and customers.
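To illustrate the basic idea of elastic, container-driven memory management, the minimal sketch below periodically resizes a container's cgroup v2 memory limit to track its observed usage. The cgroup path, headroom, and polling interval are assumptions chosen for illustration only; the project's systems implement substantially more sophisticated policies that coordinate with application runtimes.

```python
#!/usr/bin/env python3
"""Minimal sketch: elastically adjusting a container's memory limit via cgroup v2.

An illustration of the general approach (grow/shrink a container's hard memory
limit as its usage changes), not the project's actual implementation. The
cgroup path and thresholds below are hypothetical.
"""
import time

CGROUP = "/sys/fs/cgroup/example-job"   # hypothetical container cgroup
HEADROOM = 256 * 1024 * 1024            # keep 256 MiB of slack above usage
MIN_LIMIT = 512 * 1024 * 1024           # never shrink the limit below 512 MiB


def read_usage(cgroup: str) -> int:
    """Return the cgroup's current memory usage in bytes (memory.current)."""
    with open(f"{cgroup}/memory.current") as f:
        return int(f.read())


def set_limit(cgroup: str, limit: int) -> None:
    """Set the cgroup's hard memory limit in bytes (memory.max)."""
    with open(f"{cgroup}/memory.max", "w") as f:
        f.write(str(limit))


def elastic_loop(cgroup: str, interval: float = 1.0) -> None:
    """Periodically track usage and keep the limit at usage + headroom."""
    while True:
        usage = read_usage(cgroup)
        set_limit(cgroup, max(usage + HEADROOM, MIN_LIMIT))
        time.sleep(interval)


if __name__ == "__main__":
    elastic_loop(CGROUP)
```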
The research project is conducted in a lab located in the new science and engineering building. The server room is furnished with an HP data-center blade facility comprising three racks of HP ProLiant BL460c G6 blade server modules and a 40 TB HP EVA storage area network with 10 Gbps Ethernet and 8 Gbps Fibre Channel/iSCSI dual channels. It has three APC InRow RP air-cooled cooling and UPS units supporting a maximum of 40 kW in an n+1 redundancy design.
Participants
- Dr. Xiaobo Zhou, Principal Investigator
- Junxian Zhao, PhD Research Assistant (2019.8 - )
- Branden Boling, Graduate Research Assistant (2021.1 - )
- Aidi Pi, PhD Research Assistant (2018.9 - 2021.5)
- Wei Chen, PhD Research Assistant (2019.1 - 2019.5)
- Shaoqi Wang, PhD Research Assistant (2019.6 - 2020.5)
- Will Zeller, REU student (2018.9 - 2019.4)
- Jisop Lee, REU student (2018.10 - 2019.5)
Project-sponsored Publications & Other Products
- “Improving Concurrent GC for Latency Critical Services in Multi-tenant Systems”, Junxian Zhao, Aidi Pi, Xiaobo Zhou, Sangyoon Chang, and Chengzhong Xu. Proc. of the 23rd ACM/IFIP International Middleware Conference (Middleware), 13 pages, Nov 2022. The product is open-sourced at iGC.
- “Holmes: SMT Interference Diagnosis and CPU Scheduling for Job Co-location”, Aidi Pi, Xiaobo Zhou, and Chengzhong Xu. Proc. of the 31st ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 12 pages, June 2022. The product is open-sourced at Holmes.
- “Memory at Your Service: Fast Memory Allocation for Latency Critical Services”, Aidi Pi, Junxian Zhao, Shaoqi Wang, and Xiaobo Zhou. Proc. of the 22nd ACM/IFIP International Middleware Conference (Middleware), 13 pages, November 2021. The product is open-sourced at Hermes.
- “FlashByte: Improving Memory Efficiency with Lightweight Native Storage”, Junxian Zhao, Aidi Pi, Shaoqi Wang, and Xiaobo Zhou, Proc. of the 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 10 pages, May 2021. The product is open-sourced at FlashByte.
- “Profiling and Improving Performance of Data-Intensive Applications in Cloud Systems”, Aidi Pi, PhD Thesis, University of Colorado Colorado Springs, May 2021.
- “Partitioning Communication and Computation in Parameter Server for Scalable DL Training”, Shaoqi Wang, Aidi Pi, Xiaobo Zhou, Jun Wang, and Cheng-Zhong Xu, IEEE Transactions on Parallel and Distributed Systems, Pages: 2144 - 2159, September 2021.
- “Preemptive and Low Latency Datacenter Scheduling via Lightweight Containers”, Wei Chen, Xiaobo Zhou, and Jia Rao, IEEE Transactions on Parallel and Distributed Systems, Pages: 2749 - 2762, December 2020.
- “An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems”, Shaoqi Wang, Oscar J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, and Thomas Woo, Proc. of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 13 pages, November 2020.
- “Toward Scalable Distributed Machine Learning on Data-Parallel Clusters”, Shaoqi Wang, PhD Thesis, University of Colorado Colorado Springs, May 2020.
- “OS-Augmented Oversubscription of Opportunistic Memory with a User-Assisted OOM Killer”, Wei Chen, Aidi Pi, Shaoqi Wang, and Xiaobo Zhou, Proc. of the 20th ACM/IFIP International Middleware Conference (Middleware), 13 pages, Davis, CA, December 2019. The product IntelLog is open-sourced at Github.
- “Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications”, Wei Chen, Aidi Pi, Shaoqi Wang, and Xiaobo Zhou, Proc. of the 10th ACM Symposium on Cloud Computing (SoCC), 12 pages, Santa Cruz, CA, November 2019.
- “Semantic-aware Workflow Construction and Analysis for Distributed Data Analytic Systems”, Aidi Pi, Wei Chen, Shaoqi Wang, and Xiaobo Zhou, Proc. of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 12 pages, Phoenix, June 2019. The product IntelLog is open-sourced at Github.
- “It Can Understand the Logs, Literally”, Aidi Pi, Wei Chen, Will Zeller, and Xiaobo Zhou, Proc. of the IEEE IPDPS Workshop on High-Performance Big Data and Cloud Computing, 6 pages, Rio de Janeiro, May 2019.
- “Lightweight Virtualization Driven Runtimes for Big-Data Applications”, Wei Chen, PhD Thesis, University of Colorado Colorado Springs, May 2019.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. SHF-1816850. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).