Distributed and Cloud Systems Lab

"The important thing in life is to have a great aim, and the determination to attain it" - Goethe.

DISCO: Distributed and Cloud Computing Systems Lab

The DISCO Lab aims to develop an in-depth understanding of distributed and cloud computing with augmented services, and to develop open-source techniques that enhance system performance, dependability, scalability, and sustainability. The research was supported in part by the National Science Foundation.

The DISCO Lab is located in the Osbourne science and engineering building. The server room is furnished with a datacenter blade facility that has three racks of HP ProLiant BL460C G6 blade server modules and a 40 TB HP EVA storage area network with 10 Gbps Ethernet and 8 Gbps Fibre/iSCSI dual channels. It has three APC InRow RP Air-Cooled units and UPS equipment for a maximum of 40 kW in an N+1 redundancy design.

News

Junxian's paper "FlashByte: Improving Memory Efficiency with Lightweight Native Storage" was accepted by IEEE/ACM CCGrid 2021.
Shaoqi's paper "An Efficient and Non-intrusive GPU Scheduling Framework for Deep Learning Training Systems" was accepted by ACM/IEEE SC 2020.
Wei's paper "OS-Augmented Oversubscription of Opportunistic Memory with a User-Assisted OOM Killer" was accepted by ACM Middleware 2019 (acceptance rate 24.5%).
Wei's paper "Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications" was accepted by ACM SoCC 2019 (acceptance rate 24.7%).
Eddie's paper "Semantic-aware Workflow Construction and Analysis for Distributed Data Analytic Systems" was accepted by ACM HPDC 2019 (acceptance rate 21%).
Shaoqi's paper "Scalable Distributed DL Training: Batching Communication and Computation" was accepted by AAAI 2019 (acceptance rate 16.2%).
Shaoqi's paper "Aggressive Synchronization with Partial Processing for Iterative ML Jobs on Clusters" was accepted by ACM Middleware 2018 (acceptance rate 23%).
Eddie's paper "Profiling Distributed Systems in Light-weight Virtualized Environments with Logs and Resource Metrics" was accepted by ACM HPDC 2018 (acceptance rate 19.5%).
Tiago's paper "Reference-distance Eviction and Prefetching for Cache Management in Spark" was accepted by IEEE ICPP 2018 (acceptance rate 28%).
Wei's paper "Characterizing Scheduling Delay for Low-latency Data Analytic Workloads" was accepted by IEEE IPDPS 2018 (acceptance rate 24.5%).
A joint paper with Dr. Palden Lama "Performance Isolation of Data-intensive Scale-out Applications in Multi-tenant Clouds" was accepted by IEEE IPDPS 2018 (acceptance rate 24.5%).
Wei's paper "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" was accepted by USENIX ATC 2017 (acceptance rate 21%).
Wei's paper "Addressing Memory Pressure in Data-Intensive Parallel Programs via Container based Virtualization" was accepted by IEEE ICAC 2017.
Wei's paper "Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks" was accepted by IEEE IPDPS 2017 (acceptance rate 23%).
Shaoqi's paper "Network-Adaptive Scheduling of Data-Intensive Parallel Jobs in Clusters" was accepted by IEEE ICAC 2017.
Dazhao's paper "Adaptive Scheduling of Parallel Jobs in Spark Streaming" was accepted by IEEE INFOCOM 2017 (acceptance rate 21%).
A joint paper with Dr. Bo Wu "FLEP: Enabling Flexible and Efficient Preemption on GPUs" was accepted by ACM ASPLOS 2017 (acceptance rate 17%).

Recent Project

SHF: Small: Lightweight Virtualization Driven Elastic Memory Management and Cluster Scheduling (Sponsor: NSF SHF-1816850, PI: X. Zhou, 07/2018 - 06/2022)

Datacenters are evolving to host heterogeneous workloads on shared clusters to reduce operational cost and achieve high resource utilization. However, it is challenging to schedule heterogeneous workloads with diverse resource requirements and performance constraints on heterogeneous hardware. Data-parallel processing often suffers from interference and significant memory pressure, resulting in excessive garbage collection and out-of-memory errors that harm application performance and reliability. Cluster memory management and scheduling are still inefficient, leading to low utilization and poor multi-service support. Existing approaches focus on either application awareness or operating system awareness, and thus are not well positioned to address the semantic gap between application runtimes and the operating system. This project aims to improve application performance and cluster efficiency via lightweight virtualization-enabled elastic memory management and cluster scheduling. It combines system experimentation with rigorous design and analysis to improve performance and efficiency, and to tackle the memory pressure of data-parallel processing. The developed system software will be open-sourced, providing opportunities to foster a large ecosystem that spans system software providers and customers.