1 code implementation • 23 May 2023 • Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna
Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware.
3 code implementations • 24 Mar 2023 • William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
In this paper, we extend the open-source ASTRA-sim infrastructure with the capability to model state-of-the-art and emerging distributed training models and platforms.
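The collective-communication costs such simulators model can be illustrated analytically. The sketch below is a minimal stand-in, not ASTRA-sim's actual API or cost model: it estimates ring all-reduce time under the standard alpha-beta model, where each of the p NPUs moves 2(p-1)/p of the message over each link.

```python
# Illustrative sketch (not ASTRA-sim's actual API): an analytical cost
# model for a ring all-reduce, the kind of collective that distributed
# training simulators evaluate across different topologies.

def ring_allreduce_time(num_npus: int, message_bytes: float,
                        link_bandwidth: float, link_latency: float) -> float:
    """Estimate ring all-reduce time with the standard alpha-beta model.

    The ring runs 2 * (p - 1) steps (reduce-scatter plus all-gather),
    each moving a chunk of size N / p, so every link carries
    2 * (p - 1) / p * N bytes in total.
    """
    p = num_npus
    steps = 2 * (p - 1)
    chunk = message_bytes / p
    return steps * (link_latency + chunk / link_bandwidth)

# Example: a 1 GB gradient all-reduce across 8 NPUs over 100 GB/s links.
t = ring_allreduce_time(8, 1e9, 100e9, 1e-6)
print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```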
no code implementations • 30 Nov 2022 • Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis
To facilitate design space exploration for such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow for jointly studying the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
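To make the joint exploration concrete, here is a hypothetical sketch of sweeping parallelization strategy and link bandwidth together; the cost model, parameter values, and search grid are all illustrative assumptions, not COMET's actual methodology.

```python
# Hypothetical design-space sweep: jointly vary parallelization degrees
# and provisioned bandwidth, scoring each point with a toy step-time
# model (a stand-in, not COMET's cost model).
from itertools import product

params_bytes = 4 * 175e9   # assumed: a 175B-parameter model in fp32
compute_s = 2.0            # assumed per-step compute time on one NPU (s)

def step_time(dp, tp, bw):
    # Toy model: tensor parallelism divides compute; data parallelism
    # all-reduces each rank's gradient shard, with no compute overlap.
    comm = 2 * (dp - 1) / dp * (params_bytes / tp) / bw
    return compute_s / tp + comm

best = min(product([2, 4, 8, 16],          # data-parallel degree
                   [1, 2, 4, 8],           # tensor-parallel degree
                   [25e9, 100e9, 400e9]),  # link bandwidth (bytes/s)
           key=lambda cfg: step_time(*cfg))
print("best (dp, tp, bandwidth):", best)
```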
no code implementations • 22 Jul 2022 • Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna
Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the design of an optimized yet low-overhead congestion control scheme tailored to the characteristics of distributed training platforms and workloads.
no code implementations • 9 Oct 2021 • Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna
Distributed training reduces DNN training time by splitting the work across multiple NPUs (e.g., GPUs/TPUs).
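As a concrete illustration of the idea (a framework-free sketch, not the paper's system): in data parallelism, each NPU computes gradients on its shard of the batch, and the per-shard gradients are averaged, which is exactly what an all-reduce collective performs.

```python
# Minimal data-parallelism sketch: each simulated "NPU" computes the
# gradient of a squared loss on its batch shard; averaging the shard
# gradients emulates an all-reduce, so every replica steps identically.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                       # shared model weights
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)

num_npus = 4
shards = zip(np.array_split(X, num_npus), np.array_split(y, num_npus))

# Each "NPU" computes its local gradient on its shard of the batch.
local_grads = [2 * Xi.T @ (Xi @ w - yi) / len(yi) for Xi, yi in shards]

# All-reduce: average the gradients, then take one synchronized step.
g = np.mean(local_grads, axis=0)
w -= 0.01 * g
```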
no code implementations • 24 Sep 2021 • William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
As model sizes in machine learning continue to scale, distributed training becomes necessary both to fit model weights within each device's memory and to reduce training time.
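A back-of-envelope calculation shows why weights alone can force distribution; the numbers below are illustrative assumptions, not figures from the paper.

```python
# Illustrative arithmetic: training state for a trillion-parameter model
# versus the memory of a single accelerator.
params = 1e12          # a trillion-parameter model
bytes_per_param = 2    # fp16 weights
optimizer_mult = 4     # assumed: Adam-style extra optimizer state
device_mem = 80e9      # one 80 GB accelerator

total = params * bytes_per_param * (1 + optimizer_mult)
devices = int(-(-total // device_mem))      # ceiling division
print(f"training state: {total / 1e12:.1f} TB -> "
      f"needs >= {devices} devices just to hold it")
```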
no code implementations • 19 Aug 2020 • Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna
In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
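For intuition, here is a minimal sketch of the setting: a trained model's layers split across nodes, with each node applying its contiguous slice in sequence. The paper's actual restructuring and pruning approach goes well beyond this naive pipeline split.

```python
# Naive pipeline partition of an already-trained model across nodes;
# purely illustrative of the problem setting, not the paper's method.
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.normal(size=(16, 16)) for _ in range(8)]   # trained weights

num_nodes = 4
per_node = len(layers) // num_nodes
partitions = [layers[i * per_node:(i + 1) * per_node]
              for i in range(num_nodes)]

def run_on_node(x, node_layers):
    # Each node applies its slice of layers, then hands the activation
    # to the next node (here, simply the next loop iteration).
    for W in node_layers:
        x = np.maximum(W @ x, 0.0)           # ReLU MLP layer
    return x

x = rng.normal(size=16)
for node_layers in partitions:
    x = run_on_node(x, node_layers)
```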