1 code implementation • 16 Oct 2023 • Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, YuHeng Chen, Shigang Li
As a result, a substantial amount of training time is devoted to checkpoint saving and loading, task rescheduling and restarting, and manual anomaly checks, which greatly harms overall training efficiency.
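A minimal sketch of the periodic checkpoint save/restore pattern this work targets, assuming a PyTorch training loop; the file path and state layout are illustrative, not the paper's system.

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # Persist model and optimizer state so a failed task can restart from here.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    # Restore the latest snapshot after a task restart.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```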
1 code implementation • 19 Jun 2023 • Wenqi Jiang, Shigang Li, Yu Zhu, Johannes De Fine Licht, Zhenhao He, Runbin Shi, Cedric Renggli, Shuai Zhang, Theodoros Rekatsinas, Torsten Hoefler, Gustavo Alonso
Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents.
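A brute-force NumPy sketch of the vector-similarity primitive behind such search engines; production systems use approximate indexes and hardware acceleration (the subject of this paper), and the array names here are illustrative.

```python
import numpy as np

def top_k_similar(query, documents, k=5):
    # Cosine similarity between one encoded query and all document vectors.
    q = query / np.linalg.norm(query)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k most similar documents, best first.
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Example: 1000 encoded documents of dimension 128.
docs = np.random.randn(1000, 128).astype(np.float32)
idx, sims = top_k_similar(np.random.randn(128).astype(np.float32), docs)
```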
2 code implementations • 8 May 2023 • Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, Torsten Hoefler
Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms.
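A hedged sketch of the idea: scale the gradient by an approximate inverse curvature before the update. The diagonal Adagrad-style preconditioner below is only one simple instance of the family of methods the paper studies.

```python
import numpy as np

def preconditioned_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients as a cheap diagonal curvature estimate.
    accum += grad ** 2
    # Precondition: elementwise scale the gradient by the inverse root curvature.
    update = grad / (np.sqrt(accum) + eps)
    return w - lr * update, accum

w = np.zeros(4)
accum = np.zeros(4)
for _ in range(10):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5, 3.0]))  # gradient of a quadratic
    w, accum = preconditioned_step(w, grad, accum)
```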
no code implementations • 12 Apr 2023 • Heyu Chen, Jianfeng Li, Shigang Li
Direction estimation infers the tilt angle of the panoramic image.
1 code implementation • 25 Nov 2022 • Kazuki Osawa, Shigang Li, Torsten Hoefler
Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters.
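A toy sketch of splitting a batch into micro-batches that flow through pipeline stages; real schedules (e.g. 1F1B or the bidirectional pipelines studied here) overlap these steps across devices, and the stage modules below are illustrative.

```python
import torch
import torch.nn as nn

# Two illustrative pipeline stages; in practice each lives on its own accelerator.
stage0 = nn.Linear(16, 32)
stage1 = nn.Linear(32, 4)

def pipeline_forward(batch, num_microbatches=4):
    outputs = []
    for micro in batch.chunk(num_microbatches):
        # Stage 0 output is handed to stage 1; with real pipelining the next
        # micro-batch enters stage 0 while stage 1 is still busy.
        outputs.append(stage1(stage0(micro)))
    return torch.cat(outputs)

out = pipeline_forward(torch.randn(8, 16))
```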
1 code implementation • 14 Sep 2022 • Shigang Li, Kazuki Osawa, Torsten Hoefler
We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores.
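This is not Magicube's API, but a NumPy sketch of the operation it accelerates: a sparse low-precision integer matrix multiplied by a dense one, with the product accumulated in a wider integer type as Tensor-core integer pipelines do. The density and shapes are illustrative.

```python
import numpy as np

# Illustrative low-precision operands: int8 values with a sparse mask on A.
rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(64, 128), dtype=np.int8)
A[rng.random(A.shape) > 0.1] = 0          # keep ~10% of entries (unstructured sparsity)
B = rng.integers(-8, 8, size=(128, 32), dtype=np.int8)

# Accumulate the product in int32 to avoid overflow of the int8 operands.
C = A.astype(np.int32) @ B.astype(np.int32)
```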
no code implementations • 3 Sep 2022 • Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution.
1 code implementation • 19 Jan 2022 • Shigang Li, Torsten Hoefler
However, it is very challenging to obtain real performance improvement because of (1) the difficulty of designing a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead.
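A hedged sketch of the top-k gradient sparsification that makes the allreduce sparse; the selection rule and error-feedback handling vary across methods, and the names below are illustrative rather than the paper's implementation.

```python
import numpy as np

def topk_sparsify(grad, k):
    # Keep only the k largest-magnitude gradient entries; communicate (indices, values).
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    values = grad[idx]
    # The dropped remainder would typically be fed back into the next step.
    residual = grad.copy()
    residual[idx] = 0.0
    return idx, values, residual

grad = np.random.randn(1_000_000).astype(np.float32)
idx, vals, residual = topk_sparsify(grad, k=10_000)
```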
1 code implementation • 20 Oct 2021 • Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler
Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute.
1 code implementation • 14 Jul 2021 • Shigang Li, Torsten Hoefler
For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
1 code implementation • 30 Jun 2020 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
Transformers are one of the most important machine learning workloads today.
1 code implementation • 18 May 2020 • Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler
Applied to global data, our mixed models achieve a relative improvement in ensemble forecast skill (CRPS) of over 14%.
no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
no code implementations • 2 Nov 2019 • Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler
Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations.
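A minimal sketch of the statistics such an ensemble system collects, assuming the perturbed forecasts are already stacked into one array; the member count and field shape are illustrative.

```python
import numpy as np

# 20 perturbed forecast members of a 2D field (e.g. temperature on a lat/lon grid).
members = np.random.randn(20, 181, 360).astype(np.float32)

# Nonparametric ensemble statistics used for uncertainty quantification.
ens_mean = members.mean(axis=0)
ens_spread = members.std(axis=0)
quantiles = np.quantile(members, [0.1, 0.5, 0.9], axis=0)
```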
no code implementations • NeurIPS 2021 • Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
Perhaps surprisingly, we show that a variant of SGD called \emph{SwarmSGD} still converges in this setting, even if \emph{non-blocking communication}, \emph{quantization}, and \emph{local steps} are all applied \emph{in conjunction}, and even if the node data distributions and underlying graph topology are both \emph{heterogeneous}.
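A toy sketch of the local-steps-then-pairwise-averaging pattern behind such decentralized SGD variants, assuming two nodes meet and average their models; quantization and non-blocking communication are omitted, and the loss is a placeholder quadratic.

```python
import numpy as np

def local_steps(w, data, lr=0.05, steps=4):
    # Each node runs a few local SGD steps on its own data before interacting.
    for x in data[:steps]:
        grad = 2 * (w - x)          # gradient of a simple quadratic loss
        w = w - lr * grad
    return w

w_i, w_j = np.zeros(3), np.ones(3)
w_i = local_steps(w_i, np.random.randn(8, 3))
w_j = local_steps(w_j, np.random.randn(8, 3))
# When two nodes interact, they average their models (pairwise gossip).
avg = 0.5 * (w_i + w_j)
w_i, w_j = avg.copy(), avg.copy()
```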
no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.