1 code implementation • 16 Oct 2023 • Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, YuHeng Chen, Shigang Li
As a result, a substantial amount of training time is devoted to checkpoint saving and loading, task rescheduling and restarting, and manual anomaly checks, which greatly harms overall training efficiency.
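A minimal sketch of the periodic checkpoint save/restore pattern this work targets, assuming a PyTorch training loop; the file path and state layout are illustrative, not the paper's system.

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # Persist model and optimizer state so a failed task can restart from here.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    # Restore the latest snapshot after a task restart.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```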
1 code implementation • 19 Jun 2023 • Wenqi Jiang, Shigang Li, Yu Zhu, Johannes De Fine Licht, Zhenhao He, Runbin Shi, Cedric Renggli, Shuai Zhang, Theodoros Rekatsinas, Torsten Hoefler, Gustavo Alonso
Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents.
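A brute-force NumPy sketch of the vector-similarity primitive behind such search engines; production systems use approximate indexes and hardware acceleration (the subject of this paper), and the array names here are illustrative.

```python
import numpy as np

def top_k_similar(query, documents, k=5):
    # Cosine similarity between one encoded query and all document vectors.
    q = query / np.linalg.norm(query)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k most similar documents, best first.
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Example: 1000 encoded documents of dimension 128.
docs = np.random.randn(1000, 128).astype(np.float32)
idx, sims = top_k_similar(np.random.randn(128).astype(np.float32), docs)
```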
2 code implementations • 8 May 2023 • Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, Torsten Hoefler
Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms.
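A hedged sketch of the idea: scale the gradient by an approximate inverse curvature before the update. The diagonal Adagrad-style preconditioner below is only one simple instance of the family of methods the paper studies.

```python
import numpy as np

def preconditioned_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients as a cheap diagonal curvature estimate.
    accum += grad ** 2
    # Precondition: elementwise scale the gradient by the inverse root curvature.
    update = grad / (np.sqrt(accum) + eps)
    return w - lr * update, accum

w = np.zeros(4)
accum = np.zeros(4)
for _ in range(10):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5, 3.0]))  # gradient of a quadratic
    w, accum = preconditioned_step(w, grad, accum)
```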
no code implementations • 12 Apr 2023 • Heyu Chen, Jianfeng Li, Shigang Li
Direction estimation infers the tilt angle of the panoramic image.
1 code implementation • 25 Nov 2022 • Kazuki Osawa, Shigang Li, Torsten Hoefler
Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters.
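A toy sketch of splitting a batch into micro-batches that flow through pipeline stages; real schedules (e.g. 1F1B or the bidirectional pipelines studied here) overlap these steps across devices, and the stage modules below are illustrative.

```python
import torch
import torch.nn as nn

# Two illustrative pipeline stages; in practice each lives on its own accelerator.
stage0 = nn.Linear(16, 32)
stage1 = nn.Linear(32, 4)

def pipeline_forward(batch, num_microbatches=4):
    outputs = []
    for micro in batch.chunk(num_microbatches):
        # Stage 0 output is handed to stage 1; with real pipelining the next
        # micro-batch enters stage 0 while stage 1 is still busy.
        outputs.append(stage1(stage0(micro)))
    return torch.cat(outputs)

out = pipeline_forward(torch.randn(8, 16))
```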
1 code implementation • 14 Sep 2022 • Shigang Li, Kazuki Osawa, Torsten Hoefler
We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores.
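This is not Magicube's API, but a NumPy sketch of the operation it accelerates: a sparse low-precision integer matrix multiplied by a dense one, with the product accumulated in a wider integer type as Tensor-core integer pipelines do. The density and shapes are illustrative.

```python
import numpy as np

# Illustrative low-precision operands: int8 values with a sparse mask on A.
rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(64, 128), dtype=np.int8)
A[rng.random(A.shape) > 0.1] = 0          # keep ~10% of entries (unstructured sparsity)
B = rng.integers(-8, 8, size=(128, 32), dtype=np.int8)

# Accumulate the product in int32 to avoid overflow of the int8 operands.
C = A.astype(np.int32) @ B.astype(np.int32)
```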
no code implementations • 3 Sep 2022 • Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution.
1 code implementation • 19 Jan 2022 • Shigang Li, Torsten Hoefler
However, it is very challenging to obtain real performance improvement because of (1) the difficulty of designing a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead.
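A hedged sketch of the top-k gradient sparsification that makes the allreduce sparse; the selection rule and error-feedback handling vary across methods, and the names below are illustrative rather than the paper's implementation.

```python
import numpy as np

def topk_sparsify(grad, k):
    # Keep only the k largest-magnitude gradient entries; communicate (indices, values).
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    values = grad[idx]
    # The dropped remainder would typically be fed back into the next step.
    residual = grad.copy()
    residual[idx] = 0.0
    return idx, values, residual

grad = np.random.randn(1_000_000).astype(np.float32)
idx, vals, residual = topk_sparsify(grad, k=10_000)
```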
1 code implementation • 20 Oct 2021 • Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler
Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute.
1 code implementation • 14 Jul 2021 • Shigang Li, Torsten Hoefler
For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
1 code implementation • 30 Jun 2020 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
Transformers are one of the most important machine learning workloads today.
1 code implementation • 18 May 2020 • Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler
Applied to global data, our mixed models achieve a relative improvement in ensemble forecast skill (CRPS) of over 14%.
no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
no code implementations • 2 Nov 2019 • Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler
Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations.
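A minimal sketch of the statistics such an ensemble system collects, assuming the perturbed forecasts are already stacked into one array; the member count and field shape are illustrative.

```python
import numpy as np

# 20 perturbed forecast members of a 2D field (e.g. temperature on a lat/lon grid).
members = np.random.randn(20, 181, 360).astype(np.float32)

# Nonparametric ensemble statistics used for uncertainty quantification.
ens_mean = members.mean(axis=0)
ens_spread = members.std(axis=0)
quantiles = np.quantile(members, [0.1, 0.5, 0.9], axis=0)
```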
no code implementations • NeurIPS 2021 • Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
Perhaps surprisingly, we show that a variant of SGD called \emph{SwarmSGD} still converges in this setting, even if \emph{non-blocking communication}, \emph{quantization}, and \emph{local steps} are all applied \emph{in conjunction}, and even if the node data distributions and underlying graph topology are both \emph{heterogeneous}.
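A toy sketch of the local-steps-then-pairwise-averaging pattern behind such decentralized SGD variants, assuming two nodes meet and average their models; quantization and non-blocking communication are omitted, and the loss is a placeholder quadratic.

```python
import numpy as np

def local_steps(w, data, lr=0.05, steps=4):
    # Each node runs a few local SGD steps on its own data before interacting.
    for x in data[:steps]:
        grad = 2 * (w - x)          # gradient of a simple quadratic loss
        w = w - lr * grad
    return w

w_i, w_j = np.zeros(3), np.ones(3)
w_i = local_steps(w_i, np.random.randn(8, 3))
w_j = local_steps(w_j, np.random.randn(8, 3))
# When two nodes interact, they average their models (pairwise gossip).
avg = 0.5 * (w_i + w_j)
w_i, w_j = avg.copy(), avg.copy()
```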
no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.