1 code implementation • 15 Feb 2024 • Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman
Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs.
Ranked #1 on Math Word Problem Solving on MAWPS (using extra training data)
no code implementations • 27 Jun 2023 • Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg
Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.
no code implementations • 18 Mar 2023 • Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.
1 code implementation • NeurIPS 2019 • Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao
The use of momentum in stochastic gradient methods has become a widespread practice in machine learning.
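The momentum practice mentioned here is the classic heavy-ball update, in which a velocity term accumulates past gradients. A minimal sketch (the function name and hyperparameter values are illustrative, not from the paper):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One heavy-ball momentum step.

    The velocity v accumulates an exponentially decayed sum of past
    gradients; the weights move along the velocity, not the raw gradient.
    """
    v = beta * v + grad   # accumulate gradient history
    w = w - lr * v        # step along the accumulated direction
    return w, v
```

With `beta=0` this reduces to plain SGD; larger `beta` smooths the trajectory across noisy stochastic gradients.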
no code implementations • WS 2018 • Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, Paulius Micikevicius
We present OpenSeq2Seq - an open-source toolkit for training sequence-to-sequence models.
Automatic Speech Recognition (ASR) +4
3 code implementations • 25 May 2018 • Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Carl Case, Paulius Micikevicius
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training.
Automatic Speech Recognition (ASR) +4
1 code implementation • 28 Apr 2018 • Igor Gitman, Jieshi Chen, Eric Lei, Artur Dubrawski
In this paper, we propose two novel approaches to solving this problem.
no code implementations • 9 Jan 2018 • Igor Gitman, Deepak Dilipkumar, Ben Parr
The basic idea of both of these algorithms is to make each step of the gradient descent proportional to the current weight norm and independent of the gradient magnitude.
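The idea described above — a step whose size is proportional to the current weight norm and independent of the gradient magnitude — can be sketched as follows. This is a minimal illustration under that description only; the function name and `eps` safeguard are assumptions, not the paper's implementation:

```python
import numpy as np

def proportional_update(w, grad, lr=0.01, eps=1e-8):
    """One gradient step whose magnitude is lr * ||w||.

    The gradient is used only for its direction (normalized to unit
    length), so the step size does not depend on the gradient magnitude.
    """
    grad_dir = grad / (np.linalg.norm(grad) + eps)  # unit direction
    step = lr * np.linalg.norm(w)                   # scale with weight norm
    return w - step * grad_dir
```

Because the step scales with the weight norm rather than the gradient norm, the relative change of the weights per iteration stays roughly constant, which is the same intuition behind layer-wise schemes such as LARS.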
no code implementations • ICLR 2018 • Boris Ginsburg, Igor Gitman, Yang You
Using LARS, we scaled AlexNet and ResNet-50 to a batch size of 16K.
no code implementations • 24 Sep 2017 • Igor Gitman, Boris Ginsburg
However, it is not clear if these algorithms could replace BN in practical, large-scale applications.
12 code implementations • 13 Aug 2017 • Yang You, Igor Gitman, Boris Ginsburg
Using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K without loss in accuracy.