1 code implementation • 19 Apr 2021 • Jack Kosaian, K. V. Rashmi
Algorithm-based fault tolerance (ABFT) is emerging as an efficient approach for fault tolerance in neural networks (NNs).
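As a rough illustration of the idea behind ABFT, the sketch below applies the classic column-checksum test to a matrix product, the core operation in most NN layers. It is a generic textbook-style example, not the scheme from the paper above; the function names (`matmul`, `column_sums`, `abft_check`) and the tolerance parameter are assumptions made for this illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_sums(m):
    """Sum each column of a matrix, i.e. compute 1^T M."""
    return [sum(col) for col in zip(*m)]


def abft_check(a, b, c, tol=1e-9):
    """Return True if c is consistent with a @ b under the column checksum.

    Relies on the identity (1^T A) B = 1^T (A B): the checksum row of A,
    multiplied by B, must equal the column sums of the claimed product.
    """
    expected = matmul([column_sums(a)], b)[0]   # cheap: one extra row
    actual = column_sums(c)
    return all(abs(e - s) <= tol for e, s in zip(expected, actual))


# Hypothetical example data.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
ok_before = abft_check(a, b, c)

c[0][1] += 1                      # simulate a single corrupted output element
ok_after = abft_check(a, b, c)
```

The check costs one extra row of multiplication instead of a full recomputation, which is where the efficiency of checksum-based approaches comes from. A single checksum can miss errors that cancel out within a column.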
no code implementations • 5 Apr 2021 • Kaige Liu, Jack Kosaian, K. V. Rashmi
We present ECRM, a training system for deep learning recommendation models (DLRMs) that achieves efficient fault tolerance using erasure coding.
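To make the erasure-coding idea concrete, here is a minimal sketch of a single-parity sum code over equal-length data shards. The listing does not describe ECRM's actual encoding, so everything here is an assumption for illustration: the helper names `encode_parity` and `recover_shard`, the use of integer lists as stand-ins for shards of model state, and the choice of a simple sum code that tolerates one lost shard.

```python
def encode_parity(shards):
    """Element-wise sum of equal-length shards, kept as one parity shard."""
    return [sum(vals) for vals in zip(*shards)]


def recover_shard(surviving, parity):
    """Rebuild the single missing shard: parity minus the sum of survivors."""
    if surviving:
        survivors_sum = encode_parity(surviving)
    else:
        survivors_sum = [0] * len(parity)
    return [p - s for p, s in zip(parity, survivors_sum)]


# Hypothetical shards, e.g. slices of a parameter table held on three servers.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
parity = encode_parity(shards)

# Suppose the server holding shards[1] fails; rebuild it from the rest.
rebuilt = recover_shard([shards[0], shards[2]], parity)
```

Compared with full replication, this stores one extra shard rather than a second copy of every shard, at the price of having to read all surviving shards during recovery.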
no code implementations • 2 May 2019 • Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman
To scale to high query rates, prediction serving systems run on many machines in cluster settings and are therefore prone to slowdowns and failures that inflate tail latency and violate strict latency targets.
3 code implementations • 4 Jun 2018 • Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman
To the best of our knowledge, this work proposes the first learning-based approach for designing codes, and also presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation.
1 code implementation • 7 May 2015 • K. V. Rashmi, Ran Gilad-Bachrach
Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice.