7 Nov 2023 • Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu
We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication.
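The bottleneck-minimization objective can be illustrated with a simplified sketch: splitting a linear chain of layers with known per-layer costs into $k$ contiguous stages so that the most expensive stage is as cheap as possible. This is an assumption-laden toy version (the paper addresses general model graphs and includes communication costs, both omitted here); the function name and cost values are illustrative.

```python
# Toy sketch of the bottleneck objective: partition a chain of layer
# costs into at most k contiguous stages, minimizing the max stage cost.
# Simplified illustration only -- ignores communication and general
# graph structure treated in the paper.

def min_bottleneck_partition(costs, k):
    """Smallest achievable bottleneck (max stage cost) for a chain."""
    def stages_needed(cap):
        # Greedily pack layers into stages without exceeding `cap`.
        stages, cur = 1, 0
        for c in costs:
            if c > cap:
                return float("inf")  # a single layer exceeds the cap
            if cur + c > cap:
                stages += 1
                cur = c
            else:
                cur += c
        return stages

    # Binary search over the bottleneck value (integer costs assumed).
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Example: six layers split into 3 stages -> bottleneck 8
# (stages [4,2], [7,1], [5,3]).
print(min_bottleneck_partition([4, 2, 7, 1, 5, 3], 3))  # -> 8
```

The greedy feasibility check is monotone in the cap, which is what makes the binary search correct; the paper's setting replaces this chain structure with general DNN graphs.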