Search Results for author: Mostafa Mahmoud

Found 7 papers, 0 papers with code

Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training

no code implementations28 Apr 2022 Miloš Nikolić, Enrique Torres Sanchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos

We introduce a software-hardware co-design approach to reduce memory traffic and footprint during training with BFloat16 or FP32, boosting energy efficiency and execution-time performance.
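
No code accompanies the paper, so the following is only a minimal NumPy sketch of the underlying idea: dynamically truncating FP32 mantissas so that a (hypothetical) memory packer could store values in narrower containers. The function name truncate_mantissa and the error-budget sweep are illustrative assumptions, not the authors' mechanism.

```python
import numpy as np

def truncate_mantissa(x: np.ndarray, keep_bits: int) -> np.ndarray:
    """Zero out the low (23 - keep_bits) mantissa bits of FP32 values.

    With fewer live mantissa bits, a (hypothetical) packer could store
    each value in a narrower off-chip container, cutting memory traffic.
    """
    assert x.dtype == np.float32 and 0 <= keep_bits <= 23
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (x.view(np.uint32) & mask).view(np.float32)

# Sweep container sizes and watch the introduced error.
rng = np.random.default_rng(0)
grads = rng.standard_normal(1024).astype(np.float32)
for keep in (23, 10, 7, 4):
    err = np.abs(grads - truncate_mantissa(grads, keep)).max()
    print(f"mantissa bits kept={keep:2d}  max abs error={err:.2e}")
```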

Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models

no code implementations23 Mar 2022 Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos

Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids.

Quantization
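
Since no implementation is available, here is a hedged sketch of the dictionary-quantization recipe the abstract describes: cluster tensor values into 16 centroids (addressable by 4-bit indexes) and round the centroids to 16-bit fixed point. Plain 1-D k-means and the Q1.15 format below are stand-ins for illustration, not Mokey's actual centroid-selection method.

```python
import numpy as np

def mokey_style_quantize(w: np.ndarray, n_centroids: int = 16, iters: int = 20):
    """Map every value to a 4-bit index into a dictionary of 16-bit
    fixed-point centroids (n_centroids=16 -> 4-bit indexes)."""
    flat = w.ravel().astype(np.float64)
    # 1-D Lloyd's k-means, initialized at the value quantiles.
    centroids = np.quantile(flat, np.linspace(0, 1, n_centroids))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_centroids):
            members = flat[idx == k]
            if members.size:
                centroids[k] = members.mean()
    # Round centroids to 16-bit fixed point (Q1.15, scaled to max magnitude).
    scale = float(np.abs(centroids).max()) or 1.0
    table = np.round(centroids / scale * 32767).astype(np.int16)
    return idx.astype(np.uint8).reshape(w.shape), table, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
idx, table, scale = mokey_style_quantize(w)
w_hat = table[idx].astype(np.float32) / 32767 * scale
print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```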

APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference

no code implementations21 Jan 2022 Alberto Delmas Lascorz, Mostafa Mahmoud, Andreas Moshovos

When integrated with a Tensorcore-based accelerator, APack boosts the speedup and energy efficiency to 1.44x and 1.37x, respectively.

Data Compression, Quantization
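
APack itself builds on arithmetic coding; with no code released, the snippet below uses a much simpler stand-in that still conveys the off-chip lossless-packing idea: each small group of quantized values is stored with only the bit-width its largest member needs, plus a per-group width field. The group size of 16 and the 3-bit width field are assumptions.

```python
import numpy as np

def group_bits(vals: np.ndarray) -> int:
    """Bits needed to pack one group: a 3-bit width field (widths 1..8)
    plus each value stored at the width of the group's maximum."""
    width = max(int(vals.max()).bit_length(), 1)
    return 3 + width * vals.size

def compressed_bits(tensor: np.ndarray, group: int = 16) -> int:
    flat = tensor.ravel()
    return sum(group_bits(flat[i:i + group]) for i in range(0, flat.size, group))

rng = np.random.default_rng(0)
# Quantized activations tend to cluster near zero, which packing exploits.
acts = np.minimum(rng.geometric(0.35, 4096) - 1, 255).astype(np.uint8)
raw_bits = acts.size * 8
print(f"compression ratio: {raw_bits / compressed_bits(acts):.2f}x")
```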

FPRaker: A Processing Element For Accelerating Neural Network Training

no code implementations15 Oct 2020 Omar Mohamed Awad, Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Ciaran Bannon, Anand Jayarajan, Gennady Pekhimenko, Andreas Moshovos

We demonstrate that FPRaker can be used to compose an accelerator for training and that it can improve performance and energy efficiency compared to using conventional floating-point units under iso-compute area constraints.

Quantization
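
No implementation accompanies FPRaker, so this is a behavioral sketch of the term-serial idea: the multiplier operand is decomposed into signed power-of-two terms, and terms whose contribution falls below the accumulator's least significant bit are skipped. The greedy decomposition and the acc_lsb_exp cutoff are illustrative simplifications of the hardware.

```python
import math

def pow2_terms(x: float, max_terms: int = 8):
    """Greedy signed power-of-two decomposition, most significant term
    first; e.g. 0.75 -> [+2**-1, +2**-2]."""
    terms, r = [], x
    for _ in range(max_terms):
        if r == 0.0:
            break
        t = math.copysign(2.0 ** math.floor(math.log2(abs(r))), r)
        terms.append(t)
        r -= t
    return terms

def term_serial_mac(acc: float, a: float, w: float, acc_lsb_exp: int = -24) -> float:
    """Accumulate a*w one power-of-two term of w at a time, skipping
    terms whose product would land below the accumulator's LSB."""
    for t in pow2_terms(w):
        if a != 0.0 and math.floor(math.log2(abs(a * t))) < acc_lsb_exp:
            continue  # out of the accumulator's precision window: skip
        acc += a * t
    return acc

print(term_serial_mac(0.0, 0.5, 0.75))  # 0.375
```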

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

no code implementations1 Sep 2020 Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, Andreas Moshovos

TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams.
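
There is no released code; the following back-of-envelope model estimates the upper-bound speedup an idealized TensorDash-like front end could obtain by letting only operand pairs where both values are nonzero occupy multiplier lanes. Perfect packing is assumed here, whereas the real design restricts how far each lane can reach ahead in the stream.

```python
import numpy as np

def dense_cycles(n: int, lanes: int = 16) -> int:
    """A lanes-wide MAC array consumes the stream in fixed chunks."""
    return -(-n // lanes)  # ceil division

def sparse_cycles(a: np.ndarray, w: np.ndarray, lanes: int = 16) -> int:
    """Only pairs where both operands are nonzero occupy a lane
    (perfect packing: an upper bound on the real design)."""
    effectual = int(np.count_nonzero((a != 0) & (w != 0)))
    return max(dense_cycles(effectual, lanes), 1)

rng = np.random.default_rng(0)
a = rng.standard_normal(4096) * (rng.random(4096) > 0.6)  # ~60% zeros
w = rng.standard_normal(4096) * (rng.random(4096) > 0.5)  # ~50% zeros
print(f"ideal speedup: {dense_cycles(a.size) / sparse_cycles(a, w):.2f}x")
```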

Laconic Deep Learning Computing

no code implementations10 May 2018 Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos

A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient.

Image Classification
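
With no code available, this sketch only models the work a Laconic-style bit-serial PE would perform: one single-bit product per pair of set bits in the two operands, so effectual work is the product of the operands' set-bit counts. Treating values as plain 8-bit magnitudes, and omitting the Booth-style recoding the actual design may use to shrink term counts further, are simplifications.

```python
import numpy as np

def set_bits(v: int) -> int:
    """Nonzero power-of-two terms in an 8-bit magnitude."""
    return bin(v).count("1")

def laconic_work(a: np.ndarray, w: np.ndarray) -> int:
    """One single-bit product per (set activation bit, set weight bit)
    pair: the work a Laconic-style bit-serial PE actually performs."""
    return sum(set_bits(x) * set_bits(y)
               for x, y in zip(a.tolist(), w.tolist()))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, 4096)  # 8-bit activations
w = rng.integers(0, 256, 4096)  # 8-bit weights
bit_parallel = a.size * 8 * 8   # every bit pair, effectual or not
print(f"work reduction: {bit_parallel / laconic_work(a, w):.2f}x")
```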

Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

no code implementations9 Mar 2018 Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos

We show that, during inference with Convolutional Neural Networks (CNNs), more than 2x to 8x ineffectual work can be exposed if, instead of targeting those weights and activations that are zero, we target different combinations of value stream properties.
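
No code accompanies the paper; the snippet below reproduces the flavor of that measurement, comparing how much ineffectual work different targeting policies expose relative to a dense 8b x 8b baseline. The sparsity rates and uniform value distribution are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 14
w = rng.integers(0, 256, n) * (rng.random(n) > 0.5)  # ~50% zero weights
a = rng.integers(0, 256, n) * (rng.random(n) > 0.4)  # ~40% zero activations

full = n * 8 * 8  # dense 8b x 8b baseline: every bit pair is "work"
policies = {
    "skip zero weights": int((w != 0).sum()) * 64,
    "skip zero weights and activations": int(((w != 0) & (a != 0)).sum()) * 64,
    "skip zero bits of both operands": sum(
        bin(int(x)).count("1") * bin(int(y)).count("1") for x, y in zip(w, a)),
}
for name, work in policies.items():
    print(f"{name:35s} -> {full / max(work, 1):5.2f}x work exposed")
```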
