Revisiting transposed convolutions for interpreting raw waveform sound event recognition CNNs by sonification
The majority of recent work on the interpretability of audio and speech processing deep neural networks (DNNs) interprets the spectral information modelled by the first layer, relying solely on visual means of interpretation. In this work, we propose \textit{sonification}, a method for interpreting the intermediate feature representations of sound event recognition (SER) convolutional neural networks (CNNs) trained on raw waveforms. Sonification maps these representations back into the discrete-time input signal domain, rendering the substructures of the input that maximally activate a feature map as intelligible acoustic events. We use sonifications to compare supervised and self-supervised feature hierarchies, and we show how sonification works synergistically with signal processing techniques and visual means of representation, aiding the interpretability of SER models.
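The core idea of mapping a feature map back into the waveform domain can be sketched with a single-channel 1D convolution and its transpose. This is a minimal illustration, not the paper's implementation: the filter, stride, ReLU gating, and all variable names here are hypothetical choices for demonstration.

```python
import numpy as np

def conv1d(x, w, stride):
    # valid 1D cross-correlation, single input/output channel
    k = len(w)
    n = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n)])

def transposed_conv1d(a, w, stride, out_len):
    # scatter each activation back into the signal domain with the
    # same kernel: the transpose of the convolution above
    k = len(w)
    y = np.zeros(out_len)
    for i, v in enumerate(a):
        y[i * stride:i * stride + k] += v * w
    return y

# toy waveform and filter (hypothetical values)
rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # "raw waveform"
w = np.hanning(8)             # stand-in learned filter
stride = 4

acts = conv1d(x, w, stride)
acts = np.maximum(acts, 0)    # keep only positive (activating) responses
sonif = transposed_conv1d(acts, w, stride, len(x))
print(sonif.shape)  # → (64,)
```

The result `sonif` has the same length as the input, so it can be listened to directly; regions where the filter responded strongly carry most of its energy. In a multi-layer CNN the same back-mapping would be applied layer by layer.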