There are two general approaches to audio processing with deep learning:
- Turn the audio file into an image, typically a log-scaled mel spectrogram.
- Process the raw audio data directly as a stream, usually read in its binary form.
Convert Audio File into Image
Typically we convert to a log-scaled mel spectrogram as follows:
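A minimal sketch of this conversion, using only NumPy. The function names, the HTK-style mel formula, and the default parameters (`n_fft`, `hop_length`, `n_mels`) are assumptions chosen for illustration, and the input is a synthetic 440 Hz tone rather than a real audio file. Libraries such as librosa provide `librosa.feature.melspectrogram` and `librosa.power_to_db` for the same purpose.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel scale (one common convention; an assumption here)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop_length=256, n_mels=64):
    # Slice the signal into overlapping frames and apply a Hann window
    n_frames = 1 + (len(y) - n_fft) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    # Power spectrum of each frame, then projection onto the mel filters
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T
    # Log scaling in decibels relative to the maximum; the floor avoids log(0)
    log_mel = 10.0 * np.log10(np.maximum(mel_power, 1e-10))
    return log_mel - log_mel.max()

# Synthetic one-second 440 Hz tone standing in for a loaded audio file
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
spec = log_mel_spectrogram(tone, sr)
print(spec.shape)  # (n_mels, n_frames) = (64, 59)
```

The resulting 2D array can be saved or plotted as an image, with one row per mel band and one column per time frame.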
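This creates an image in which the horizontal axis is time, the vertical axis is mel-scaled frequency, and the brightness of each pixel is the log-scaled energy.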
With this representation we can apply any neural network architecture designed for images, most commonly a convolutional neural network (CNN). That is the purpose of the image transformation.
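To make the idea concrete, here is a rough sketch of what a single convolutional layer does to a spectrogram treated as an image. It uses plain NumPy, random untrained filters, and a random array standing in for a real spectrogram, so it only illustrates the shapes and operations involved. The helper name `conv2d_valid` and all sizes are hypothetical; in practice a deep learning framework would supply the layers and the training loop.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Sliding-window cross-correlation with no padding,
    # which is what deep learning libraries call "convolution"
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((64, 59))    # stand-in for a log-mel spectrogram
kernels = rng.standard_normal((4, 3, 3)) * 0.1  # four untrained 3x3 filters

# Convolution followed by ReLU gives one feature map per filter
feature_maps = np.stack(
    [np.maximum(conv2d_valid(spectrogram, k), 0.0) for k in kernels]
)
# Global average pooling reduces each feature map to a single number
pooled = feature_maps.mean(axis=(1, 2))

print(feature_maps.shape)  # (4, 62, 57)
print(pooled.shape)        # (4,)
```

A real classifier would stack several such layers and feed the pooled features into a final layer that predicts the label.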