Detailed experiment description
This is a detailed description of the experiment.
For a summary, see the experiment overview.
The code for the entire experiment can be found in this GitHub project.
Data Preparation
Step 1: We convert each chorus to WAV audio using the Python library Pydub. We export the data to a single-channel, 44100-sample-rate WAV file, which we then read with SciPy's wavfile module:
import os
from pydub import AudioSegment
from scipy.io import wavfile
sound = AudioSegment.from_mp3(os.sep.join([AUDIO_FOLDER, audio_file])).set_channels(1)
audio_file_wave = sound.export(format="wav", bitrate=RATE)
sample_rate, samples = wavfile.read(audio_file_wave)
Step 2: Each WAV file is converted to a Mel-scale spectrogram with 128 frequency buckets. The mel scale is a scale of pitches that most listeners judge to be equally spaced from one another, so it reflects the way humans hear pitch better than raw distances in Hz alone. I chose 128 frequency buckets because, on the one hand, it is enough to place different half-tones in different buckets (within the human hearing range), and on the other hand, it is small enough to keep the number of nodes in our machine down. Since the input size is the number of buckets times the chunk size, fewer buckets means less computation for the machine when it is trained.
For this conversion, we use the MelSpectrogram transform from torchaudio (part of the PyTorch ecosystem):
from torchaudio import transforms
spectrogrammer = transforms.MelSpectrogram(sample_rate=RATE, n_fft=(MEL_SPECTROGRAM_BUCKETS * 2 - 2), win_length=MEL_SPECTROGRAM_WINDOW_LENGTH, power=2, normalized=True)
We then apply the transform to each audio file:
import torch
spectrogram = spectrogrammer(torch.from_numpy(samples / (2 ** 15)).float().reshape((1, -1)))  # scale the 16-bit PCM samples to [-1, 1]
Step 3: Each spectrogram is divided into chunks of 196 windows, so that each chunk corresponds to exactly 0.5 seconds of the original audio. I chose 0.5 seconds for ease of calculation, and because it is long enough to perceive a tone, yet short enough to avoid covering too many emotion changes in a single chunk. The spectrogram windows are 224 samples long with a hop size of 112 samples, so consecutive windows overlap by half a window. At 44100 samples per second, 196 non-overlapping windows of 224 samples would cover exactly one second (44100 / 224 = 196); with the 112-sample hop the windows advance only half that far, so the 196 windows of a chunk actually span 0.5 seconds of audio.
In other words, each 128 X 196 spectrogram chunk represents 0.5 seconds of the audio.
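For illustration, the chunking step might look like the sketch below (a simplified version, not necessarily the project's exact code; split_into_chunks is a hypothetical helper and the chunks here do not overlap):
CHUNK_SIZE = 196  # windows per chunk, i.e. 0.5 seconds of audio

def split_into_chunks(spectrogram):
    """Split a (1, 128, num_windows) mel spectrogram into (128, 196) chunks."""
    chunks = []
    num_windows = spectrogram.shape[-1]
    for start in range(0, num_windows - CHUNK_SIZE + 1, CHUNK_SIZE):
        chunks.append(spectrogram[0, :, start:start + CHUNK_SIZE])
    return chunks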
Neural Network Architecture
The architecture of the Neural Network can be divided into 3 sections - the CNN, the LSTM, and the ANN. The input of the machine is a spectrogram chunk of size 128 X 196 (196 frames of 128 frequency buckets).
The data first goes through two CNN tiers, each composed of two convolutional layers and one pooling layer, with a ReLU gate between the layers.
The data then goes through three more CNN tiers, each composed of one convolutional layer, a ReLU gate, one pooling layer, and a Dropout layer with a 0.25 drop ratio.
The data continues into an LSTM layer, and then into 5 Linear layers that gradually reduce the data size to 2. After the first and second Linear layers there is a Dropout layer with a 0.5 drop ratio.
The final activation function is the Identity, since the output is treated as a regression problem.
The code for the machine, in PyTorch format:
import torch
import torch.nn as nn


class AudioLSTMCNN2(nn.Module):
    def __init__(self, out_size: int = 2, cnn_channels: int = 64):
        """
        For spectrograms with 128 buckets and a chunk size of 196 windows, the input size is (128, 196).
        """
        # call the parent constructor
        super(AudioLSTMCNN2, self).__init__()
        # CNN tier 1: two convolutional layers followed by max pooling
        self.conv11 = nn.Conv2d(in_channels=1, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.relu11 = nn.ReLU()
        self.conv12 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu12 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # CNN tier 2: two convolutional layers followed by max pooling
        self.conv21 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu21 = nn.ReLU()
        self.conv22 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu22 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # CNN tiers 3-5: one convolution, max pooling and dropout each
        self.conv3 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels * 2, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu3 = nn.ReLU()
        self.maxpool3 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout3 = nn.Dropout(p=0.25)
        self.conv4 = nn.Conv2d(in_channels=cnn_channels * 2, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu4 = nn.ReLU()
        self.maxpool4 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout4 = nn.Dropout(p=0.25)
        self.conv5 = nn.Conv2d(in_channels=cnn_channels * 4, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu5 = nn.ReLU()
        self.maxpool5 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout5 = nn.Dropout(p=0.25)
        # LSTM over the flattened CNN features
        self.lstm6 = nn.LSTM(cnn_channels * 4, cnn_channels * 4)  # , batch_first=True)
        self.hidden = (torch.zeros(1, 1, cnn_channels * 4),
                       torch.zeros(1, 1, cnn_channels * 4))
        # ANN: five linear layers reducing the size gradually down to out_size
        self.fc6 = nn.Linear(in_features=cnn_channels * 4, out_features=cnn_channels * 2)
        self.dropout6 = nn.Dropout(p=0.5)
        self.fc7 = nn.Linear(in_features=cnn_channels * 2, out_features=cnn_channels)
        self.dropout7 = nn.Dropout(p=0.5)
        self.fc8 = nn.Linear(in_features=cnn_channels, out_features=cnn_channels // 2)
        self.fc9 = nn.Linear(in_features=cnn_channels // 2, out_features=cnn_channels // 4)
        self.fc10 = nn.Linear(in_features=cnn_channels // 4, out_features=out_size)
        self.final = nn.Identity()

    def forward(self, x):
        # treat the spectrogram chunk as a single-channel image: (batch=1, channels=1, 128, 196)
        x = x.reshape((1, 1, x.shape[0], -1))
        x = self.conv11(x)
        x = self.relu11(x)
        x = self.conv12(x)
        x = self.relu12(x)
        x = self.maxpool1(x)
        x = self.conv21(x)
        x = self.relu21(x)
        x = self.conv22(x)
        x = self.relu22(x)
        x = self.maxpool2(x)
        x = self.conv3(x)
        x = self.relu3(x)
        x = self.maxpool3(x)
        x = self.dropout3(x)
        x = self.conv4(x)
        x = self.relu4(x)
        x = self.maxpool4(x)
        x = self.dropout4(x)
        x = self.conv5(x)
        x = self.relu5(x)
        x = self.maxpool5(x)
        x = self.dropout5(x)
        # collapse the spatial dimensions and put the CNN channels last: (1, 1, 256),
        # which the LSTM reads as (seq_len, batch, features)
        x = x.view(x.size(0), x.size(1), -1)
        x = x.permute(0, 2, 1)
        x, self.hidden = self.lstm6(x, self.hidden)
        x = x.view(x.size(0), -1)
        x = self.fc6(x)
        x = self.dropout6(x)
        x = self.fc7(x)
        x = self.dropout7(x)
        x = self.fc8(x)
        x = self.fc9(x)
        x = self.fc10(x)
        final_x = self.final(x.reshape((-1)))
        return final_x
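To sanity-check the architecture, one can push a random chunk-sized tensor through the machine (a usage sketch, not part of the project code):
model = AudioLSTMCNN2()
dummy_chunk = torch.randn(128, 196)  # one spectrogram chunk: 128 buckets X 196 windows
with torch.no_grad():
    prediction = model(dummy_chunk)
print(prediction.shape)  # torch.Size([2]) -- the two output values (valence and arousal)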
The layers of the machine and their input sizes:
Name | Description and parameters | Input size
conv11 | Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 1 X 128 X 196
relu11 | ReLU() | 1 X 64 X 128 X 196
conv12 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 128 X 196
relu12 | ReLU() | 1 X 64 X 128 X 196
maxpool1 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 128 X 196
conv21 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98
relu21 | ReLU() | 1 X 64 X 64 X 98
conv22 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98
relu22 | ReLU() | 1 X 64 X 64 X 98
maxpool2 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 64 X 98
conv3 | Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 32 X 49
relu3 | ReLU() | 1 X 128 X 32 X 49
maxpool3 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 128 X 32 X 49
dropout3 | Dropout(p=0.25, inplace=False) | 1 X 128 X 10 X 16
conv4 | Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 128 X 10 X 16
relu4 | ReLU() | 1 X 256 X 10 X 16
maxpool4 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 10 X 16
dropout4 | Dropout(p=0.25, inplace=False) | 1 X 256 X 3 X 5
conv5 | Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 256 X 3 X 5
relu5 | ReLU() | 1 X 256 X 3 X 5
maxpool5 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 3 X 5
dropout5 | Dropout(p=0.25, inplace=False) | 1 X 256 X 1 X 1
lstm6 | LSTM(256, 256) | 1 X 1 X 256
fc6 | Linear(in_features=256, out_features=128, bias=True) | 1 X 256
dropout6 | Dropout(p=0.5, inplace=False) | 1 X 128
fc7 | Linear(in_features=128, out_features=64, bias=True) | 1 X 128
dropout7 | Dropout(p=0.5, inplace=False) | 1 X 64
fc8 | Linear(in_features=64, out_features=32, bias=True) | 1 X 64
fc9 | Linear(in_features=32, out_features=16, bias=True) | 1 X 32
fc10 | Linear(in_features=16, out_features=2, bias=True) | 1 X 16
final | Identity() | 1 X 2
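These input sizes can be reproduced with PyTorch forward hooks; the following is a small sketch (not part of the project code) that prints one row per layer for a random chunk:
def report_input_size(name):
    def hook(module, inputs, output):
        # inputs is the tuple of arguments passed to the layer; inputs[0] is the data tensor
        print(f'{name:10} | {module} | input size: {tuple(inputs[0].shape)}')
    return hook

model = AudioLSTMCNN2()
for name, layer in model.named_children():
    layer.register_forward_hook(report_input_size(name))
with torch.no_grad():
    model(torch.randn(128, 196))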
We use MSE to calculate the loss, since it measures the distance between the ground truth and the hypothesis on the two-dimensional Thayer-model axes.
We use the Adam optimizer, starting with a learning rate of 0.0001, which is halved whenever the mean loss has not improved for 3 epochs.
We run the training for 10 epochs, with a batch size of 50 audio files.
The code:
import random
import time

import numpy as np

# trainset, trainset_quadrant_to_keys and get_quadrant are helpers defined elsewhere in the project

model_c = AudioLSTMCNN2().cuda()
LEARNING_RATE = 0.0001
criterion = nn.MSELoss().cuda()
optimizer = torch.optim.Adam(model_c.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3, factor=0.5)

start_time = time.time()
EPOCS = 6
PRINT_MARK = 500
BATCH_SIZE = 50
STOP_LOSS = 0.01
MIN_LEARNING_RATE = 0.000001

model_c.train()
model_c.hidden = (model_c.hidden[0].cuda(), model_c.hidden[1].cuda())
for epoc in range(EPOCS):
    losses = list()
    quadrants = list()
    real_quadrants = list()
    # sample an equal number of audio files from each quadrant, then shuffle their chunks
    train_key_sample = [key for keys in [random.sample(trainset_quadrant_to_keys[i + 1], BATCH_SIZE // 4) for i in range(4)] for key in keys]
    random.shuffle(train_key_sample)
    train_sample = [datum for sample_key in train_key_sample for datum in trainset[sample_key]]
    for batch_i, (X_train, (valence, arousal)) in enumerate(train_sample):
        # detach the LSTM hidden state so gradients do not flow across samples
        model_c.hidden = tuple([each.data for each in model_c.hidden])
        optimizer.zero_grad()
        y_train = torch.Tensor((valence, arousal)).cuda()
        # Apply the model
        y_pred = model_c(X_train.cuda())  # we don't flatten X-train here
        loss = criterion(y_pred, y_train)
        # Update parameters
        loss.backward(retain_graph=True)
        optimizer.step()
        losses.append(loss.cpu().item())
        real_quadrants.append(get_quadrant(y_train[0].item(), y_train[1].item()))
        quadrants.append(get_quadrant(y_pred[0].item(), y_pred[1].item()))
        # Print interim results
        if (batch_i > 0 or epoc == 0) and batch_i % PRINT_MARK == 0:
            print(f'{epoc:2}-{batch_i:4} | loss: {np.mean(losses):.5f} | [{quadrants.count(1):4}({real_quadrants.count(1):4}), {quadrants.count(2):4}({real_quadrants.count(2):4}), {quadrants.count(3):4}({real_quadrants.count(3):4}), {quadrants.count(4):4}({real_quadrants.count(4):4})] lr: {optimizer.param_groups[0]["lr"]}')
    scheduler.step(np.mean(losses))
    if np.mean(losses) < STOP_LOSS or optimizer.param_groups[0]["lr"] < MIN_LEARNING_RATE:
        break

print(f'\nDuration: {time.time() - start_time:.0f} seconds')  # print the time elapsed
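Once trained, the machine can be applied to unseen chunks in evaluation mode. A minimal sketch (X_test stands for a held-out spectrogram chunk and is not defined in the code above):
model_c.eval()
with torch.no_grad():
    valence_pred, arousal_pred = model_c(X_test.cuda())
print(get_quadrant(valence_pred.item(), arousal_pred.item()))  # predicted Thayer quadrant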
Curated Testset
To create the audio test files, we use the midiutil, PyDub and mido libraries. I created an AudioData class that makes it easy to generate audio files from our required parameters (pitch, length, volume and beat).
For example, a script to create 25 seconds of a continuous A note:
one_tone_A = AudioData()
one_tone_A.add_sound([69], 0, 50, 100)
one_tone_A.save_to_wav(os.sep.join(["..", "data", "test_audio", "one_tone_A.wav"]))
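For reference, the MIDI-writing part that a class like AudioData wraps could look roughly like the sketch below, using midiutil (the tempo of 120 BPM and the note length of 50 beats are assumptions made for this sketch; rendering the MIDI file to WAV is not shown):
from midiutil import MIDIFile

midi = MIDIFile(1)  # a single track
midi.addTempo(track=0, time=0, tempo=120)
# MIDI pitch 69 is the note A (A4); 50 beats at 120 BPM last 25 seconds
midi.addNote(track=0, channel=0, pitch=69, time=0, duration=50, volume=100)
with open("one_tone_A.mid", "wb") as midi_file:
    midi.writeFile(midi_file)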