Detailed experiment description
This is a detailed description of the experiment.
For a summary, see the experiment overview.
The code for the entire experiment can be found in this GitHub project.
Data Preparation
Step 1: We convert each chorus to WAV audio using the Python library Pydub. We export the data to a single-channel, 44100-sample-rate WAV file, which we then read with SciPy's wavfile module:
import os
from pydub import AudioSegment
from scipy.io import wavfile
sound = AudioSegment.from_mp3(os.sep.join([AUDIO_FOLDER, audio_file])).set_channels(1)
audio_file_wave = sound.export(format="wav", bitrate=RATE)
sample_rate, samples = wavfile.read(audio_file_wave)
Step 2: Each WAV file is converted to a Mel-scale spectrogram with 128 frequency buckets. The mel scale is a scale of pitches that most listeners judge to be equally spaced from one another, so it reflects the way humans hear pitch better than raw distances in Hz alone. I chose 128 frequency buckets because, on the one hand, it is enough to place different half-tones in different buckets (within the human hearing range), and on the other hand, it is small enough to keep the number of nodes in our machine down. Since the input size is the number of buckets times the chunk size, fewer buckets means less computation for the machine when it is trained.
For this conversion, we use the MelSpectrogram transform from torchaudio (part of the PyTorch ecosystem):
from torchaudio import transforms
spectrogrammer = transforms.MelSpectrogram(sample_rate=RATE, n_fft=(MEL_SPECTROGRAM_BUCKETS * 2 - 2), win_length=MEL_SPECTROGRAM_WINDOW_LENGTH, power=2, normalized=True)
We then apply the transform to each audio file:
import torch
spectrogram = spectrogrammer(torch.from_numpy(samples / (2 ** 15)).float().reshape((1, -1)))  # scale the 16-bit PCM samples to [-1, 1]
Step 3: Each spectrogram is divided into chunks of 196 windows, so that each chunk corresponds to exactly 0.5 seconds of the original audio. I chose 0.5 seconds for ease of calculation, and because it is long enough to perceive a tone, yet short enough to avoid covering too many emotion changes in a single chunk. The spectrogram windows are 224 samples long with a hop size of 112 samples, so consecutive windows overlap by half a window. At 44100 samples per second, 196 non-overlapping windows of 224 samples would cover exactly one second (44100 / 224 = 196); with the 112-sample hop the windows advance only half that far, so the 196 windows of a chunk actually span 0.5 seconds of audio.
In other words, each 128 X 196 spectrogram chunk represents 0.5 seconds of the audio.
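For illustration, the chunking step might look like the sketch below (a simplified version, not necessarily the project's exact code; split_into_chunks is a hypothetical helper and the chunks here do not overlap):
CHUNK_SIZE = 196  # windows per chunk, i.e. 0.5 seconds of audio

def split_into_chunks(spectrogram):
    """Split a (1, 128, num_windows) mel spectrogram into (128, 196) chunks."""
    chunks = []
    num_windows = spectrogram.shape[-1]
    for start in range(0, num_windows - CHUNK_SIZE + 1, CHUNK_SIZE):
        chunks.append(spectrogram[0, :, start:start + CHUNK_SIZE])
    return chunks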
Neural Network Architecture
The architecture of the Neural Network can be divided into 3 sections - the CNN, the LSTM, and the ANN. The input of the machine is a spectrogram chunk of size 128 X 196 (196 frames of 128 frequency buckets).
The data first goes through two CNN tiers, each composed of two convolutional layers and one pooling layer, with a ReLU gate between the layers.
The data then goes through three more CNN tiers, each composed of one convolutional layer, a ReLU gate, one pooling layer, and a Dropout layer with a 0.25 drop ratio.
The data continues into an LSTM layer, and then into 5 Linear layers that gradually reduce the data size to 2. After the first and second Linear layers there is a Dropout layer with a 0.5 drop ratio.
The final activation function is the Identity, since the output is treated as a regression problem.
The code for the machine, in PyTorch format:
import torch
import torch.nn as nn


class AudioLSTMCNN2(nn.Module):
    def __init__(self, out_size: int = 2, cnn_channels: int = 64):
        """
        For spectrograms with 128 buckets and a chunk size of 196 windows, the input size is (128, 196).
        """
        # call the parent constructor
        super(AudioLSTMCNN2, self).__init__()
        # CNN tier 1: two convolutional layers followed by max pooling
        self.conv11 = nn.Conv2d(in_channels=1, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.relu11 = nn.ReLU()
        self.conv12 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu12 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # CNN tier 2: two convolutional layers followed by max pooling
        self.conv21 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu21 = nn.ReLU()
        self.conv22 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels, kernel_size=(3, 3), stride=(1, 1),
                                padding=1)
        self.relu22 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # CNN tiers 3-5: one convolution, max pooling and dropout each
        self.conv3 = nn.Conv2d(in_channels=cnn_channels, out_channels=cnn_channels * 2, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu3 = nn.ReLU()
        self.maxpool3 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout3 = nn.Dropout(p=0.25)
        self.conv4 = nn.Conv2d(in_channels=cnn_channels * 2, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu4 = nn.ReLU()
        self.maxpool4 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout4 = nn.Dropout(p=0.25)
        self.conv5 = nn.Conv2d(in_channels=cnn_channels * 4, out_channels=cnn_channels * 4, kernel_size=(3, 3),
                               stride=(1, 1), padding=1)
        self.relu5 = nn.ReLU()
        self.maxpool5 = nn.MaxPool2d(kernel_size=(3, 3), stride=(3, 3))
        self.dropout5 = nn.Dropout(p=0.25)
        # LSTM over the flattened CNN features
        self.lstm6 = nn.LSTM(cnn_channels * 4, cnn_channels * 4)  # , batch_first=True)
        self.hidden = (torch.zeros(1, 1, cnn_channels * 4),
                       torch.zeros(1, 1, cnn_channels * 4))
        # ANN: five linear layers reducing the size gradually down to out_size
        self.fc6 = nn.Linear(in_features=cnn_channels * 4, out_features=cnn_channels * 2)
        self.dropout6 = nn.Dropout(p=0.5)
        self.fc7 = nn.Linear(in_features=cnn_channels * 2, out_features=cnn_channels)
        self.dropout7 = nn.Dropout(p=0.5)
        self.fc8 = nn.Linear(in_features=cnn_channels, out_features=cnn_channels // 2)
        self.fc9 = nn.Linear(in_features=cnn_channels // 2, out_features=cnn_channels // 4)
        self.fc10 = nn.Linear(in_features=cnn_channels // 4, out_features=out_size)
        self.final = nn.Identity()

    def forward(self, x):
        # treat the spectrogram chunk as a single-channel image: (batch=1, channels=1, 128, 196)
        x = x.reshape((1, 1, x.shape[0], -1))
        x = self.conv11(x)
        x = self.relu11(x)
        x = self.conv12(x)
        x = self.relu12(x)
        x = self.maxpool1(x)
        x = self.conv21(x)
        x = self.relu21(x)
        x = self.conv22(x)
        x = self.relu22(x)
        x = self.maxpool2(x)
        x = self.conv3(x)
        x = self.relu3(x)
        x = self.maxpool3(x)
        x = self.dropout3(x)
        x = self.conv4(x)
        x = self.relu4(x)
        x = self.maxpool4(x)
        x = self.dropout4(x)
        x = self.conv5(x)
        x = self.relu5(x)
        x = self.maxpool5(x)
        x = self.dropout5(x)
        # collapse the spatial dimensions and put the CNN channels last: (1, 1, 256),
        # which the LSTM reads as (seq_len, batch, features)
        x = x.view(x.size(0), x.size(1), -1)
        x = x.permute(0, 2, 1)
        x, self.hidden = self.lstm6(x, self.hidden)
        x = x.view(x.size(0), -1)
        x = self.fc6(x)
        x = self.dropout6(x)
        x = self.fc7(x)
        x = self.dropout7(x)
        x = self.fc8(x)
        x = self.fc9(x)
        x = self.fc10(x)
        final_x = self.final(x.reshape((-1)))
        return final_x
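To sanity-check the architecture, one can push a random chunk-sized tensor through the machine (a usage sketch, not part of the project code):
model = AudioLSTMCNN2()
dummy_chunk = torch.randn(128, 196)  # one spectrogram chunk: 128 buckets X 196 windows
with torch.no_grad():
    prediction = model(dummy_chunk)
print(prediction.shape)  # torch.Size([2]) -- the two output values (valence and arousal)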
The layers of the machine and their input sizes:
Name | Description and parameters | Input size
conv11 | Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 1 X 128 X 196
relu11 | ReLU() | 1 X 64 X 128 X 196
conv12 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 128 X 196
relu12 | ReLU() | 1 X 64 X 128 X 196
maxpool1 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 128 X 196
conv21 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98
relu21 | ReLU() | 1 X 64 X 64 X 98
conv22 | Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 64 X 98
relu22 | ReLU() | 1 X 64 X 64 X 98
maxpool2 | MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False) | 1 X 64 X 64 X 98
conv3 | Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 64 X 32 X 49
relu3 | ReLU() | 1 X 128 X 32 X 49
maxpool3 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 128 X 32 X 49
dropout3 | Dropout(p=0.25, inplace=False) | 1 X 128 X 10 X 16
conv4 | Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 128 X 10 X 16
relu4 | ReLU() | 1 X 256 X 10 X 16
maxpool4 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 10 X 16
dropout4 | Dropout(p=0.25, inplace=False) | 1 X 256 X 3 X 5
conv5 | Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) | 1 X 256 X 3 X 5
relu5 | ReLU() | 1 X 256 X 3 X 5
maxpool5 | MaxPool2d(kernel_size=(3, 3), stride=(3, 3), padding=0, dilation=1, ceil_mode=False) | 1 X 256 X 3 X 5
dropout5 | Dropout(p=0.25, inplace=False) | 1 X 256 X 1 X 1
lstm6 | LSTM(256, 256) | 1 X 1 X 256
fc6 | Linear(in_features=256, out_features=128, bias=True) | 1 X 256
dropout6 | Dropout(p=0.5, inplace=False) | 1 X 128
fc7 | Linear(in_features=128, out_features=64, bias=True) | 1 X 128
dropout7 | Dropout(p=0.5, inplace=False) | 1 X 64
fc8 | Linear(in_features=64, out_features=32, bias=True) | 1 X 64
fc9 | Linear(in_features=32, out_features=16, bias=True) | 1 X 32
fc10 | Linear(in_features=16, out_features=2, bias=True) | 1 X 16
final | Identity() | 1 X 2
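These input sizes can be reproduced with PyTorch forward hooks; the following is a small sketch (not part of the project code) that prints one row per layer for a random chunk:
def report_input_size(name):
    def hook(module, inputs, output):
        # inputs is the tuple of arguments passed to the layer; inputs[0] is the data tensor
        print(f'{name:10} | {module} | input size: {tuple(inputs[0].shape)}')
    return hook

model = AudioLSTMCNN2()
for name, layer in model.named_children():
    layer.register_forward_hook(report_input_size(name))
with torch.no_grad():
    model(torch.randn(128, 196))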
We use MSE to calculate the loss, since it measures the distance between the ground truth and the hypothesis on the two-dimensional Thayer-model axes.
We use the Adam optimizer, starting with a learning rate of 0.0001, which is halved whenever the mean loss has not improved for 3 epochs.
We run the training for 10 epochs, with a batch size of 50 audio files.
The code:
import random
import time

import numpy as np

# trainset, trainset_quadrant_to_keys and get_quadrant are helpers defined elsewhere in the project

model_c = AudioLSTMCNN2().cuda()
LEARNING_RATE = 0.0001
criterion = nn.MSELoss().cuda()
optimizer = torch.optim.Adam(model_c.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3, factor=0.5)

start_time = time.time()
EPOCS = 6
PRINT_MARK = 500
BATCH_SIZE = 50
STOP_LOSS = 0.01
MIN_LEARNING_RATE = 0.000001

model_c.train()
model_c.hidden = (model_c.hidden[0].cuda(), model_c.hidden[1].cuda())
for epoc in range(EPOCS):
    losses = list()
    quadrants = list()
    real_quadrants = list()
    # sample an equal number of audio files from each quadrant, then shuffle their chunks
    train_key_sample = [key for keys in [random.sample(trainset_quadrant_to_keys[i + 1], BATCH_SIZE // 4) for i in range(4)] for key in keys]
    random.shuffle(train_key_sample)
    train_sample = [datum for sample_key in train_key_sample for datum in trainset[sample_key]]
    for batch_i, (X_train, (valence, arousal)) in enumerate(train_sample):
        # detach the LSTM hidden state so gradients do not flow across samples
        model_c.hidden = tuple([each.data for each in model_c.hidden])
        optimizer.zero_grad()
        y_train = torch.Tensor((valence, arousal)).cuda()
        # Apply the model
        y_pred = model_c(X_train.cuda())  # we don't flatten X-train here
        loss = criterion(y_pred, y_train)
        # Update parameters
        loss.backward(retain_graph=True)
        optimizer.step()
        losses.append(loss.cpu().item())
        real_quadrants.append(get_quadrant(y_train[0].item(), y_train[1].item()))
        quadrants.append(get_quadrant(y_pred[0].item(), y_pred[1].item()))
        # Print interim results
        if (batch_i > 0 or epoc == 0) and batch_i % PRINT_MARK == 0:
            print(f'{epoc:2}-{batch_i:4} | loss: {np.mean(losses):.5f} | [{quadrants.count(1):4}({real_quadrants.count(1):4}), {quadrants.count(2):4}({real_quadrants.count(2):4}), {quadrants.count(3):4}({real_quadrants.count(3):4}), {quadrants.count(4):4}({real_quadrants.count(4):4})] lr: {optimizer.param_groups[0]["lr"]}')
    scheduler.step(np.mean(losses))
    if np.mean(losses) < STOP_LOSS or optimizer.param_groups[0]["lr"] < MIN_LEARNING_RATE:
        break

print(f'\nDuration: {time.time() - start_time:.0f} seconds')  # print the time elapsed
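Once trained, the machine can be applied to unseen chunks in evaluation mode. A minimal sketch (X_test stands for a held-out spectrogram chunk and is not defined in the code above):
model_c.eval()
with torch.no_grad():
    valence_pred, arousal_pred = model_c(X_test.cuda())
print(get_quadrant(valence_pred.item(), arousal_pred.item()))  # predicted Thayer quadrant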
Curated Testset
To create the audio test files, we use the midiutil, PyDub and mido libraries. I created an AudioData class that makes it easy to generate audio files from our required parameters (pitch, length, volume and beat).
For example, a script to create 25 seconds of a continuous A note:
one_tone_A = AudioData()
one_tone_A.add_sound([69], 0, 50, 100)
one_tone_A.save_to_wav(os.sep.join(["..", "data", "test_audio", "one_tone_A.wav"]))
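For reference, the MIDI-writing part that a class like AudioData wraps could look roughly like the sketch below, using midiutil (the tempo of 120 BPM and the note length of 50 beats are assumptions made for this sketch; rendering the MIDI file to WAV is not shown):
from midiutil import MIDIFile

midi = MIDIFile(1)  # a single track
midi.addTempo(track=0, time=0, tempo=120)
# MIDI pitch 69 is the note A (A4); 50 beats at 120 BPM last 25 seconds
midi.addNote(track=0, channel=0, pitch=69, time=0, duration=50, volume=100)
with open("one_tone_A.mid", "wb") as midi_file:
    midi.writeFile(midi_file)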