Conditional Neural Sequence Learners for Generating Drums’ Rhythms
Viewing music as a sequence of events with multiple complex dependencies at various levels of a composition, architectures based on Long Short-Term Memory (LSTM) have proven very effective at learning and reproducing musical styles. The “rampant force” of these architectures, however, makes them hard to use in tasks that incorporate human input or, more generally, constraints. One such task is the generation of drums’ rhythms under a given metric structure (potentially combining different time signatures) and with a given instrumentation (e.g. bass and guitar notes).
We present a solution that combines the LSTM sequence learner with a Feed-Forward (FF) part, called the “Conditional Layer”. The LSTM and FF layers are merged into a single layer that makes the final decision about the next drums’ event, given the previous events (LSTM layer) and the current constraints (FF layer). The resulting architecture, called the Conditional Neural Sequence Learner (CNSL), is capable of learning long drum sequences under constraints imposed by:
- Guitar Performance and Polyphony
- Bass Performance
- Tempo Information
- Metrical Structure
- Grouping (Phrasing) Information
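The merge of the two branches can be illustrated with a small, framework-free sketch. It assumes that "merging" means concatenating the LSTM output and the Conditional Layer output before a final softmax layer; the text does not specify the merge mode, and the function names, vector sizes and random weights below are purely illustrative.

```python
import math
import random


def dense(vector, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_and_decide(lstm_out, ff_out, weights, bias):
    """Concatenate both branches (assumed merge mode) and return a
    probability distribution over the possible next drum events."""
    merged = lstm_out + ff_out  # list concatenation
    return softmax(dense(merged, weights, bias))


# Toy sizes, chosen only for illustration (the real layers are much larger).
rng = random.Random(0)
lstm_out = [rng.uniform(-1, 1) for _ in range(4)]  # summary of previous events
ff_out = [rng.uniform(-1, 1) for _ in range(3)]    # summary of current constraints
n_events = 5                                       # hypothetical drum-event vocabulary
weights = [[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(n_events)]
bias = [0.0] * n_events

probs = merge_and_decide(lstm_out, ff_out, weights, bias)
```

Here `probs` holds one probability per candidate drum event, so both the event history and the constraints contribute to the same decision.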
The utilised corpus consists of 105 excerpts of musical pieces by three bands, with an average length of 2.5 minutes each, which were collected manually from web tablature learning sources. During the conversion to MIDI files, pre-processing was applied so that only the drum, bass and a single guitar track were kept. For each song, selected characteristic snippets were marked as different phrases (e.g. verse, refrain). The phrasing selections and the annotations were conducted with the help of students of the Music Department of the Ionian University, Greece. The dataset is available upon request.
The dataset is separated into 3 parts, with 35 pieces from each band, plus 5 synthetic pieces with phrases that include Toms and Cymbals events, added because these events appeared very rarely in the learning data. Band A (ABBA – denoted as AB) follows the “80s Disco” style, Band B (Pink Floyd – denoted as PF) follows “70s Blues and Psychedelic Rock”, and Band C (Porcupine Tree – denoted as PT) has the most contemporary style, featuring “Progressive Rock and Metal”.
The proposed architecture consists of two separate modules for predicting the next drum event. The LSTM module learns sequences of consecutive drum events, while the Feed-Forward module handles musical information regarding guitar, bass, metrical structure, tempo and grouping. This information is the sum of the features of consecutive time-steps of the Feed-Forward Input space within a moving window, giving information about the past, current and future time-steps. The length of the window depends on the memory of the LSTMs (Sequence Length) and equals 2 ∗ Sequence Length + 1 time-steps. Since most of the training examples are in 4/4 Time Signature, the memory of the LSTMs is fixed at 16 time-steps, corresponding to a full bar, so the window spans 33 time-steps. For our experiments we use two variations of the proposed architecture, which differ in how the input data are represented for the LSTMs. The first one, denoted as Arch1, is the same as the one used in our prior work and consists of a single LSTM Layer Block merged with the Feed-Forward Layer into a single softmax output. The second variation, Arch2, consists of 4 input spaces, with 3 LSTM Layer Blocks (each one responsible for different drum inputs) and the Feed-Forward Layer combined in separate Merge Layers, leading to independent softmax outputs.
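The moving-window sum described above can be sketched as follows. This is a minimal illustration, not the original pre-processing code: it assumes the window is centred on the current time-step and is simply truncated at the start and end of a piece, which the text does not state. The function and variable names are hypothetical.

```python
SEQUENCE_LENGTH = 16               # LSTM memory: one 4/4 bar, as stated in the text
WINDOW = 2 * SEQUENCE_LENGTH + 1   # past + current + future time-steps


def conditional_window_sums(features, seq_len=SEQUENCE_LENGTH):
    """Sum the feature vectors inside a centred moving window.

    `features` is a list with one feature vector (list of numbers) per
    time-step. For time-step t the window covers t - seq_len .. t + seq_len;
    near the edges of the piece it is truncated (an assumption).
    """
    n_steps = len(features)
    if n_steps == 0:
        return []
    n_feats = len(features[0])
    summed = []
    for t in range(n_steps):
        start = max(0, t - seq_len)
        stop = min(n_steps, t + seq_len + 1)
        acc = [0.0] * n_feats
        for row in features[start:stop]:
            for k, value in enumerate(row):
                acc[k] += value
        summed.append(acc)
    return summed


# Toy input: 40 time-steps, 2 constant features per step.
toy = [[1.0, 2.0] for _ in range(40)]
sums = conditional_window_sums(toy)
```

In the toy input every interior time-step sees the full window of 33 steps, while the first and last steps only see 17 because half of the window falls outside the piece.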
Each LSTM Block is configured as two stacked LSTM layers with 256 Hidden Units and a dropout of 0.2. The LSTMs attempt to predict the next step in a stochastic manner. The Feed-Forward Layer has 256 Hidden Units and its output is fed into the Merge Layers, along with the output of the LSTM blocks. Finally, the output of each Merge Layer passes through the softmax nonlinearity to generate the probability distribution of the prediction. For optimisation, the Adam algorithm is used, and the loss function is calculated as the mean of the (categorical) cross-entropy between the prediction and the target of each output. For our experiments, we used the TensorFlow deep learning framework and the Keras library.
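As a rough illustration of the loss and of stochastic prediction, the sketch below averages the categorical cross-entropy over several softmax outputs and draws the next event at random from a predicted distribution. It is plain Python rather than the TensorFlow/Keras code used in the experiments; the two outputs, their sizes and all names are hypothetical.

```python
import math
import random


def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))


def mean_loss(probs_per_output, targets_per_output):
    """Mean of the per-output cross-entropies (one term per softmax output)."""
    losses = [categorical_cross_entropy(p, t)
              for p, t in zip(probs_per_output, targets_per_output)]
    return sum(losses) / len(losses)


def sample_event(probs, rng):
    """Draw the index of the next event according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Two hypothetical softmax outputs with different vocabulary sizes.
output_a = [0.7, 0.2, 0.1]
output_b = [0.25, 0.75]
loss = mean_loss([output_a, output_b], [[1, 0, 0], [0, 1]])

rng = random.Random(0)
next_event = sample_event(output_a, rng)
```

Sampling from the distribution, rather than always taking the most probable event, is one common way to obtain stochastic predictions; whether the original system samples in exactly this way is an assumption.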
The proposed CNSL architecture is able to generate sequences that reflect characteristics of the training data; however, one might suspect that the Conditional Layer has no impact, or that it even jeopardises the ability of the LSTM to generate distinguishably different music styles. To this end, networks trained in the style of AB, PF and PT, in both architecture variations Arch1 and Arch2, are employed to compose drums’ sequences for the 9 test tracks (three for each band), which were not included in the training dataset. To validate the effectiveness of the Conditional Layer, drums’ rhythms generated by systems trained in a certain style are compared with rhythms generated for the same tracks by systems trained in different styles, in order to examine the ability of a network to imitate a learnt style. Below you can listen to some snippets.
- AB1 with sparse Time Signature changes (2/4 and 4/4) and high tempo. Generated with Arch2
- AB2 with 4/4 Time Signature and high tempo. Generated with Arch2
- AB3 with 4/4 Time Signature and very high tempo. Generated with Arch1
- PF1 with 4/4 Time Signature and low tempo. Generated with Arch2
- PF2 with 4/4 Time Signature and very low tempo. Generated with Arch1
- PF3 with 4/4 Time Signature and moderate tempo. Generated with Arch2
- PT1 with 7/8 Time Signature and moderate tempo. Generated with Arch1
- PT2 with continuous Time Signature changes (4/4, 3/8, 5/8 and 7/8) and high tempo. Generated with Arch2
- PT3 with sparse Time Signature changes (17/16 and 4/4) and low tempo. Generated with Arch2
In addition, we tried some cross-style combinations, in which a network trained on one band generates rhythms for tracks of another band.
- AB1 in the style of PF. Generated with Arch2
- AB2 in the style of PT. Generated with Arch1
- PF1 in the style of PT. Generated with Arch2
- PF3 in the style of AB. Generated with Arch1
- PT2 in the style of AB. Generated with Arch1
- PT2 in the style of PF. Generated with Arch1
Finally, you can listen to a compilation from an early version of the CNSL Drum Generator, using training data in the style of PT and PF.
Please read and cite our work if you like: