This is the companion website for the ICML 2021 paper Relative Positional Encoding for Transformers with Linear Complexity by Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang and Gaël Richard.



Source code

The source code is released on GitHub.


Pop piano generation

The following examples are generated by models trained for pop piano generation with sequence length 2048. We display outputs up to length 2560 to demonstrate extrapolation. APE refers to the Performer model with absolute positional encoding; the rest use our proposed sinusoidal and convolutional SPE (gated and ungated variants). We can clearly observe that the outputs from APE quickly become incoherent once the training length is exceeded (exactly when this happens depends on the tempo and style of each example). The SPE-based models, on the other hand (especially ungated convolutional SPE), seem to extrapolate well.
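To give an intuition for the random-feature idea behind sinusoidal SPE, here is a simplified NumPy sketch (not the paper's exact parameterization): random phases are shared between the query and key features, so that the expected dot product between features at positions m and n depends only on the lag m − n, i.e. it realizes a relative (stationary) positional kernel. The component frequencies `freqs` and gains `gains` are hypothetical placeholders for learned parameters.

```python
import numpy as np

def sinusoidal_spe(num_pos, freqs, gains, num_realizations, rng):
    """Random sinusoidal positional features (illustrative sketch).

    Returns an array of shape (num_realizations, num_pos, K) such that,
    in expectation over the random phases, the dot product between the
    features at positions m and n approximates
        sum_k gains[k]**2 * cos(2*pi*freqs[k]*(m - n)),
    a function of the relative position m - n only.
    """
    pos = np.arange(num_pos)
    # One random phase per (realization, component), shared across positions.
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(num_realizations, len(freqs)))
    arg = (2.0 * np.pi * freqs[None, None, :] * pos[None, :, None]
           + phases[:, None, :])
    # sqrt(2) makes E[2*cos(a+phi)*cos(b+phi)] = cos(a - b) per component.
    return np.sqrt(2.0) * gains * np.cos(arg)
```

Averaging the feature dot products over many realizations recovers the target relative-position kernel; in the full method these features modulate the Performer's queries and keys, keeping the attention complexity linear.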

Use the following checkbox to control whether the extrapolated part (beyond length 2048) is displayed.


APE

[Audio examples 1–10]

Sinusoidal SPE, gated

[Audio examples 1–10]

Sinusoidal SPE, ungated

[Audio examples 1–10]

Convolutional SPE, gated

[Audio examples 1–10]

Convolutional SPE, ungated

[Audio examples 1–10]

Groove continuation

The following examples are from models trained on the Groove2Groove accompaniment dataset. In each example, we prompt the model with 2 bars of a new accompaniment (unseen during training) and then let it generate a continuation. While the models were trained on segments of length 512 (corresponding to 2–10 bars), we let them generate 1024 tokens to test extrapolation. Again, SPE-based models are clearly superior to APE-based ones in terms of the quality of the generated samples, although the former also slightly degrade over time.

Both SPE-based models use the gated variant.
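The prompt-then-continue procedure described above can be sketched as a generic autoregressive sampling loop (a minimal illustration; `next_token_logits` is a stand-in for a trained model returning logits over the token vocabulary, not the actual models used on this page):

```python
import numpy as np

def continue_sequence(next_token_logits, prompt, total_len, rng):
    """Autoregressively extend `prompt` until it reaches `total_len` tokens.

    `next_token_logits(tokens)` is any callable returning a 1-D array of
    unnormalized logits for the next token (here, a placeholder for a
    trained sequence model).
    """
    tokens = list(prompt)
    while len(tokens) < total_len:
        logits = next_token_logits(tokens)
        # Softmax over the vocabulary, then sample the next token.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens
```

In the experiments above, the prompt is the 2-bar accompaniment (unseen during training) and `total_len` is 1024, i.e. twice the training segment length, to test extrapolation.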