Abstract: Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations or to translate signals from one ...