Abstract: Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations or to translate signals from one ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results
Feedback