I will talk about two perspectives on learning from multilingual multimodal data: as a language generation problem and as a cross-modal retrieval problem. In the language generation problem of multimodal machine translation, I will discuss whether we should learn grounded representations by using the additional visual context as a conditioning input or as a variable that the model learns to predict, and I will highlight some recent arguments about whether models are actually sensitive to the visual context. In the cross-modal retrieval problem of multilingual image–sentence retrieval, I will discuss experiments that highlight the situations in which it is useful to train with multilingual annotations rather than monolingual annotations, and the challenges of learning from disjoint cross-lingual datasets.
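
As a rough illustration of the first perspective, the sketch below contrasts the two grounding strategies mentioned in the abstract: the image feature as a conditioning input to the translation decoder, versus the image feature as an auxiliary prediction target for a text-only translator. This is a minimal sketch, not the speaker's actual models; all module names and dimensions are illustrative, and the mean-squared-error auxiliary loss stands in for whatever prediction objective is used in practice.

```python
# Minimal sketch (illustrative names/sizes, not the speaker's exact models).
import torch
import torch.nn as nn


class ConditioningMMT(nn.Module):
    """Visual context as a conditioning input: the image feature is
    projected and injected into the state that initialises the decoder."""

    def __init__(self, vocab=1000, dim=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt, img_feats):
        _, h = self.encoder(self.embed(src))           # (1, B, dim)
        h = h + self.img_proj(img_feats).unsqueeze(0)  # inject image into decoder init
        dec_out, _ = self.decoder(self.embed(tgt), h)
        return self.out(dec_out)                       # translation logits


class PredictingMMT(nn.Module):
    """Visual context as a predicted variable: translation uses text alone,
    but an auxiliary head must regress the image feature, pushing the shared
    encoder towards grounded representations. The image is only needed at
    training time."""

    def __init__(self, vocab=1000, dim=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)
        self.imagine = nn.Linear(dim, img_dim)         # auxiliary prediction head

    def forward(self, src, tgt, img_feats):
        enc_out, h = self.encoder(self.embed(src))
        dec_out, _ = self.decoder(self.embed(tgt), h)
        logits = self.out(dec_out)
        # Auxiliary loss: predict the image feature from the pooled encoder states
        # (MSE here as a stand-in for the actual objective).
        img_pred = self.imagine(enc_out.mean(dim=1))
        aux_loss = nn.functional.mse_loss(img_pred, img_feats)
        return logits, aux_loss
```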
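
For the second perspective, here is a similarly hedged sketch of image–sentence retrieval training in a joint embedding space with a symmetric contrastive loss. Again, everything here (class names, the InfoNCE-style objective, feature sizes) is an assumption for illustration rather than the experimental setup of the talk. The point it makes concrete: with multilingual annotations, captions in several languages for the same image can share a batch and pull towards the same image vector; with disjoint cross-lingual datasets, each batch is monolingual and the languages are only linked indirectly through the shared image encoder.

```python
# Minimal sketch of joint-space image--sentence retrieval (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointSpaceRetriever(nn.Module):
    def __init__(self, vocab=5000, dim=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)        # shared across languages
        self.sent_enc = nn.GRU(dim, dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, dim)

    def encode_sentence(self, tokens):
        _, h = self.sent_enc(self.embed(tokens))
        return F.normalize(h.squeeze(0), dim=-1)     # unit-norm sentence vectors

    def encode_image(self, img_feats):
        return F.normalize(self.img_proj(img_feats), dim=-1)


def contrastive_loss(img_vecs, sent_vecs, temperature=0.1):
    """Symmetric InfoNCE over a batch of matched (image, sentence) pairs:
    each image must rank its own caption above the others, and vice versa."""
    sims = img_vecs @ sent_vecs.t() / temperature
    targets = torch.arange(sims.size(0))
    return (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2


model = JointSpaceRetriever()
imgs = torch.randn(8, 2048)                # e.g. pooled CNN image features
caps = torch.randint(0, 5000, (8, 12))     # caption token ids in one language
loss = contrastive_loss(model.encode_image(imgs), model.encode_sentence(caps))
```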