In this talk, we discuss the challenges of building FLORES-101, a large-scale dataset for Machine Translation evaluation, from sourcing the content to quality control. We analyze the performance of open-source models on this new benchmark and propose spBLEU, a generalization of BLEU that tokenizes with a multilingual SentencePiece (SPM) model. We also present a vision of how to improve model performance, with an emphasis on low-resource and zero-shot directions, using language-specific capacity and non-English-centric data.
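The core idea behind spBLEU is to compute BLEU over subword pieces from a shared multilingual SentencePiece model instead of over language-specific word tokens, so scores are comparable across languages. The sketch below illustrates that idea only: it uses a toy greedy longest-match tokenizer and a hand-picked vocabulary in place of a trained SentencePiece model, and a simple smoothed sentence-level BLEU rather than the actual spBLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Sentence-level BLEU with brevity penalty and uniform n-gram weights."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp_tokens, n)
        ref_ngrams = ngrams(ref_tokens, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        precisions.append((overlap + 1) / (total + 1))
    if len(hyp_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def toy_subword_tokenize(text, vocab):
    """Greedy longest-match segmentation, a stand-in for a real SPM model."""
    text = "\u2581" + text.replace(" ", "\u2581")  # SentencePiece-style boundary marker
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:  # fall back to single characters
                pieces.append(text[i:j])
                i = j
                break
    return pieces

# Illustrative vocabulary; a real spBLEU setup uses a multilingual SPM vocabulary.
vocab = {"\u2581trans", "lation", "\u2581the", "\u2581of", "\u2581quality"}
hyp = "the quality of translation"
ref = "the translation quality"
score = bleu(toy_subword_tokenize(hyp, vocab), toy_subword_tokenize(ref, vocab))
print(round(score, 3))
```

Because the metric operates on subword pieces, the same tokenization applies to every language, which is what makes spBLEU usable for the 10,000+ translation directions FLORES-101 covers.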