Speech-to-speech translation (S2ST) is an important area of machine translation (MT), and Google is a significant player in it. The company introduced its first S2ST system, Translatotron, in 2019 and an improved version in 2021. Google DeepMind researchers presented a third iteration, Translatotron 3, in a paper published in May 2023.
The preceding version, Translatotron 2, was already highly effective.
The present version is an unsupervised, end-to-end model for direct speech-to-speech translation.
Unlike conventional systems, this model is not trained on paired data in two languages. Instead, it finds consistent patterns and regularities in the data on its own. In the training phase, the model learns from monolingual speech-text datasets in each language, relying on unsupervised cross-lingual embeddings that are mapped into a shared space through self-learning.
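As an illustration of what such self-learning alignment can look like, here is a minimal sketch of iterative Procrustes refinement, a common unsupervised technique for mapping two monolingual embedding spaces into one (in the spirit of MUSE-style methods). This is not necessarily the paper's exact procedure, and the embedding matrices and dimensions are placeholder values.

```python
# A minimal sketch of self-learning cross-lingual embedding alignment via
# iterative Procrustes refinement. The embeddings below are random
# placeholders standing in for learned monolingual vectors.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 64, 500

X = rng.standard_normal((vocab, dim))  # source-language embeddings
Y = rng.standard_normal((vocab, dim))  # target-language embeddings
W = np.eye(dim)                        # mapping into the shared space

for _ in range(5):
    # Step 1: induce a dictionary with no supervision by matching each
    # mapped source vector to its nearest target vector (cosine similarity).
    XW = X @ W
    sims = (XW / np.linalg.norm(XW, axis=1, keepdims=True)) @ \
           (Y / np.linalg.norm(Y, axis=1, keepdims=True)).T
    matches = sims.argmax(axis=1)

    # Step 2: Procrustes update, i.e. the orthogonal W that best maps the
    # source vectors onto their currently matched target vectors.
    U, _, Vt = np.linalg.svd(X.T @ Y[matches])
    W = U @ Vt

# After refinement, X @ W and Y share one embedding space, even though
# no parallel dictionary was ever provided.
```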
Initially, the model learns the structure of each language separately. The learning is then extended to find common ground: the model learns to link and relate the properties of the two languages. This produces cross-lingual embeddings, which initialize a shared encoder that can handle both languages equally well.
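A hedged sketch of how such aligned embeddings might initialize a shared encoder follows; the layer sizes and architecture here are illustrative assumptions, not the configuration described in the paper.

```python
# A minimal sketch of initializing one shared encoder from aligned
# cross-lingual embeddings. All sizes and module choices are hypothetical.
import torch
import torch.nn as nn

vocab, dim = 1000, 64

# Stand-in for the output of the alignment step: vectors for both
# languages already living in a single shared space.
aligned_embeddings = torch.randn(vocab, dim)

class SharedEncoder(nn.Module):
    def __init__(self, aligned: torch.Tensor):
        super().__init__()
        # One embedding table serves both languages; it starts from the
        # aligned cross-lingual vectors and is fine-tuned during training.
        self.embed = nn.Embedding.from_pretrained(aligned, freeze=False)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.embed(token_ids))

encoder = SharedEncoder(aligned_embeddings)
output = encoder(torch.randint(0, vocab, (2, 10)))  # either language
```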
Further improvements in the model are attributed to a masked autoencoder. During encoding, the model is given only part of the data; during decoding, it has to infer the hidden information. The model, in other words, is pushed into a guessing game.
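The idea can be illustrated with a short sketch of a masked-autoencoding objective: hide part of the input and train the model to reconstruct it. The mask ratio, shapes, and architecture below are assumptions for illustration, not the paper's values.

```python
# A minimal sketch of masked autoencoding: corrupt the input by hiding
# random positions, then score the model only on what was hidden.
import torch
import torch.nn as nn

dim, steps = 64, 20
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

features = torch.randn(8, steps, dim)  # stand-in for speech frames
mask = torch.rand(8, steps) < 0.5      # hide roughly half the positions

corrupted = features.clone()
corrupted[mask] = 0.0                  # zero out the masked frames

reconstruction = model(corrupted)
# The loss counts only the hidden positions, so the model must infer
# the missing content from the surrounding context: the "guessing game".
loss = ((reconstruction - features)[mask] ** 2).mean()
loss.backward()
```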
Additionally, the model uses a back-translation technique as a self-check, which helps ensure coherence and accuracy in translation.
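In rough terms, back-translation can be read as a round-trip consistency check: translate from source to target, translate back, and penalize divergence from the original. The two functions below are hypothetical placeholders for the model's two translation directions.

```python
# A minimal sketch of back-translation as a self-check. The translate_*
# functions are placeholders, not real APIs; in practice they would run
# the trained source->target and target->source models.
import torch

def translate_s2t(speech: torch.Tensor) -> torch.Tensor:
    return speech  # placeholder for the source -> target direction

def translate_t2s(speech: torch.Tensor) -> torch.Tensor:
    return speech  # placeholder for the target -> source direction

source = torch.randn(8, 20, 64)            # stand-in for source speech
round_trip = translate_t2s(translate_s2t(source))

# If the round trip drifts away from the original, this loss grows,
# pushing the model toward coherent, self-consistent translations.
reconstruction_loss = torch.mean((round_trip - source) ** 2)
```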
Conventionally, S2ST has used a pipeline of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. Translatotron relies on a different architecture: it maps source-language speech directly to target-language speech, with no reliance on an intermediate text representation. Because there is no cascade of separate stages, errors cannot compound from one stage to the next, making the approach more effective.
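The contrast can be summarized in sketch form. Everything here is a hypothetical placeholder, outlining the two designs rather than any real API.

```python
# A minimal sketch contrasting the conventional cascade with the direct
# approach. All four functions are hypothetical stand-ins.

def asr(audio): return "recognized source text"     # placeholder
def mt(text): return "translated target text"       # placeholder
def tts(text): return b"synthesized target audio"   # placeholder
def direct_model(audio): return b"target speech"    # placeholder

def cascade_s2st(source_audio):
    text = asr(source_audio)    # stage 1: speech -> source text
    translated = mt(text)       # stage 2: source text -> target text
    return tts(translated)      # stage 3: target text -> speech

def direct_s2st(source_audio):
    # One model maps source speech straight to target speech; with no
    # intermediate text, errors cannot compound across stages.
    return direct_model(source_audio)
```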
It also captures, the researchers claim, non-verbal communication (NVC) cues such as pauses and speaking rate.