COLING 2022 (1) — Neural Machine Translation
Flitto is committed to providing highly accurate multilingual data collected through its own professional translation services across a broad range of languages. Such an endeavor is challenging and requires continuous learning in related NLP disciplines. Hence, we attended the recently held 29th International Conference on Computational Linguistics (COLING 2022), which gathered some of the world's brightest minds in computational linguistics, to stay updated on emerging practices.
We spent a few productive days tuning in to well-presented sessions by accomplished speakers. These intriguing papers inspired our team to craft the future of AI-powered translation.
Towards Robust Neural Machine Translation with Iterative Scheduled Data-Switch Training
Zhongjian Miao, Xiang Li, Liyan Kang, Wen Zhang, Chulun Zhou, Yidong Chen, Bin Wang, Min Zhang and Jinsong Su — Xiamen University, Xiaomi AI Lab, Harbin Institute of Technology, Pengcheng Lab
Zhongjian Miao and his fellow researchers from Xiamen University, Xiaomi AI Lab, Harbin Institute of Technology, and Pengcheng Lab examined existing training methods and proposed ways to improve neural machine translation (NMT) implementations.
To kick off the presentation, Zhongjian revisited existing methods for training robust NMT models, which concurrently train the model on both authentic and adversarial examples. According to Zhongjian, such methods may lead to subpar results and may not reflect the subtlety of noise found in real applications.
The researchers confirmed their assumptions by testing several encoder and decoder outputs with the IWSLT’14 dataset for German-to-English translation. They evaluated the model’s performance based on SSR distance and model confidence.
During the study, Zhongjian compared three machine translation models, Transformer, Indis, and Switch, by running them on the same dataset. The Transformer model is a standard fixture in NLP practice and forms the basis of the Indis model, which trains on adversarial and authentic samples simultaneously. Meanwhile, the Switch model accepts only authentic or only adversarial samples at any particular iteration.
Zhongjian's comparison test confirmed their assumptions, as the Switch model delivered better SSR distance and model confidence scores. The researchers also discovered that initiating training on the Switch model with adversarial samples produces a higher BLEU score in the shortest time.
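For readers unfamiliar with the BLEU metric referenced above, the sketch below shows the core idea: clipped n-gram precision combined with a brevity penalty. This is an illustrative single-sentence implementation, not the corpus-level tooling the authors would have used, and the function names are our own.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Illustrative single-sentence BLEU with a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        overlap = sum((hyp_ngrams & ngram_counts(ref, n)).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

In practice, standardized implementations (such as sacreBLEU) are preferred so that scores are comparable across papers.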
Instead of adopting a constant noise ratio, Zhongjian's team proposed scheduling the noise sampling process progressively over time. They tested the refined model against Transformer, Transformer-FT, Transformer-Mixed, and Transformer-Indis. Unsurprisingly, their NMT model proved superior across different test datasets.
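The two ingredients described above, switching between authentic and adversarial batches and progressively increasing the noise ratio, can be sketched as follows. This is our own simplified illustration of the idea, not the authors' algorithm; the linear schedule, the word-replacement noise, and all function names are assumptions.

```python
import random

def noise_ratio(step, total_steps, start=0.1, end=0.5):
    """Illustrative linear schedule: the share of perturbed tokens grows over time."""
    return start + (end - start) * step / total_steps

def make_adversarial(sentence, ratio, vocab):
    """Crude adversarial sample: replace a fraction of tokens with random words."""
    tokens = sentence.split()
    k = max(1, int(len(tokens) * ratio))
    for i in random.sample(range(len(tokens)), k):
        tokens[i] = random.choice(vocab)
    return " ".join(tokens)

def data_switch_batches(corpus, total_steps, vocab, start_with_adversarial=True):
    """Yield batches that alternate between adversarial and authentic data."""
    for step in range(total_steps):
        use_adv = (step % 2 == 0) == start_with_adversarial
        batch = [make_adversarial(s, noise_ratio(step, total_steps), vocab)
                 if use_adv else s for s in corpus]
        yield step, ("adversarial" if use_adv else "authentic"), batch
```

Starting the alternation with adversarial batches mirrors the researchers' observation that this ordering reached a higher BLEU score fastest.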
Informative Language Representation Learning for Massively Multilingual Neural Machine Translation
Renren Jin and Deyi Xiong — Tianjin University
Multilingual translation has evolved to incorporate artificial language tokens, which enable zero-shot translation. Instead of training a dedicated system for every target language with comprehensive datasets, the model receives the language information through the token. This allows translation between language pairs with limited training resources.
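The language-token approach is simple in mechanism: an artificial token naming the target language is prepended to the input, and one shared model serves every direction. The snippet below is a minimal sketch of that convention; the `<2xx>` token format and the function name are illustrative assumptions, not a specific system's API.

```python
def add_target_token(src_sentence, tgt_lang):
    """Prepend an artificial token that tells the model which language to produce."""
    return f"<2{tgt_lang}> {src_sentence}"

# One multilingual model can then serve many directions, including
# zero-shot pairs never observed together during training:
print(add_target_token("Where is the station?", "de"))  # <2de> Where is the station?
```

Because the language signal is just one token, it can weaken as the input passes through the network, which is one root of the off-target problem discussed next.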
In his paper, Tianjin University researcher Renren Jin noted the limitations of the language-token approach. For example, he cited the off-target translation issue, where the multilingual translation model renders certain parts of a sentence in the wrong language. So, Renren and his research advisor, Deyi Xiong, set out to resolve these inconsistencies.
During their study, they found that the target-language information loses its integrity as it traverses the layers of the translation model. To overcome these limitations, they proposed two new approaches: language embedding embodiment (LEE) and language-aware multi-head attention (LAA).
The language embedding embodiment (LEE) approach incorporates the translation information into multiple layers. It aims to strengthen the translation signal as the source data passes through various translation nodes. Meanwhile, the language-aware multi-head attention (LAA) mechanism takes a different approach by continuously introducing a language-specific matrix to improve language representations.
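The LEE idea of re-injecting the language signal at multiple layers can be sketched as below. This is our own toy illustration under stated assumptions: `encoder_layer` stands in for a real Transformer layer, the dimensions are arbitrary, and the exact injection points in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_langs = 8, 3, 4
lang_emb = rng.normal(size=(n_langs, d_model))  # one learned vector per target language

def encoder_layer(x):
    # Stand-in for a real Transformer layer (self-attention + feed-forward).
    return np.tanh(x)

def encode_with_lee(token_states, lang_id):
    """LEE-style sketch: add the target-language embedding before every layer,
    so the language signal is refreshed instead of fading through the stack."""
    x = token_states
    for _ in range(n_layers):
        x = encoder_layer(x + lang_emb[lang_id])
    return x

src = rng.normal(size=(5, d_model))  # five source-token states
out_de = encode_with_lee(src, lang_id=0)
out_fr = encode_with_lee(src, lang_id=1)
```

The same input yields different representations for different target languages, which is exactly the property a single prepended token struggles to maintain deep in the network.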
They tested LEE and LAA against existing multilingual translation models, specifically the vanilla Transformer, language-aware layer normalization (LALN), and language-aware linear transformation (LALT).
LEE generally performs better than the token-based approach in many-to-many translations when evaluated on BLEU score and win rate. The study also showed that token-based translations exhibit inconsistent performance on different datasets, and the same inconsistency appears whether the token is prepended on the source or the target side.
The speaker also observed promising results with the LAA method. For example, embedding LAA into the decoder's self-attention layer resulted in remarkable improvement for both supervised and zero-shot translations. Renren also noted that LAA demonstrated high accuracy for syntax feature inference in linguistic typology prediction.
Categorizing Semantic Representations for Neural Machine Translation
Yongjing Yin, Yafu Li, Fandong Meng, Jie Zhou and Yue Zhang — Zhejiang University, Westlake University
Despite making tremendous progress, neural machine translation is still hampered by notable challenges. Yongjing Yin and fellow researchers devoted their efforts to solving compositional generalization issues in machine translation. Compositional generalization is the NLP model's ability to handle a potentially infinite number of word combinations after learning the meanings of individual words.
According to Yongjing, existing limitations arise partly from the workflow between the encoder and decoder. During translation, the encoder produces sequence-level contextualized representations from the source sentence, which form the basis of the decoder's target sentence. In this setup, the model is trained without disassociating the meanings of individual words from the complete sentence.
Consequently, source input samples are sparse at the sequence level. This affects semantic composition during inference, which often results in out-of-distribution errors. To overcome the problem, Yongjing suggested decoupling token-level information from the source by injecting token-level translation distributions into the source representations.
In other words, Yongjing proposed creating multiple prototypes, or clusters of contextualized representations, for each token. For example, the token 'he' yields prototypes drawn from different semantic contexts, such as 'he chose a toy car' and 'one day, he went to the dealership'. The prototypes are then connected to the encoder, forming the proto-transformer model. Yongjing theorized that such an arrangement creates a rich set of prototypes for each token, effectively reducing the need for excessive sequence-level memorization.
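The prototype-building step described above can be approximated by clustering a token's contextualized vectors. The sketch below uses plain k-means as a stand-in; the paper's actual clustering procedure, dimensions, and function names may differ, so treat this purely as an illustration of the idea.

```python
import numpy as np

def token_prototypes(context_vectors, k, iters=20, seed=0):
    """Cluster one token's contextualized vectors into k prototypes (plain k-means)."""
    rng = np.random.default_rng(seed)
    centers = context_vectors[rng.choice(len(context_vectors), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each contextual occurrence to its nearest prototype.
        dists = np.linalg.norm(context_vectors[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = context_vectors[labels == j].mean(axis=0)
    return centers, labels

# e.g. 100 contextual vectors of the token 'he', grouped into 4 prototypes
vecs = np.random.default_rng(1).normal(size=(100, 16))
protos, assign = token_prototypes(vecs, k=4)
```

Each resulting center summarizes one way the token is used in context, giving the encoder a compact set of semantic variants instead of forcing it to memorize whole sequences.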
The author holds a firm conviction in the proto-transformer model and demonstrated it in the subsequent experiment, testing the model against other benchmarks and evaluating the resulting BLEU score and compound translation error rate (CTER).
Yongjing advised against using too many prototypes, as doing so leads to overfitting. The proto-transformer also proves effective in a single-pass configuration, where it achieves a marginally higher BLEU score than the conventional Transformer model. On this note, the speaker cautioned that the BLEU score may not always be an accurate indicator, as it does not reflect the semantic distortion present in some translations.
We have covered the insights on 'Neural Machine Translation' from COLING 2022. We will be back soon with the topic 'Translation Aids'. Please follow our blog and stay tuned for upcoming topics.