Course Glossaries

2. Machine Translation: Early Modern and Modern History

  • Machine Translation (MT): The process of automatically translating text or speech from one language to another using computer algorithms.

  • Rule-Based Machine Translation (RBMT): An MT approach that uses linguistic rules and bilingual dictionaries to translate text, focusing on syntax, morphology, and grammar.

  • Statistical Machine Translation (SMT): An MT approach that uses statistical models based on bilingual text corpora to predict the probability of a translation.

  • Neural Machine Translation (NMT): An advanced MT approach that uses deep learning models, specifically neural networks, to translate text by analyzing large datasets and capturing context.

  • Example-Based Machine Translation (EBMT): An MT approach that relies on a database of previously translated examples, finding the closest matches to translate new sentences.

  • Hybrid Machine Translation: A combination of different MT approaches, often integrating RBMT and SMT/NMT to leverage the strengths of each method.

  • Bilingual Text Corpora: Large collections of text in two languages, used to train and evaluate MT systems by providing parallel examples of translations.

  • Parallel Corpora: A type of bilingual corpus where texts in two languages are aligned at the sentence level, facilitating the training of SMT and NMT systems.

  • Phrase-Based Machine Translation: A specific type of SMT that breaks down text into phrases rather than individual words, improving the fluency of translations.

  • Sequence-to-Sequence (Seq2Seq) Model: A deep learning model used in NMT that processes sequences of text to generate translations, maintaining the order and context of words.

  • Translation Model: In SMT, a model that predicts the most likely translation of a word or phrase based on bilingual text data.

  • Language Model: A model that assesses the fluency of the translated text by predicting the likelihood of word sequences in the target language.
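
    A minimal sketch of how a language model scores fluency, here a bigram model with add-one smoothing trained on a hypothetical toy corpus (real systems train neural models on millions of sentences):

```python
import math
from collections import Counter

# Hypothetical toy target-language corpus, tokenized into words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_logprob(sentence):
    """Score a candidate translation: sum of log P(word | previous word)."""
    words = sentence.split()
    vocab = len(unigrams)
    score = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing gives unseen bigrams a small nonzero probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        score += math.log(p)
    return score

# A fluent word order scores higher than a scrambled one.
print(bigram_logprob("the cat sat on the mat"))
print(bigram_logprob("cat the on sat mat the"))
```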

  • Decoding Algorithm: In MT, the procedure that searches the space of candidate translations and selects the best hypothesis based on the probabilities generated by the translation and language models.

  • Reordering Model: A component in SMT that predicts the correct word order in the target language, addressing differences in syntax between languages.

  • Neural Networks: Computational models inspired by the human brain, used in NMT to learn patterns and relationships in language data.

  • Attention Mechanism: A technique in NMT that allows the model to focus on specific parts of the input sentence, improving translation accuracy, especially for long sentences.
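
    The core computation can be sketched as dot-product attention over toy 2-dimensional encoder states (the vectors here are illustrative; real models use learned, high-dimensional states):

```python
import math

def softmax(xs):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Dot-product attention: weight each source position by its
    similarity to the decoder query, then mix the value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical encoder states for a 3-word source sentence.
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]  # decoder state "looking for" the first source word

weights, context = attention(query, keys, values)
print(weights)  # highest weight lands on the positions most similar to the query
```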

  • Encoder-Decoder Architecture: A framework used in NMT where the encoder processes the input text and the decoder generates the translation, often using an attention mechanism.

  • BLEU Score (Bilingual Evaluation Understudy): A metric for evaluating the quality of machine-generated translations by comparing them to one or more reference translations.
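
    A minimal single-reference BLEU sketch: the geometric mean of modified n-gram precisions times a brevity penalty. Production scorers such as sacreBLEU add smoothing, standardized tokenization, and multi-reference support; this version is only for illustration:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal BLEU: modified n-gram precision (counts clipped by the
    reference) combined over n = 1..max_n, with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # "Modified" precision: clip candidate counts by reference counts.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages translations shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat is on the mat"
print(bleu("the cat is on the mat", ref))        # 1.0 for an exact match
print(bleu("the cat is on the mat today", ref))  # lower for a near match
```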

  • Pre-trained Models: In NMT, models that have been trained on large datasets and can be fine-tuned for specific tasks or languages, speeding up the development process.

  • Transfer Learning: The practice of applying knowledge gained from one task (e.g., translating English to French) to another related task (e.g., translating English to Spanish), commonly used in NMT.

  • Back-Translation: A method in NMT training where monolingual target-language data is machine-translated into the source language, creating synthetic parallel data that improves translation quality.

  • Subword Units: Smaller language components, such as prefixes or suffixes, used in NMT to handle rare or compound words more effectively.
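
    A toy sketch of byte-pair encoding (BPE), the most common way subword units are learned: repeatedly merge the most frequent adjacent symbol pair. The four-word "corpus" is hypothetical; real tokenizers such as SentencePiece train on large corpora and handle whitespace markers:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: start from characters, repeatedly fuse the
    most frequent adjacent pair across the (weighted) vocabulary."""
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# "low", "lower", "lowest" share the subword "low", which BPE discovers.
merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```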

  • Tokenization: The process of breaking down text into smaller units, such as words or subwords, to facilitate processing in MT systems.
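
    A minimal word-level tokenizer using only the standard library; the regular expression is illustrative and not taken from any particular MT toolkit (production systems typically use subword tokenizers instead):

```python
import re

def tokenize(text):
    """Naive tokenization: runs of word characters, or single
    punctuation marks, become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Don't translate this, please."))
# → ['Don', "'", 't', 'translate', 'this', ',', 'please', '.']
```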

  • Alignment: The process of matching corresponding words or phrases between the source and target languages in a parallel corpus, crucial for training SMT and NMT systems.

  • Word Embeddings: Dense vector representations of words used in NMT to capture semantic meanings and relationships between words in different languages.
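
    Relationships between embeddings are usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and are learned from data, sometimes in a shared cross-lingual space:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors:
    near 1.0 = similar meaning, near 0.0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings in a shared English-Spanish space.
embeddings = {
    "cat":       [0.9, 0.1, 0.0],
    "gato":      [0.8, 0.2, 0.1],  # Spanish "cat", nearby in the space
    "democracy": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["gato"]))       # high
print(cosine_similarity(embeddings["cat"], embeddings["democracy"]))  # low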

  • Domain Adaptation: The process of fine-tuning an MT system to perform better in a specific domain, such as legal or medical translation.

  • Cross-Lingual Transfer: The ability of an MT system to apply knowledge from one language pair to another, enhancing translation quality across multiple languages.

  • Multilingual Translation: An NMT approach that handles multiple languages simultaneously, using a shared model that can translate between any pair of supported languages.

  • Low-Resource Languages: Languages that have limited digital resources, such as corpora or dictionaries, posing challenges for MT development.

  • Out-of-Vocabulary (OOV) Words: Words that are not present in the training data of an MT system, often leading to translation errors.

  • Post-Editing: The process of manually correcting errors in machine-generated translations to improve accuracy and fluency.

  • Syntactic Parsing: The process of analyzing the grammatical structure of sentences, used in RBMT to generate accurate translations.

  • Morphological Analysis: The analysis of the structure of words and their components, such as roots and affixes, used in RBMT to handle morphologically rich (highly inflected) languages.

  • Lexical Disambiguation: The process of determining the correct meaning of a word that has multiple possible interpretations, crucial in MT for accurate translations.

  • Semantic Role Labeling: Identifying the roles played by words in a sentence, such as agent or object, to improve the accuracy of MT systems.

  • Language Pair: The combination of a source language and a target language in MT, such as English to Spanish.

  • Pivot Language: An intermediate language used in MT when direct translation between two languages is difficult due to lack of resources.

  • Contextual Embeddings: Word embeddings that take into account the context in which a word appears, improving translation quality in NMT.

  • Data Augmentation: The process of artificially increasing the size of a training dataset by creating variations of existing data, used to improve MT performance.

  • Beam Search: A decoding algorithm used in NMT that considers multiple translation hypotheses simultaneously to find the most probable translation.
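
    A sketch of beam search over hypothetical per-step word distributions (a stand-in for a real NMT decoder's output): keep the `beam_size` highest-scoring partial translations at each step, rather than only the single best (greedy) or all of them (exhaustive):

```python
import math

def beam_search(step_probs, beam_size=2):
    """Expand each surviving hypothesis with every candidate token,
    score by summed log-probability, and prune to the top beam_size."""
    beams = [([], 0.0)]  # (tokens so far, summed log-probability)
    for probs in step_probs:
        candidates = []
        for tokens, score in beams:
            for token, p in probs.items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Prune: keep only the beam_size best partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Hypothetical per-step distributions over target words.
steps = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.5, "dog": 0.4, "fish": 0.1},
    {"sat": 0.7, "ran": 0.3},
]
print(beam_search(steps))  # → ['the', 'cat', 'sat']
```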

  • Dropout: A regularization technique in NMT that prevents overfitting by randomly dropping units in the neural network during training.

  • Parallel Sentence Mining: The process of automatically finding and extracting parallel sentences from large bilingual corpora, used to improve the training of MT systems.

  • Translationese: A term for the distinct linguistic patterns that emerge in translated text, whether human or machine-generated, often detectable by statistical analysis.

  • Interactive MT: An MT approach where human translators interact with the MT system during the translation process, refining the output in real-time.

  • Corpus-Based MT: An approach that relies heavily on large text corpora for training MT systems, typical in SMT and NMT.

  • Phrase Table: In SMT, a table that lists possible translations for phrases in the source language along with their probabilities.

  • Cross-Entropy Loss: A loss function used in NMT training to measure the difference between the predicted translation and the actual translation.
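
    For a single decoding step, cross-entropy reduces to the negative log of the probability the model assigned to the correct target word; training averages this over all words. The vocabulary and probabilities below are hypothetical:

```python
import math

def cross_entropy(predicted_dist, target_word):
    """Loss for one step: -log P(correct word). Confident, correct
    predictions give a loss near 0; low probability on the correct
    word gives a large loss."""
    return -math.log(predicted_dist[target_word])

# Hypothetical model output distribution over a tiny target vocabulary.
dist = {"gato": 0.7, "perro": 0.2, "casa": 0.1}

print(cross_entropy(dist, "gato"))   # low loss: model was confident and right
print(cross_entropy(dist, "perro"))  # higher loss: correct word got low probability
```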

  • Neural Language Model: A type of language model used in NMT that predicts the next word in a sentence based on the context of previous words.

  • Knowledge Distillation: A technique in NMT where a smaller, simpler model is trained to replicate the behavior of a larger, more complex model, improving efficiency.

  • Transfer-Based MT: An MT approach that transfers linguistic structures from the source language to the target language, relying on syntactic and semantic transfer rules.