Digital Humanities and languages for specific purposes
3. Terminology in modern multidisciplinary paradigms and applied fields
3.2. Corpus and Computational Linguistics
Corpus and computational linguistics involve analysing large bodies of text (corpora) to study the frequency, usage patterns, and contexts of terms. This helps in understanding how language is used in real-life settings, which is essential for accurate term translation and usage.
Corpus linguistics studies language through samples of real-world text and has made significant contributions to terminological studies and to the development of terminology management tools. Here is an overview of these contributions:
Term Extraction and Identification:
Corpus linguistics enables the automatic extraction of potential terms from large volumes of text. This is particularly useful for identifying specialized vocabulary in specific fields or domains (a short extraction sketch follows this overview).
Understanding Contextual Usage:
By analysing how terms are used in context, corpus linguistics helps in understanding the nuanced meanings of terms, their connotations, and the circumstances of their use.
Terminology Standardization:
Corpora can provide evidence for preferred or more frequent usage of terms within a community, aiding in the process of standardizing terminology across a field.
Terminology Database Development:
Insights gained from corpus analysis are used to develop and enrich terminological databases, ensuring that they are up-to-date and reflect current usage.
Multilingual Terminology Work:
Multilingual corpora allow for the comparison of terms across languages, aiding in the process of translation and the creation of bilingual or multilingual terminological resources.
Development of Glossaries and Dictionaries:
Corpus linguistics provides empirical data for the creation of specialized glossaries and dictionaries, which are essential tools in terminology management.
Language Variation and Change:
Analysing corpora over time helps in tracking changes in language and terminology, which is crucial for keeping terminological resources relevant and accurate.
Semantic Analysis:
The study of corpora helps in understanding the relationships between terms and concepts (semantic fields), which is important for organizing and structuring terminological knowledge.
Quality Control in Translation:
Corpora are used to ensure the consistency and accuracy of terminology in translation work, which is a key aspect of quality control in this field.
Training and Education:
Corpus-based studies are used in training translators, interpreters, and terminology managers, providing them with insights into practical, real-world language use.
Customized Corpus Development:
Specific corpora can be developed for particular fields or projects, providing tailored resources for terminological analysis and management.
Supporting Natural Language Processing (NLP) Applications:
Corpus linguistics contributes to the development of NLP applications, including automated term recognition and extraction tools, which are increasingly important in terminology management. More generally, computational linguistics provides a wide range of tools that support terminological work. For example, we can compare and contrast the collocations of a term, as represented in national corpora, and find corresponding terminological expressions in another language (see Picture 1).
Picture 1. Collocations of ‘chemical’ and ‘химический’ from Russian and American corpora.
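To make the term extraction and collocation points above more concrete, here is a minimal sketch of frequency- and collocation-based candidate term extraction with the NLTK library. The tiny sample text is invented purely for illustration; a real workflow would run over a large specialized corpus.

import re

import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("stopwords", quiet=True)

# A toy domain "corpus"; in practice this would be a large collection of specialized texts.
corpus = (
    "The chemical reaction rate depends on the chemical composition of the reagents. "
    "Chemical engineers monitor the reaction rate and the chemical composition in real time."
)

# Simple tokenization; a real pipeline would use a proper tokenizer.
tokens = re.findall(r"[a-z]+", corpus.lower())
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop_words]

# Single-word term candidates: the most frequent content words.
print(nltk.FreqDist(content_tokens).most_common(5))

# Multi-word term candidates: bigram collocations ranked by likelihood ratio.
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_word_filter(lambda w: w in stop_words)
print(finder.nbest(measures.likelihood_ratio, 5))

In a real terminology project, the frequency and collocation lists would then be reviewed by a terminologist before the candidates enter a termbase.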
The data collected within the corpus-based approach are widely used in another branch of DH, namely Natural Language Processing (NLP). NLP is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and respond to human language in a valuable way. It combines computational linguistics (rule-based modelling of human language) with statistical, machine learning, and deep learning models. These technologies enable computers to process human language in the form of text or voice data and to 'understand' its full meaning, complete with the speaker's or writer's intent and sentiment.
NLP has a wide range of applications, including text translation, sentiment analysis, speech recognition, chatbots, search engines, text summarization, and much more.
Let us take sentiment analysis as an example and examine how the NLP process works.
Sentiment Analysis, also known as opinion mining, is a field within Natural Language Processing (NLP) that focuses on identifying and categorizing opinions expressed in text, especially to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral. This technique is widely used for understanding customer sentiments in reviews, social media posts, and other textual content. Here's a more detailed overview:
How Sentiment Analysis Works:
Text Processing: Involves cleaning and preparing text data for analysis, including tasks like tokenization (breaking text into words or phrases), removing stop words (common words that don't contribute to the meaning), and stemming or lemmatization (reducing words to their base form).
Feature Extraction: Transforming processed text into a format that machine learning algorithms can understand, often using techniques like bag-of-words or word embeddings.
Sentiment Classification: Applying machine learning or deep learning algorithms to classify the sentiment of the text. This can be binary (positive or negative), ternary (positive, negative, neutral), or on a scale (very positive, somewhat positive, neutral, somewhat negative, very negative). A minimal sketch of these steps follows this list.
Context and Tone Understanding: Advanced sentiment analysis involves understanding context and tone, which can be challenging as it requires the algorithm to recognize things like sarcasm, irony, or subtlety.
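Here is a minimal sketch of the first three steps, assuming the scikit-learn library is available; the tiny labelled training set below is invented purely for illustration, and a real system would be trained on thousands of annotated examples.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data (invented for illustration).
train_texts = [
    "I love this product, it works perfectly",
    "Excellent service and very helpful staff",
    "Terrible quality, completely disappointed",
    "The worst experience I have ever had",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Text processing and feature extraction: lowercasing, tokenization, stop-word
# removal, and bag-of-words counts are handled by CountVectorizer.
# Sentiment classification: a logistic regression classifier on those features.
model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

print(model.predict(["helpful and excellent", "disappointed with the quality"]))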
Applications:
Business and Marketing: Analysing customer feedback, reviews, and social media posts to gauge public opinion about products or services.
Politics: Monitoring public opinion on political issues, campaigns, or politicians.
Finance: Sentiment analysis of news articles, reports, or social media to predict stock market trends.
Healthcare: Analysing patient feedback, responses, and reviews about treatments or healthcare services.
Customer Service: Automating responses and prioritizing customer queries based on sentiment.
Tools and Technologies:
Python Libraries: NLTK, TextBlob, spaCy, and scikit-learn offer tools for sentiment analysis (see the short NLTK example after this list).
APIs and Platforms: Google Cloud Natural Language, IBM Watson, and Amazon Comprehend provide sentiment analysis as part of their NLP services.
Deep Learning Frameworks: TensorFlow and PyTorch for building custom sentiment analysis models using neural networks.
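As a quick illustration of the libraries listed above, here is a short sketch using NLTK's VADER analyzer; the example sentences are invented for illustration.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
for text in ["The new interface is fantastic!", "The update broke everything."]:
    # polarity_scores returns neg/neu/pos proportions and a compound score in [-1, 1].
    print(text, sia.polarity_scores(text))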
Challenges:
Sarcasm and Irony: Detecting sarcasm and irony in text is a significant challenge as it often requires understanding subtle cues and context.
Domain-Specific Language: Sentiment indicators can vary greatly across different domains, making it necessary to tailor models to specific areas.
Multilingual Analysis: Analysing sentiment in languages other than English, especially those with less NLP resource support, can be complex.
There are several open-source or open-access NLP tools available, which are widely used in academia and industry for various NLP tasks:
Natural Language Toolkit (NLTK):
A popular Python library providing easy-to-use interfaces to over 50 corpora and lexical resources, along with text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
spaCy:
An industrial-strength, Python-based NLP library that emphasizes efficiency and accuracy. spaCy is designed specifically for production use and offers many pre-built models for various languages (a brief usage sketch follows this list of tools).
Stanford NLP:
A suite of NLP tools provided by Stanford University. It includes software for part-of-speech tagging, named entity recognition (NER), neural dependency parsing, and much more, and is often used in academic research.
Apache OpenNLP:
A machine learning-based toolkit for processing natural language text, supporting common NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
Gensim:
A Python library for topic modelling and document similarity analysis. It is particularly known for its implementation of the Word2Vec model.
BERT and Transformers (by Hugging Face):
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model known for its effectiveness in a wide range of NLP tasks. Hugging Face provides a library of pre-trained transformers including BERT and others, which are highly influential in modern NLP.
Tesseract OCR:
An optical character recognition engine, useful for extracting text from images and converting it into editable text formats.
FastText:
Developed by Facebook, FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers, with support for several languages.
CoreNLP:
Developed by Stanford University, it provides a set of natural language analysis tools that can identify the base forms of words, their parts of speech, and whether they are names of companies, people, etc.; normalize dates, times, and numeric quantities; and mark up the structure of sentences in terms of phrases and syntactic dependencies.
AllenNLP:
An open-source NLP research library, built on PyTorch, designed for high-quality and efficient research in deep learning-based NLP.
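As an illustration of how such tools are used in practice, here is a brief sketch with spaCy showing tokenization, lemmatization, part-of-speech tagging, and named entity recognition. It assumes the small English model has been installed separately (python -m spacy download en_core_web_sm); the example sentence is invented.

import spacy

# Load the small English pipeline (installed separately, see above).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Stanford University released new NLP tools in 2023.")

# Token-level analysis: surface form, lemma, and part of speech.
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entities recognized in the sentence.
print([(ent.text, ent.label_) for ent in doc.ents])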
In summary, corpus linguistics significantly enhances terminological studies and management by providing empirical data about language use, aiding in term identification and standardization, enriching terminological resources, and supporting the development of tools and applications for effective terminology management.