Words in NLP

Words are the basic building blocks of natural language texts in Natural Language Processing (NLP). They are often regarded as the smallest units of language, since a word alone can form a complete utterance. What exactly qualifies as a “word”, however, is a surprisingly complicated and sometimes imprecise question.
In NLP it is critical to distinguish between tokens and word types, or lemmas. A word type or lemma is the base form or dictionary entry of a word, such as “horse”, whereas a token is the word’s surface form as it appears in running text. A token can consist of several words, several tokens can stand for a single word, and different tokens may sometimes represent the same underlying word. Tokenisation, often approximated by splitting on whitespace and punctuation, must handle difficult cases such as contractions (“don’t”) or multiword expressions (“New York”).
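As a minimal sketch of the splitting approach described above, the following Python snippet tokenises on whitespace and punctuation while keeping contractions intact; the regex and the `tokenise` function are illustrative choices, not a standard API.

```python
import re

def tokenise(text):
    # Match runs of word characters (allowing internal apostrophes, so that
    # contractions like "don't" stay together), or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(tokenise("Don't read the newspaper in New York!"))
# ["Don't", 'read', 'the', 'newspaper', 'in', 'New', 'York', '!']
```

Note that “New York” still comes out as two tokens: multiword expressions need extra handling beyond simple splitting.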
Morphology is the study of how words are formed and what they are made of. Morphological processes systematically relate word forms, for example plural formation (“dog-s” from “dog”). Understanding morphology is crucial because language is generative: texts routinely contain new word forms that are morphologically related to existing words. A fundamental task of lexical analysis is to match morphological variants to their lemma, typically with the help of a lemma dictionary.
Word-Based NLP Activities

Many NLP activities revolve around words, which are employed in a variety of representations:
- As discrete objects or symbols. Features in NLP problems are frequently expressed as counts or indicators based on the presence or frequency of a word.
- Bag-of-words representations, frequently employed in sentiment analysis and text categorisation, represent a text as a vector of word frequencies.
- N-grams are sequences of n consecutive words. They capture local word dependencies and are a fundamental technique in NLP, best known for their role in language models. They also provide powerful feature combinations that capture structures such as “New York” or “not good”; a bag of bigrams can sometimes be more effective than a bag of words, and N-grams of characters are also used as features (see the sketch after this list). Simple N-grams likewise appear in part-of-speech tagging and word prediction.
- Word embeddings are vector representations of words that capture syntactic and semantic information. They aim to reflect the distributional hypothesis: the meaning of a word lies in the distribution of contexts in which it is used.
- Word-context matrices measure the relationship between a word and its contexts: each row denotes a word and each column a linguistic context.
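The following sketch illustrates the bag-of-words and N-gram representations mentioned above using only the Python standard library; the toy sentence is made up for illustration.

```python
from collections import Counter

def bag_of_words(tokens):
    # A bag of words is simply a count of how often each word occurs.
    return Counter(tokens)

def bigrams(tokens):
    # N-grams for n = 2: pairs of consecutive tokens.
    return list(zip(tokens, tokens[1:]))

tokens = "the movie was not good but the acting was good".split()
print(bag_of_words(tokens))      # {'the': 2, 'was': 2, 'good': 2, ...}
print(Counter(bigrams(tokens)))  # a "bag of bigrams"
# The bigram ('not', 'good') captures a negation that the plain
# bag of words loses.
```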
Certain NLP tasks directly require processing and understanding words:
- Part-of-Speech (POS) tagging automatically classifies the words in a text into lexical categories (such as noun, verb, and adjective). Word classes are also known as parts of speech. The POS tag assigned to a word is determined by the word itself and its context.
- Word Sense Disambiguation (WSD) determines which sense or meaning of a word is being used in a given context. It addresses the problem of lexical ambiguity that arises when a word has several meanings.
- Lexical acquisition seeks to learn the syntactic and semantic characteristics of words from text corpora. This includes learning semantic classifications, selectional preferences, and collocations (word combinations with specific meanings). Multiword expressions are common phrases that may be listed in a dictionary but have unique characteristics.
- Word prediction predicts the next word in a sequence, modelling certain syntactic information (see the bigram sketch after this list).
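As a sketch of word prediction, the snippet below builds bigram counts from a toy corpus and predicts the most frequent follower of a given word; the corpus and function names are illustrative only.

```python
from collections import Counter, defaultdict

corpus = [
    "will you read the newspaper",
    "will you read it",
    "you read the book",
]

# Count how often each word follows each preceding word (bigram counts).
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent continuation seen in the corpus, if any.
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("you"))   # 'read'
print(predict_next("read"))  # 'the'
```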
Words are also the fundamental building blocks of higher-level NLP analysis, such as syntactic parsing (determining the grammatical structure of sentences) and semantic analysis (extracting meaning from sentences and texts). Understanding words is essential for applications such as text summarisation, machine translation, and information retrieval.
Despite their apparent simplicity, “words” are handled with considerable nuance in NLP: their internal structure (morphology), base forms (lemmas), surface forms (tokens), and various representations (counts, N-grams, embeddings) are all taken into account to capture their properties and relationships across a variety of tasks.
Words and Their Constituent Parts
Word Tokens versus Word Types and Lexemes
A word token is a string as it appears in running text. For instance, in “Will you read the newspaper? Will you read it?” the word “Will” appears twice, resulting in two “Will” tokens. Tokenisation divides text into discrete units, usually by splitting on punctuation and whitespace.
A word type, sometimes referred to as a lexeme, is a more abstract entity that serves as a cover term for a collection of related strings. A lexeme is regarded as a fundamental unit of meaning, and a language’s lexicon is its collection of lexemes. For example, the verb DELIVER is the lexeme that stands for the set of forms {deliver, delivers, delivering, delivered}. Likewise, the surface form “horses” has the word type or lemma “horse” as its root. Word forms are the specific forms, such as “sung,” “carpets,” or “sing.”
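A small sketch of the token/type distinction, reusing the example sentence above; the punctuation handling here is deliberately simplistic.

```python
from collections import Counter

text = "Will you read the newspaper? Will you read it?"
tokens = [t.strip("?.,!") for t in text.split()]  # crude tokenisation

print(len(tokens))       # 9 word tokens
print(len(set(tokens)))  # 6 word types: Will, you, read, the, newspaper, it
print(Counter(tokens))   # each type with its token count
```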
Lemma and Lemmatisation
A lexeme is identified by its lemma, which is its citation form. For instance, “delivers” has the lemma DELIVER, and “horses” has the lemma “horse.” The lemma usually captures the word’s invariant syntactic and semantic information.
Lemmatisation, which maps morphological variants to their lemma, is a fundamental task of lexical analysis. NLP systems use it for a number of purposes, for example to access lexical semantics through a lemma dictionary.
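A minimal sketch of dictionary-based lemmatisation, assuming a hand-built lemma dictionary; real systems use large lexicons plus rules or models for unseen forms.

```python
# Toy lemma dictionary mapping surface forms to lemmas (illustrative only).
LEMMA_DICT = {
    "delivers": "deliver",
    "delivering": "deliver",
    "delivered": "deliver",
    "horses": "horse",
}

def lemmatise(token):
    # Fall back to the lowercased token itself when the form is unknown.
    return LEMMA_DICT.get(token.lower(), token.lower())

print(lemmatise("Delivers"))  # 'deliver'
print(lemmatise("horses"))    # 'horse'
print(lemmatise("horse"))     # 'horse' (already a lemma, or simply unknown)
```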
Internal Organization: Morphemes and Morphology
Words are made up of morphemes, the smallest meaningful components of words. These elements are also referred to as segments or morphs. The morphemes “play” and “-ed” combine to form “played”; “unfriendly” consists of “un-,” “friend,” and “-ly”; and “cats” is made up of “cat” and “-s.”
The branch of linguistics known as morphology studies the internal structure of words. Analysing the structure of a word is called morphological parsing.
A word’s stem is its main constituent; the morphemes that are joined to it are called affixes. Affixes can be either suffixes (after the stem) or prefixes (before the stem). Additionally, some languages contain circumfixes (around the stem) or infixes (inside the stem).
Two types of morphology exist: derivational, in which affixes alter a word’s grammatical category or meaning (grace -> graceful), and inflectional, in which affixes add grammatical features like tense or number (play -> played).
Computational morphology is concerned with dissecting words to extract information that is useful for subsequent processing steps. Breaking words down into their constituent elements is more efficient than cataloguing every word form, and morphological processing also makes it possible to handle words that are not found in a dictionary.
Finite-state techniques, in particular finite-state transducers (FSTs), are frequently employed for morphological processing because they are computationally efficient and can map between lemmas (carrying morphosyntactic information) and surface forms.
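The hand-written function below only mimics, on a tiny lexicon, the lemma-to-surface mapping that a real FST performs; the feature labels (+N, +SG, +PL) and stem list are illustrative, and an actual system would compile such rules with an FST toolkit.

```python
STEMS = {"cat", "dog", "horse"}  # toy lexicon

def analyse(word):
    # Map a surface noun form to a lemma plus morphosyntactic features.
    if word in STEMS:
        return word + "+N+SG"
    if word.endswith("s") and word[:-1] in STEMS:
        return word[:-1] + "+N+PL"
    return None  # form not covered by this tiny lexicon

print(analyse("horse"))   # 'horse+N+SG'
print(analyse("horses"))  # 'horse+N+PL'
```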
Stemming is a coarser procedure than lemmatisation: it uses heuristics to map words to shorter strings so that inflected forms map to the same stem (picture, pictures, pictured -> pictur). A stemmed word is not always a legitimate word.
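For comparison, a quick stemming example using NLTK’s Porter stemmer (assuming `nltk` is installed); the exact output strings depend on the Porter rules.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["picture", "pictures", "pictured"]:
    print(word, "->", stemmer.stem(word))
# picture -> pictur, pictures -> pictur, pictured -> pictur
```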
Word Senses and Meaning
The study of individual word meanings is known as lexical semantics.
Word senses are the different meanings that a single word (lexeme or lemma) can have; this phenomenon is called polysemy. For instance, “bank” can refer to a riverbank or to a financial institution, and “horse” contributes different meanings in terms like “horse-fly” and “Trojan horse.”
Word Sense Disambiguation (WSD) is the process of determining which meaning of an ambiguous word is intended. Contextual information, such as the surrounding words or the topic of the document, is essential for WSD.
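One classic way of using contextual information is the Lesk algorithm; the bare-bones version below picks the sense whose definition shares the most words with the context. The sense inventory here is made up purely for illustration.

```python
def simplified_lesk(context, sense_definitions):
    # Choose the sense whose definition overlaps most with the context words.
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context_words & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
print(simplified_lesk("she sat on the bank of the river fishing", senses))
# 'bank/river'
```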
Sense relations that link word meanings include synonymy, antonymy, hyponymy, hypernymy, and meronymy.
WordNet, a widely used lexical resource, encodes these sense relations and organises words into synsets, or synonym sets. When the number of fine-grained senses becomes excessive, it can be helpful to group word senses into broad categories known as supersenses.
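WordNet can be queried through NLTK, assuming the WordNet data has been downloaded (`nltk.download('wordnet')`); the senses printed depend on the WordNet version.

```python
from nltk.corpus import wordnet as wn

# List a few senses (synsets) of "bank" with their glosses.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Sense relations are available too, e.g. hypernyms of the first sense.
print(wn.synsets("bank")[0].hypernyms())
```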
Multiword Expressions Go Beyond Single Words
Not all meaning can be carried by single words. Multiword expressions (MWEs) are expressions made up of several words whose lexical, syntactic, semantic, pragmatic, or statistical properties are not predictable from their parts. Examples include “kick the bucket” and “by the way”. Such expressions must be included in the mental lexicon in order to be understood.
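One practical way to keep MWEs together is NLTK’s MWETokenizer, which merges listed word sequences into single tokens; the expression list below is illustrative.

```python
from nltk.tokenize import MWETokenizer

mwe_tokeniser = MWETokenizer([("New", "York"), ("by", "the", "way")], separator="_")
tokens = "by the way I flew to New York".split()
print(mwe_tokeniser.tokenize(tokens))
# ['by_the_way', 'I', 'flew', 'to', 'New_York']
```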
Computational Processing Using Words
Lexical analysis, which separates the input stream into words and sentences, is frequently the first stage in processing natural language text. It scans the character stream and groups characters into meaningful units, dividing the text into paragraphs, sentences, and words.
One of the most important steps in text analysis is assigning each word token a grammatical category, or part of speech (POS). These categories, such as noun, verb, and adjective, describe the syntactic function of words. Linguists generally accept noun, verb, and adjective as a universal minimal set of parts of speech. POS tags also carry morphosyntactic information.
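A quick POS-tagging example with NLTK’s default tagger (assuming `nltk.download('averaged_perceptron_tagger')` has been run); the tags shown are indicative and may differ slightly between tagger versions.

```python
import nltk

tokens = "They refuse to permit us to obtain the refuse permit".split()
print(nltk.pos_tag(tokens))
# Indicative output: 'refuse' is tagged as a verb (VBP) the first time and
# as a noun (NN) the second time, showing how context decides the tag.
```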
Word embeddings are vectors in a continuous space that capture syntactic and semantic information as well as the relationships between words. They are frequently used to represent words in contemporary NLP. Models such as Word2Vec, GloVe, and BERT produce these embeddings; BERT’s contextual embeddings represent a word’s meaning in its textual context.
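A toy sketch of training word embeddings with gensim’s Word2Vec (assuming `gensim` is installed); with a corpus this small the vectors and similarities are meaningless, but the API usage is representative.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "horse", "runs", "fast"],
    ["the", "dog", "runs", "fast"],
    ["the", "horse", "eats", "hay"],
    ["the", "dog", "eats", "meat"],
]
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=1)

print(model.wv["horse"].shape)              # (20,) - one dense vector per word
print(model.wv.similarity("horse", "dog"))  # cosine similarity of two embeddings
```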
In NLP, word forms, lemmas, stems, prefixes, suffixes, and POS tags are all important features for building machine learning models. Because words behave differently depending on context, contextual information such as the surrounding words and their POS tags is also important.