What Is Stemming? Lemmatization vs Stemming in NLP

What is stemming in NLP? Learn how it reduces words to their root forms to improve text analysis and search performance, and how it compares with lemmatization.

What is Stemming?

Stemming is a text processing task in information retrieval (IR) and natural language processing (NLP). It is the process of reducing words to their root, or essential component: a stemming technique finds a word’s stem and uses it in place of the actual word. Stemming is a rudimentary kind of morphological analysis, and it is often characterized as a simpler, cruder approach than lemmatization. A crucial feature is that the generated stem may be neither the linguistic stem nor a complete English word.

For example, “fish,” “fishes,” and “fishing” all reduce to “fish.” “Help” is the root of “helping” and “helper.” The stem “laugh-” covers “laughing,” “laugh,” and “laughed.” Likewise, “celebrate,” “celebrates,” “celebrated,” and “celebrating” are normalised to a single root that may not be a meaningful word on its own (the Porter stemmer, for instance, produces “celebr”).

Goals and Motivation

Stemming primarily aims to conflate word variants into the same stem or root by mapping morphologically complex strings onto an invariant stem. Processing can then concentrate on a word’s core meaning rather than the details of its usage.

Stemming is a typical morphological normalisation technique in information retrieval: it reduces morphologically complex forms to a canonical form. The idea is to improve the success rate of matching documents to a query by grouping terms with the same root under the same stem (or indexing term). Words with the same stem are assumed to refer to the same concept and should therefore be indexed under the same form. Search engines benefit from generalising across related words such as “whale,” “whales,” “whalers,” and “whaling.” Stemming is also used as a preprocessing step for generating key terms automatically.
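To make the indexing idea concrete, here is a minimal sketch of an inverted index keyed by stems. The stemmer (toy_stem), the three sample documents, and the helper names are all assumptions made up for illustration; a real system would use an established stemmer and proper tokenization.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical light stemmer: strip one common suffix, keep >= 3 letters."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain a word with it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query_word):
    """Look up a single query word by its stem."""
    return sorted(index.get(toy_stem(query_word), set()))

# Made-up sample documents.
docs = {
    1: "Whalers hunted the whale",
    2: "Whaling was banned",
    3: "Whales migrate south",
}
index = build_index(docs)
print(search(index, "whale"))  # [1, 2, 3]
```

Because all four variants share the stem “whal” under this toy stemmer, a query for “whale” retrieves every document, whereas exact matching would find only the first.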

Algorithms and Methods

Stemming is usually done with rule-based algorithms that remove suffixes, typically at the character level. In the simplest case, regular expressions are enough to strip suffixes.
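A regular-expression stripper of this kind might look like the sketch below. The suffix list and the requirement that at least three characters remain are assumptions chosen for illustration, not part of any standard algorithm.

```python
import re

# Illustrative assumption: strip only a few inflectional suffixes, and only
# when at least three word characters precede them.
SUFFIX_RE = re.compile(r"(?<=\w{3})(ing|ed|es|s)$")

def strip_suffix(word):
    """Remove one trailing inflectional suffix, if the pattern matches."""
    return SUFFIX_RE.sub("", word.lower())

for w in ["fish", "fishes", "fishing", "laughed", "sing"]:
    print(w, "->", strip_suffix(w))
# fish -> fish
# fishes -> fish
# fishing -> fish
# laughed -> laugh
# sing -> sing
```

The length guard is what keeps “sing” from being cut down to “s”; without some condition of this sort, pure pattern matching strips far too much.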

The following are some stemming algorithms:

Suffix Stripping: “Light” suffix stripping removes inflectional suffixes such as “-ed,” “-s,” and “-ing.” More aggressive methods also remove derivational suffixes like “-ment,” “-ably,” and “-ship.”
Rule-Based: Algorithms apply rules that are often subject to qualitative or quantitative conditions. For example, remove “-ize” only if the remaining stem does not end in “e,” or remove “-ing” only if the remaining stem has more than three characters. Ad hoc spelling-correction rules may be added to repair the result. Two examples of rules from the Porter stemmer are ATIONAL -> ATE and ING -> ε, the latter applying only if the stem contains a vowel.
Common Stemmers: Two well-known algorithms are the Lovins and Porter stemmers. Lovins is based on approximately 260 suffixes, whereas Porter’s approach looks for about 60. The Porter stemmer is popular and does not need a vocabulary (dictionary).
Corpus-Based Approaches: Xu and Croft (1998) propose a corpus-based strategy for developing stemming techniques that takes into account co-occurrence data and language use.
The Porter stemmer is one of the built-in stemmers offered by libraries like NLTK.
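The conditional rules described above can be sketched as a small rule table. This is only an illustration of the rule-plus-condition idea, not the Porter algorithm itself: the three rules, the simplified vowel test, and the names RULES and apply_rules are assumptions.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    """Simplified condition: the remaining stem has at least one a/e/i/o/u."""
    return any(ch in VOWELS for ch in stem)

# Each rule: (suffix, replacement, condition on the stem left after removal).
# Illustrative subset only; the real Porter stemmer has many more rules and
# uses a more careful notion of vowels and stem "measure".
RULES = [
    ("ational", "ate", contains_vowel),
    ("ing", "", contains_vowel),
    ("sses", "ss", lambda stem: True),
]

def apply_rules(word):
    """Apply the first rule whose suffix matches and whose condition holds."""
    word = word.lower()
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):
                return stem + replacement
    return word

for w in ["relational", "motoring", "sing", "caresses"]:
    print(w, "->", apply_rules(w))
# relational -> relate
# motoring -> motor
# sing -> sing
# caresses -> caress
```

The vowel condition is what stops “sing” from losing its “-ing”: the remaining stem “s” has no vowel, so the rule does not fire.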

Lemmatization vs Stemming in NLP

Lemmatization, another type of morphological normalisation, is frequently compared with stemming.

Stemming extracts a root that may be only a fragment and not always a legitimate word. For example, “leaves” stems to “leav” and “discovery” stems to “discoveri.”
Lemmatization, like stemming, reduces words to a base form, but it attempts to identify the lemma or lexeme. It consults a vocabulary, and frequently the part of speech, to return a complete English word that is genuine and makes sense on its own. For instance, “leaves” lemmatizes to “leaf,” and “am,” “are,” and “is” all share the lemma “be.” Lemmatization can therefore give better results than stemming. The term is often used interchangeably with morphological analysis.
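The contrast can be sketched with two toy functions: a suffix stripper that knows nothing about vocabulary, and a lookup-based lemmatizer. Both are hypothetical stand-ins; the tiny LEMMAS dictionary is an assumption that takes the place of the full lexicon and part-of-speech information a real lemmatizer would use.

```python
# Hypothetical miniature lexicon mapping inflected forms to lemmas.
LEMMAS = {
    "leaves": "leaf",
    "am": "be",
    "are": "be",
    "is": "be",
    "cars": "car",
}

def toy_lemmatize(word):
    """Dictionary lookup: return a real word, or the input if it is unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

def crude_stem(word):
    """Character-level stripping with no vocabulary: may return a fragment."""
    word = word.lower()
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["leaves", "cars", "is", "are"]:
    print(w, "| stem:", crude_stem(w), "| lemma:", toy_lemmatize(w))
# leaves | stem: leav | lemma: leaf
# cars | stem: car | lemma: car
# is | stem: is | lemma: be
# are | stem: are | lemma: be
```

The stripper agrees with the lemmatizer on regular forms such as “cars,” but it leaves a fragment for “leaves” and cannot relate “is” and “are,” which only a vocabulary can do.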

Put differently, stemming deals with matching “car” to “cars,” while lemmatization is described as also covering matches such as “car” to “automobile,” because lemmatization implies disambiguation at the lexeme level.

Difficulties and Limitations

Stemming lacks a clear definition, and character-based algorithms are inherently imprecise. This causes a number of problems:

Errors: Because stemming processes disregard word meanings, they frequently make mistakes. These mistakes fall into two kinds:
- Over-stemming is conflating words that should not be connected or that have different meanings. Examples include “general” becoming “gener,” “gallery” and “gall” both becoming “gall-,” and “organisation” becoming “organ.” Over-stemming can hurt search results.
- Under-stemming is failing to conflate words that belong together; Porter’s stemmer, for example, does not conflate “create” and “creation.”
Information Loss: Collapsing word variants under one stem can lose significant information. Searching for “operating systems” works better with the inflected terms than with just “operat-” and “system.”
Unintelligible Stems: Users may not understand the truncated stems. They may not see why “business” is stemmed to “busy” or why a document containing “busy” is returned. Showing “gallery” as “gall-” can likewise be misleading.
Language Dependence: Lemmatization and stemming are language-specific. Stemming is considerably harder for morphologically rich languages such as Finnish, where adding suffixes often alters the stem itself and produces irregular forms.
Proper Nouns: If the system can recognise proper nouns such as “Collins” or “Hawking,” it should preferably not stem them.
Phrase Searching: It becomes difficult or impossible to look for phrases that contain shortened words once stemming has been applied.
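Over-stemming and under-stemming can be detected mechanically when a hand-labelled grouping of words is available. The sketch below assumes such a grouping (the gold dictionary) and a deliberately crude stemmer; both are made up for illustration, and the helper conflation_errors is a hypothetical name.

```python
from itertools import combinations

def crude_stem(word):
    """Hypothetical stemmer: strip one suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in ("ous", "al", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-labelled assumption: each word is mapped to the concept it belongs to.
gold = {
    "general": "general",
    "generous": "generous",
    "create": "create",
    "creation": "create",
}

def conflation_errors(gold, stem):
    """Return (over_stemmed_pairs, under_stemmed_pairs) for a stemmer."""
    over, under = [], []
    for a, b in combinations(sorted(gold), 2):
        same_concept = gold[a] == gold[b]
        same_stem = stem(a) == stem(b)
        if same_stem and not same_concept:
            over.append((a, b))    # merged, but should be kept apart
        elif same_concept and not same_stem:
            under.append((a, b))   # kept apart, but should be merged
    return over, under

over, under = conflation_errors(gold, crude_stem)
print("over-stemmed:", over)    # [('general', 'generous')]
print("under-stemmed:", under)  # [('create', 'creation')]
```

With this toy stemmer, “general” and “generous” collapse to “gener” (over-stemming), while “create” becomes “creat” and “creation” is left untouched (under-stemming).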

Application in NLP and IR

Stemming is a common component of IR systems. Nevertheless, a great deal of empirical research in IR has shown that, averaged over queries, stemming does not consistently improve the performance of traditional IR systems. It helps for some queries but hurts for others. From the standpoint of linguistic intuition this is a surprising result; information loss is one of the causes.

This IR finding may not carry over to every application of statistical natural language processing; elsewhere, morphological analysis can be more helpful. Stemming may show little benefit in a non-interactive evaluation of IR systems, yet principled morphological analysis can be useful in an interactive IR setting. It has been proposed that letting users influence stemming interactively could help, especially for ambiguous cases like “saw” (verb vs. tool) or for derivational cases. This in turn highlights how hard automatic stemming is when little context is available.

Stemming algorithms are typically designed for generic text, but they can be customised for particular document collections or domains. In most other NLP applications, morphological analysis (lemmatization) is a prerequisite for parsing, whereas IR uses stemming in place of full morphological analysis.
