Multi-word expressions (MWEs) are lexical units made up of two or more words that function as a single unit and whose meaning goes beyond the sum of the individual words. In other words, their properties cannot be fully predicted from their constituent words.

What is an MWE?
- Lexical elements made up of several lexemes are known as multiword expressions.
- At the lexical, syntactic, semantic, pragmatic, and/or statistical levels, they exhibit idiomaticity.
- In languages like English, MWEs are often composed of several whitespace-delimited words, such as “marketing manager.” In languages like German, however, highly productive compound nouns that are spelled as a single word, such as Kontaktlinse (contact lens), may also be categorized as MWEs. In non-segmented languages like Chinese and Japanese, the distinction depends on decomposability into separate lexemes rather than on whitespace.
- “Multiword unit,” “multiword lexical item,” “phraseological unit,” and “fixed expression” are terms that are frequently used interchangeably with “multiword expression.”
- A closely related concept is collocation, often described as an arbitrary and recurrent word combination. Collocations are commonly distinguished from idioms and other non-compositional phrases: they are usually regarded as a proper subset of MWEs and mainly exhibit statistical idiomaticity rather than syntactic or strong semantic idiomaticity.
- The term “idiom” might apply to any multiword item, or it can refer more specifically to semantically idiomatic MWEs.
- The main goal of terminology research is to find and categorize technical terms, which may be MWEs or simple lexemes and are frequently domain-specific.
Characteristics of MWEs
Linguistic Properties
Idiomaticity is a defining characteristic of MWEs.
- Lexical Idiomaticity: Occurs when an MWE includes terms that do not occur independently in the conventional lexicon, such as “cranberry” in cranberry sauce.
- Syntactic Idiomaticity: The syntax of the MWE is not derived directly from that of its components. For instance, by and large is mostly adverbial yet is made up of a preposition (by) and an adjective (large).
- Semantic Idiomaticity: The meaning of the MWE cannot be inferred from the conventional meanings of its constituent parts. For example, “in a nutshell” means “summed up briefly.” MWEs that are used more widely may be perceived as less semantically idiomatic.
- Pragmatic Idiomaticity: The MWE is linked to a specific context or a set of circumstances. Examples include “good morning” and “all aboard.”
- Statistical Idiomaticity: Refers to a word combination’s lexical affinity or relative frequency. This may show up as a markedly high frequency for a particular combination or, conversely, a markedly low one. For example, immaculate has a strong lexical association with performance, whereas spotless has a negative association with credentials.
- MWEs may have further characteristics beyond idiomaticity, although these are neither necessary nor sufficient for classification as an MWE:
- Single-word paraphrasability: While many MWEs cannot be paraphrased with a single word, others can (for example, leave out = omit).
- Proverbiality: The capacity of an MWE to describe a recurring situation. Relatedly, MWEs such as verb-particle constructions and idioms typically imply informality.
- Prosody: Particularly when component words do not evenly contribute to the overall meaning (e.g., soft spot), MWEs may exhibit diverse stress patterns.
Types of MWEs
Several main categories of MWEs are distinguished:
- Nominal MWEs: One of the most prevalent kinds.
- Noun Compounds (NCs): A combination of two or more nouns (golf club, for example). The head is the rightmost noun. In Germanic languages such as German, these are frequently written as a single compound word. NCs with three or more words are syntactically ambiguous and require bracketing (e.g., glass window cleaning).
- Nominal Compounds: A more general class in which modifiers can also be adjectives (e.g., open secret) or verbs, typically participles (e.g., connecting flight).
- Complex Nominals: In Romance languages, the nouns are joined by a preposition or marker, as in Italian succo di limone (“lemon juice”).
Verbal MWEs
- Phrasal Verbs: A verb and a particle, frequently with a non-compositional meaning. The particle may be inseparable or separable (take the paper in, for example).
- Verb Particle Constructions (VPCs): A verb together with a particle. The particle often receives its own POS tag.
- Prepositional Verbs (PVs): A verb followed by an obligatory preposition, whose object can often be passivised (e.g., refer to, grow on). They are difficult to tell apart from VPCs and from simple verb-preposition combinations.
- Light-Verb Constructions (LVCs): Consist of a semantically “light” verb and a noun complement, usually indefinite singular (e.g., take a walk). They can frequently be paraphrased using the verbal form of the noun complement, and they occur in a variety of languages. Also known as support verb constructions or verb-complement pairs.
- Verb–Noun Idiomatic Combinations (VNICs): A verb and a noun in direct object position that are at least semantically idiomatic (e.g., kick the bucket).
Prepositional MWEs
- Determinerless PPs (PP-Ds): Prepositional phrases that lack a determiner and exhibit syntactic or semantic markedness. They may be extremely productive or subject to stringent restrictions.
- Complex Prepositions: Phrases that serve as prepositions (e.g., in addition to, on top of). They may be fixed or semi-fixed.
Grouping
A high-level classification distinguishes lexicalized phrases from institutionalized phrases.
Lexicalized phrases
Exhibit lexical, syntactic, semantic, or pragmatic idiomaticity.
- Fixed expressions: Fixed strings (e.g., ad hoc) that allow neither internal modification nor morphosyntactic variation.
- Semi-fixed expressions: Have strict constraints on word order and composition but permit limited lexical variation, such as inflection or pronoun/determiner selection (e.g., non-decomposable VNICs, nominal MWEs like attorney general).
- Syntactically flexible expressions: Undergo syntactic variation (e.g., decomposable VNICs, LVCs, and VPCs).
Institutionalized phrases
Display only statistical idiomaticity; that is, they are collocations, like “salt and pepper.”
Importance and Challenges in NLP
MWEs are pervasive and essential to language, marking register and contributing to fluency. According to estimates, they are about as numerous as simple words in a speaker’s vocabulary, and new MWEs are constantly being created. MWEs pose serious difficulties for NLP systems:
Machine Translation (MT): MWEs have been used extensively to capture subtle effects. Understanding MWEs is essential for word alignment, and phrase-based SMT systems use phrases (which may be MWEs) as their basic translation units. MWEs may lack direct translation equivalents in other languages.
Parsing: MWEs complicate syntactic analysis. Explicit MWE information simplifies the analysis of sentence structure, whereas its absence can lead to parsing errors. Grammars can be lexicalized to handle MWEs. Tokenization and morphological analysis face similar issues with multi-part words and non-segmented languages.
Information Retrieval (IR): Multiword terms and compounds can serve as indexing features. Although indexing on single words is common, accuracy can be improved by using multiword phrases. Collocations are also important for IR.
Word Sense Disambiguation (WSD): MWEs such as idioms and specialized collocations pose related problems for WSD. How accurately MWEs are identified affects the accuracy of semantic tagging. The “one sense per collocation” hypothesis can be exploited. One specific WSD task is the identification and sense annotation of MWEs.
Information Extraction (IE): An important first step is recognizing proper names and multiword units. This stage identifies proper names and fixed phrases using dedicated micro-grammars.
Sentiment Analysis: Idioms and opinion expressions are useful features. Multiword units can be represented using word embeddings, although purely distributional approaches may be effective only for non-compositional units. Hybrid distributional-compositional techniques may be needed to represent longer multiword spans.
Processing MWEs
Identification and Extraction: Techniques include POS taggers, chunkers, and parsers that encode MWE information. Deep parsers with lexical entries for MWEs can distinguish MWE usages from literal ones. Automated term extraction methods such as the C-value/NC-value approach can also be used. Statistical measures such as pointwise mutual information (PMI) are frequently used in collocation extraction.
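As a rough illustration of PMI-based collocation extraction, the sketch below scores adjacent word pairs with PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The function name bigram_pmi, the min_count cutoff, and the toy corpus are assumptions made for this example; a real system would use a much larger corpus and usually some smoothing or frequency filtering.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates are unreliable for rare pairs
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[w1] / n_uni   # marginal probability of the first word
        p_y = unigrams[w2] / n_uni   # marginal probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, invented purely for illustration.
corpus = (
    "the marketing manager met the sales manager . "
    "the marketing manager left . the sales team met the marketing team ."
).split()

scores = bigram_pmi(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this tiny corpus the highest-scoring pair is ("marketing", "manager"), but the margins are small. PMI is known to overrate rare pairs, which is why a frequency cutoff such as min_count is commonly applied.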
Interpretation: Understanding what MWEs mean is broadly useful. Techniques exist for automatic interpretation, especially of noun compounds. Nominal MWEs with three or more terms are syntactically ambiguous and need to be bracketed.
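To make the bracketing problem concrete, the following sketch enumerates every binary bracketing of a compound. It only lists the candidate structures; choosing among them (for example with corpus statistics) is a separate step that is not shown. The function name bracketings is an assumption for this example.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as nested tuples."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, then combine each left tree with each right tree.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results

for tree in bracketings(["glass", "window", "cleaning"]):
    print(tree)
```

A three-word compound has two candidate bracketings, corresponding to a left-branching reading ((glass window) cleaning) and a right-branching reading (glass (window cleaning)). The number of candidates grows quickly with length: a four-word compound already has five.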
Representation: Research is ongoing into representing multiword units in vector space models and word embeddings. Non-compositional phrases can be treated as single units and handled with purely distributional methods. For longer spans, hybrid strategies that combine compositional and distributional techniques are being investigated. Hybrid distributed-symbolic representations can combine robustness and expressiveness.
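One very simple way to picture a hybrid strategy is sketched below: phrases listed as units get their own vector (the distributional route), and all other phrases are composed by averaging their word vectors (the compositional route). The function phrase_vector, the averaging rule, and the two-dimensional toy vectors are all assumptions for illustration; they do not correspond to any particular published model.

```python
def phrase_vector(words, word_vecs, unit_vecs):
    """Use a dedicated vector for listed units; otherwise average the word vectors."""
    key = " ".join(words)
    if key in unit_vecs:
        return unit_vecs[key]  # distributional: the phrase is treated as one unit
    dims = len(next(iter(word_vecs.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(word_vecs[word]):
            total[i] += value
    return [t / len(words) for t in total]  # compositional: mean of the components

# Hypothetical toy vectors, not learned from any corpus.
word_vecs = {
    "lemon": [1.0, 0.0],
    "juice": [0.0, 1.0],
    "kick": [1.0, 1.0],
    "the": [0.0, 0.0],
    "bucket": [1.0, 0.0],
}
unit_vecs = {"kick the bucket": [-1.0, 0.0]}

print(phrase_vector(["lemon", "juice"], word_vecs, unit_vecs))
print(phrase_vector(["kick", "the", "bucket"], word_vecs, unit_vecs))
```

Here the compositional “lemon juice” is built from its parts, while the idiomatic “kick the bucket” bypasses composition entirely. The sketch raises a KeyError for words missing from word_vecs; a fuller version would need an out-of-vocabulary policy.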
Normalization: Tokenization must decide whether to treat phrases like 76 cents per share as one token or several, and differences in hyphenation or representation must be normalized ($3.9 to $4 million vs. 3.9 to 4 million dollars).
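For units that are already known, the “one token or several” decision can be sketched as a greedy longest-match lookup against a lexicon. The function merge_mwes, the underscore joining convention, and the small lexicon are assumptions for this example. The sketch does not attempt to separate idiomatic from literal uses, nor to normalize numeric expressions.

```python
def merge_mwes(tokens, lexicon, max_len=4):
    """Greedily merge the longest lexicon match at each position into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word candidates.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            merged.append(tokens[i])  # no multiword match starts here
            i += 1
    return merged

# Hypothetical lexicon built from examples used in this article.
lexicon = {("in", "addition", "to"), ("on", "top", "of"), ("attorney", "general")}
tokens = "in addition to the attorney general".split()
print(merge_mwes(tokens, lexicon))
```

Greedy longest-match is only one possible policy. It works for fixed expressions, but semi-fixed and syntactically flexible MWEs, which inflect or allow intervening words, would need a more flexible matcher.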
In short, MWEs are complex linguistic phenomena that exhibit several kinds of idiomaticity. They pose substantial obstacles, and offer rich information, for NLP applications such as sentiment analysis, machine translation, and parsing.