Context-Free Grammars in NLP (Natural Language Processing)

Context-free grammars (CFGs) are a fundamental formal framework in NLP (Natural Language Processing), used especially to model the syntactic structure of sentences.

Introduction to Context-Free Grammars in NLP

The formalism is comparable to Backus-Naur Form (BNF). Since Chomsky first proposed it in 1956, CFG has been the most widely used grammar formalism for describing language syntax; most other grammar formalisms are either related to or derived from it. CFGs are regarded as the “backbone” of many formal models of natural language grammar. The name “context-free” comes from the fact that applying a production rule depends only on its left-hand side, not on its context (neighbouring or preceding symbols).

What is a Context-Free Grammar?

A Context-Free Grammar (CFG) is a popular formal approach for modelling the constituent structure of natural language. These grammars are also known as phrase-structure grammars, although the term “phrase-structure grammar” is occasionally used more broadly for any grammar with at least context-free power.


Formal Definition

The formal definition of a CFG is a 4-tuple G = (N, Σ, R, S).

N (also written V_N) is a finite set of non-terminal symbols, sometimes called variables or categories. Non-terminals are usually written in capital letters and express abstractions over the terminal symbols. Σ (also written T) is a finite set of terminal symbols; terminals represent the language’s words and are usually written in lowercase. N and Σ are disjoint sets.

R (also written P) is a finite set of production rules. Every rule has the form A → α, where A is a single non-terminal symbol (the left-hand side or LHS, sometimes called the mother) and α is a sequence of symbols from (Σ ∪ N)* (the right-hand side or RHS, the daughters). Formal definitions permit an empty RHS (ε), but empty rules are frequently omitted in practice to streamline parsing algorithms, in which case the RHS must contain at least one symbol.

  • S, a member of N, is a designated start symbol. Traditionally written S, it is typically interpreted as the “sentence” node.
  • The set V = N ∪ Σ contains all of the grammar’s symbols.
  • The rewriting relation (⇒) is defined by: αBγ ⇒ αβγ if and only if B → β is a rule in R.
  • A derivation is a sequence of rule expansions leading from the start symbol S to a surface string w ∈ Σ*; w is called the yield.
  • The string language L(G) generated by the grammar G is the set of strings of terminal symbols that can be derived from the designated start symbol S. Grammatical sentences are those that can be derived by the grammar, i.e. those in the formal language it defines.
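The 4-tuple and the notion of a derivation can be made concrete with a small sketch. The toy grammar below (sets N, Sigma, rules R, and start symbol S, all illustrative names chosen here, not from any library) performs one leftmost derivation by always picking the first alternative for each non-terminal:

```python
# Illustrative CFG as a 4-tuple G = (N, Sigma, R, S); all names are toy choices.
N = {"S", "NP", "VP", "Det", "N", "V"}           # non-terminals
Sigma = {"the", "dog", "cat", "chased"}          # terminals (the words)
R = {                                            # production rules A -> alpha
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N")],
    "VP":  [("V", "NP")],
    "Det": [("the",)],
    "N":   [("dog",), ("cat",)],
    "V":   [("chased",)],
}
S = "S"                                          # designated start symbol

def leftmost_derivation(symbols):
    """Repeatedly rewrite the leftmost non-terminal (always taking the first
    alternative) until only terminals remain; the result is the yield."""
    symbols = list(symbols)
    while any(s in N for s in symbols):
        i = next(idx for idx, s in enumerate(symbols) if s in N)
        symbols[i:i + 1] = R[symbols[i]][0]
    return symbols

print(leftmost_derivation([S]))  # ['the', 'dog', 'chased', 'the', 'dog']
```

Each step applies the rewriting relation αBγ ⇒ αβγ once; choosing different alternatives for N would produce other strings in L(G).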


The Role and Goal of Context-Free Grammars

  • CFGs can be thought of either as a device for generating sentences or as a device for assigning a structure to a given sentence.
  • They offer a succinct description of a potentially infinite set of sentences.
  • They serve as a formal model defining whether a sentence can be assigned a particular constituent or dependency structure.
  • In the generative grammar framework, a language is conceived of as the vast set of all grammatical sentences, and a grammar is a formal notation for “generating” the members of this set. The set of sentences “generated” by the grammar defines the language.
  • They are frequently used as a foundational formalism when describing parsing algorithms.
  • Syntactic parsing is the task of assigning a syntactic structure to a sentence, i.e. mapping a string of words to its parse tree. A parser processes input sentences according to the grammar’s productions and builds structural analyses that conform to the grammar. A grammar is a declarative statement of well-formedness, whereas a parser is its procedural interpretation.
  • Applications such as grammar checking, formal semantic analysis, and text-analysis tools that model the interactions between sentence elements can all make use of the parse trees that CFGs assign.
  • Although pure CFG is rarely used in practice for broad-coverage natural language grammars, it has been one of the most popular techniques in speech recognition.
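To make the declarative/procedural distinction concrete, here is a minimal top-down recognizer sketch (not an efficient parsing algorithm, and the grammar is a toy invented for this example): the rules are the declarative part, and the recursive procedure is one possible procedural interpretation of them.

```python
# Toy grammar: each key is a non-terminal, each value a list of alternatives.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"], ["the", "cat"]],
    "VP": [["barks"], ["sees", "NP"]],
}

def parses(sym, words):
    """True if `sym` derives exactly the word sequence `words`."""
    if sym not in RULES:                       # terminal: must match one word
        return len(words) == 1 and words[0] == sym
    return any(splits(rhs, words) for rhs in RULES[sym])

def splits(rhs, words):
    """True if the RHS symbols can be matched against some split of `words`."""
    if not rhs:
        return not words
    first, rest = rhs[0], rhs[1:]
    return any(parses(first, words[:k]) and splits(rest, words[k:])
               for k in range(len(words) + 1))

print(parses("S", "the dog sees the cat".split()))  # True
```

The exhaustive split search is exponential in general; real parsers use chart-based dynamic programming instead, but the grammar itself is unchanged either way.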

Characteristics and Differences


Constituent Structure

CFGs naturally produce constituent structures, recursively breaking a sentence down into smaller parts such as noun phrases and verb phrases. These structures are depicted as trees, with interior nodes labelled by non-terminals and leaf nodes by terminals.
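A constituent tree can be sketched as a plain nested data structure, with non-terminal labels at interior nodes and words at the leaves. The representation and the rendering helper below are illustrative choices, not a standard API:

```python
# A tree is either a word (str) or a pair (label, children).
def render(tree, indent=0):
    """Return the tree as indented lines, each label above its children."""
    if isinstance(tree, str):                  # leaf: a terminal (word)
        return [" " * indent + tree]
    label, children = tree
    lines = [" " * indent + label]             # interior node: a non-terminal
    for child in children:
        lines += render(child, indent + 2)
    return lines

tree = ("S", [("NP", ["the", "dog"]), ("VP", ["barks"])])
print("\n".join(render(tree)))
```

The printed indentation mirrors the recursive breakdown: S dominates NP and VP, and each phrase dominates its words.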

Normal Forms

Some parsing methods require the CFG to be in a specific normal form. A popular one is Chomsky Normal Form (CNF), which restricts productions to unary terminal rules (A → w) and binary non-terminal rules (A → B C). A loosened CNF may permit non-empty non-terminal rules of any length (A → B₁…B_d).
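CNF is what makes the classic CYK algorithm work: with only A → w and A → B C rules, every span of the input can be analysed by combining exactly two sub-spans. Below is a minimal CYK recognizer sketch over an illustrative CNF grammar (the rule tables and names are toy choices for this example):

```python
# CNF grammar, split into its two rule shapes.
UNARY  = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}   # A -> w
BINARY = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}  # A -> B C

def cyk(words, start="S"):
    """True if `start` derives `words` under the CNF grammar above."""
    n = len(words)
    # table[i][j] holds the non-terminals that derive words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(UNARY.get(w, ()))
    for span in range(2, n + 1):               # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # try every split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n]

print(cyk("the dog chased the cat".split()))  # True
```

The dynamic-programming table gives O(n³) recognition, which is why parsers often convert grammars to (loosened) CNF first.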

Equivalence

Two different CFGs can generate the same language. Grammars can be weakly equivalent (same strings, possibly different phrase structure) or strongly equivalent (same strings, identical phrase structure). Eliminating empty productions yields a weakly equivalent grammar.

Recursive Productions

A non-terminal can occur on both the LHS and RHS of a rule, analogous to self-loops in finite-state automata, enabling recursive structures. As a result, the grammar can describe languages with arbitrarily deep structure, such as arithmetic expressions with balanced brackets.
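The balanced-brackets case can be sketched in a few lines. Under the recursive rule S → ( S ) | ε (a standard textbook example, not taken from this article), each extra application of the recursive alternative adds one level of nesting:

```python
def derive(depth):
    """Apply S -> ( S ) `depth` times, then S -> empty; return the yield."""
    return "" if depth == 0 else "(" + derive(depth - 1) + ")"

print([derive(d) for d in range(4)])  # ['', '()', '(())', '((()))']
```

No finite-state automaton can recognize this language, because matching arbitrarily deep brackets requires unbounded memory; a single recursive CFG rule captures it directly.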

Grammatical ambiguity

A CFG may assign a sentence more than one possible parse tree. Such ambiguity is a serious issue in parsing.
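Ambiguity is easy to exhibit with the classic arithmetic grammar E → E + E | a (a standard example, not from this article): the string “a + a + a” has two parse trees, one left-branching and one right-branching. The sketch below counts the trees by trying every top-level split at a “+”:

```python
def count_parses(words):
    """Number of distinct parse trees for `words` under E -> E '+' E | 'a'."""
    if words == ["a"]:
        return 1
    total = 0
    for k in range(1, len(words) - 1):
        if words[k] == "+":                    # top-level rule E -> E + E split here
            total += count_parses(words[:k]) * count_parses(words[k + 1:])
    return total

print(count_parses("a + a + a".split()))  # 2
```

The count grows as the Catalan numbers (5 trees for four operands, 14 for five), which is why parsers need probabilities or other preferences to choose among analyses.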

Associated with other Formalisms

  • Many other grammar formalisms are based on CFGs.
  • Certain formalisms, such as Generalized Phrase-Structure Grammar (GPSG) and Regulus, are context-free equivalent.
  • Others are strict extensions of CFG, including Lexical-Functional Grammar (LFG) and Head-driven Phrase-Structure Grammar (HPSG).
  • Constraint-based grammars frequently employ feature structures (attribute-value pairs) in place of the atomic categories found in CFGs.
  • Formalisms such as Dependency Grammar, Tree Adjoining Grammar (TAG), and Categorial Grammar have been shown to be equivalent to CFG or to a CFG extension. Combinatory Categorial Grammar (CCG) and TAG are examples of weakly equivalent, mildly context-sensitive formalisms.
  • Through lexicalization, CFGs can incorporate lexical information and thereby express the lexical syntactic preferences that are essential for disambiguation.
  • Probabilistic context-free grammars (PCFGs) are a straightforward extension of CFGs that assign a probability to each production rule. They attach probabilities to parse trees, helping to resolve parsing ambiguity, and are the most basic probabilistic model of recursive embedding. PCFGs are a special case of Weighted Context-Free Grammars (WCFGs).
  • CFGs correspond to pushdown automata.
  • Syntax-Directed Transduction Grammars (SDTGs) are also known as synchronous context-free grammars.
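The PCFG idea can be sketched directly: score a parse tree as the product of the probabilities of the rules it uses. The rule table, its probabilities, and the tree encoding below are all toy choices invented for this illustration:

```python
# Toy PCFG: (LHS, RHS) -> probability; alternatives for one LHS sum to 1.
PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("it",)): 0.4,
    ("VP", ("V", "NP")): 1.0,
    ("Det", ("the",)): 1.0,
    ("N", ("dog",)): 1.0,
    ("V", ("saw",)): 1.0,
}

def tree_prob(tree):
    """Tree = (label, children) for non-terminals, a plain string for words.
    Returns the product of the probabilities of the rules used."""
    if isinstance(tree, str):
        return 1.0                             # words contribute no rule
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = PROBS[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

t = ("S", [("NP", ["it"]),
           ("VP", [("V", ["saw"]),
                   ("NP", [("Det", ["the"]), ("N", ["dog"])])])])
print(round(tree_prob(t), 6))  # 1.0 * 0.4 * 1.0 * 1.0 * 0.6 * 1.0 * 1.0 = 0.24
```

A probabilistic parser picks, among a sentence’s competing trees, the one with the highest such product, which is how PCFGs help resolve ambiguity.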


Limitations

  • Pure CFG often lacks the expressive power to capture all the quirks of natural language, including agreement, inflection, long-distance dependencies, and certain forms of scrambling, so it is rarely used in practice to build comprehensive natural language grammars.
  • The strict distinction between “grammatical” and “ungrammatical” in formal languages such as those defined by CFGs is an oversimplification of natural language, where grammaticality often depends on context.
  • On contemporary views, natural languages are not entirely context-free and require the power of more expressive formalisms.