Text Representation in NLP
Text representation in natural language processing is the process of transforming raw text input into a format that computer systems can understand and work with. At the lowest level, computers encode text through character encodings, which map characters to distinct integers.
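As a small illustration of this lowest level (a sketch using Python's built-in ord and str.encode; the example string is arbitrary):

```python
# Each character maps to an integer code point; an encoding such as UTF-8
# then serializes those code points to bytes for storage and transmission.
text = "NLP"
code_points = [ord(ch) for ch in text]   # [78, 76, 80]
utf8_bytes = list(text.encode("utf-8"))  # [78, 76, 80] for ASCII characters
print(code_points, utf8_bytes)
```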
Text can be viewed and represented at different levels of abstraction and granularity:

As Strings of Characters: At the most basic level, a text is simply a sequence of characters. This is how text is usually stored and displayed.
As Sequences of Tokens/Words: In NLP, it is common to treat a text as a sequence of words. This requires tokenization, the process of splitting text into discrete units called tokens, typically individual words. Tokenization is an essential first step before further language processing.
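A minimal regex-based tokenizer, sketched in Python (the pattern and example sentence are illustrative assumptions; practical systems usually rely on a library tokenizer):

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenization splits text into tokens, usually words."))
# ['Tokenization', 'splits', 'text', 'into', 'tokens', ',', 'usually', 'words', '.']
```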
Word-Internal Structure: Word representations can also capture internal structure, for example through character or morpheme embeddings. The Finite State Transducer (FST) is a prominent approach for handling morphology and orthographic (spelling) variation by mapping between surface forms (word strings) and underlying forms (morphemes). Multi-tape FSTs can be used for languages such as Arabic, whose morphology is especially complex.
Vector Space Models, in which text components are represented as vectors (lists or arrays of numbers), are a popular approach to text representation for many NLP applications. This usually involves text vectorization, which turns the words in a document into meaningful numbers; this step is also known as feature extraction.
Vector Representation Methods
These vector representations can be created using a variety of methods:
Bag of Words (BoW): In this popular method for text classification, a document is represented as a vector of word counts or frequencies, ignoring the order in which the words occur. It records how many times each word appears, but not where. Because individual words can be strong predictors, the BoW model can be surprisingly effective for text classification even though it discards syntax and sentence boundaries. A related representation, the continuous bag of words (CBOW), represents a document as the sum of continuous, low-dimensional word vectors.
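A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer and two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()       # counts words, ignores their order
X = vectorizer.fit_transform(docs)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```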
Bag of N-Grams: This extends BoW by capturing the frequency of n-word sequences (n-grams), preserving some local word order.
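The same CountVectorizer sketch extends to a bag of n-grams through its ngram_range parameter (again an assumed library choice):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) counts unigrams and bigrams rather than single words only.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["the cat sat on the mat"])
print(bigram_vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
```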
TF-IDF (Term Frequency-Inverse Document Frequency): This widely used text vectorization approach weights words by how important they are to a document relative to the rest of the corpus. It is used in text classification and information retrieval.
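A minimal TF-IDF sketch, assuming scikit-learn's TfidfVectorizer and a toy three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["information retrieval ranks documents",
          "text classification assigns labels to documents",
          "tf idf weights words by importance"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)   # rows: documents, columns: tf-idf weight per word

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))       # words shared across documents receive lower weights
```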
Word Embeddings / Distributed Representations: These techniques represent words as dense, k-dimensional vectors of real numbers. The approach rests on the distributional hypothesis, which holds that a word's meaning can be inferred from the company it keeps. Word embeddings capture semantic relationships between words.
They can be learned from large amounts of unlabelled data. Word2Vec, GloVe, FastText, and BERT are well-known examples. Contextualized embeddings such as BERT go a step further by producing an embedding that captures a word's meaning in its particular textual context; these are currently the most widely used word representations. A neural network often begins with a lookup layer that retrieves the embedding of each word in the input text. Character or morpheme embeddings can also be combined to build word representations.
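A toy Word2Vec sketch using the gensim library (the library choice, hyperparameters, and three-sentence corpus are assumptions for illustration; useful embeddings require much larger corpora):

```python
from gensim.models import Word2Vec

# Tokenized sentences; real training data would be orders of magnitude larger.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"],
             ["cats", "and", "dogs", "are", "animals"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])                   # first 5 dimensions of the dense vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in embedding space
```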
One-Hot Encoding: A simple vector representation in which each distinct word or character is represented by a vector containing a 1 at its own position and 0 everywhere else. These vectors are typically sparse and high-dimensional.
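A one-hot encoding sketch over a tiny assumed vocabulary:

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's own index and 0 everywhere else."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))   # [0 1 0 0]
```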
Matrix Representations:
- A term-document matrix represents words as row vectors and documents as column vectors, based on word counts.
- A word-context matrix captures the distributional properties of words by representing each word as a sparse vector encoding the (weighted) bag of contexts in which it occurs.
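A small sketch of both matrices, assuming scikit-learn for counting and treating each whole document as a word's context:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked", "the cat and the dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # document-term counts

term_document = X.T.toarray()        # rows: words, columns: documents
word_context = (X.T @ X).toarray()   # word-word co-occurrence counts, document as context

for word, row in zip(vectorizer.get_feature_names_out(), term_document):
    print(f"{word:8s} {row}")
```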
Hashing Vectorizer: Uses a hashing algorithm to map words to a fixed-size feature space.
In addition to flat vector representations of words or documents, text may also be represented using:
Hierarchical Structures: Text can be analyzed in terms of hierarchical structure, such as the elements of an HTML document.
Syntactic Structures: Parse trees, for instance, can represent a sentence's syntactic structure, and transitions between syntactic categories can also serve as the basis for feature vectors.
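A small example of a syntactic parse tree in bracketed notation, loaded with NLTK's Tree class (the sentence and its parse are assumed for illustration):

```python
from nltk.tree import Tree

# A constituency parse of "The cat sat on the mat".
parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

parse.pretty_print()    # ASCII rendering of the tree structure
print(parse.leaves())   # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```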
Semantic Representations: This involves building formal representations that capture meaning, such as logical forms used in machine translation, neo-Davidsonian event representations for sentences, or annotations of predicates and arguments. Contextual embeddings in particular can be used to represent word senses.
Topic Models: Methods such as Latent Dirichlet Allocation (LDA) discover themes within a corpus and represent each document by its distribution over those topics.
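A brief LDA sketch with scikit-learn (the toy corpus, the choice of two topics, and the library are assumptions; meaningful topic models need far more documents):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game ended with a late goal",
        "the team won the match",
        "the election results were announced",
        "voters went to the polls"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # each row: a document's distribution over topics
print(doc_topics.round(2))

words = vectorizer.get_feature_names_out()
for topic in lda.components_:       # topic-word weights
    print([words[i] for i in topic.argsort()[-3:][::-1]])
```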
Structured Records: News reports, for example, can be represented as structured records of the events they describe.
Semi-structured Text Representations: In documents such as seminar announcements, the physical layout of the text can affect how it is interpreted and, consequently, how it is represented.
Cross-Modal Representations: Text can be represented alongside other modalities, such as images. For instance, word and pixel sequences can be treated as a unified token stream, or images can be encoded as vectors and used to generate text. Text can also be generated from speech, and phonetic sequences can be used to represent pronunciation.
The choice of text representation approach depends heavily on the specific NLP task (text classification, information retrieval, machine translation, text generation, sentiment analysis, etc.) and on the properties of the text data itself. Visualization techniques such as word clouds, bar charts, scatter plots, and heatmaps are also used to explore and present textual data graphically.
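As one small visualization sketch (assuming matplotlib and a toy sentence), a bar chart of the most frequent words:

```python
from collections import Counter
import matplotlib.pyplot as plt

text = "the cat sat on the mat and the dog sat on the log"
words, freqs = zip(*Counter(text.split()).most_common(5))

plt.bar(words, freqs)   # bar chart of the five most frequent words
plt.title("Top word frequencies")
plt.ylabel("Count")
plt.show()
```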