What Are Corpora in NLP? Types, Features, and Examples

What Are Corpora in NLP?


A corpus (plural: corpora) is a collection of real texts or speech transcripts that can be processed by a machine. Corpora are sampled so as to be representative of a particular natural language or linguistic variety. Although representativeness is widely recognized as the characteristic that sets a corpus apart from an arbitrary collection of texts (an archive), it is a flexible term: as “the web as corpus” illustrates, “corpus” is often used more broadly for any sizable collection of language data, even one that has not been assembled systematically.

Natural Language Processing (NLP) research and many other linguistic studies depend heavily on corpora. Corpora provide a material foundation and a test environment for building NLP systems; they are also needed to evaluate algorithms on natural language and to supply data for machine learning techniques. Linguists typically use corpora to locate instances of particular words or patterns that support or refute their hypotheses, while lexicographers use them to find evidence of word usage when writing dictionary entries.

Types of Corpora and Their Features

Corpora can be classified in a number of ways:
  • Monolingual corpora contain data from a single language.
  • Comparable corpora, used for contrastive research, comprise monolingual corpora in several languages collected with comparable representativeness, balance, and sampling periods. A multilingual comparable corpus samples several languages using similar methods. The term “comparable corpora” is occasionally also applied to corpora that contain several regional varieties of the same language, such as the International Corpus of English (ICE), although by one definition these corpora are not comparable.
  • Original texts in one language with translations in one or more additional languages are referred to as parallel corpora. For NLP applications like machine translation, computer-aided translation, and multilingual information extraction and retrieval, aligned parallel corpora are essential. The Europarl corpus and the Canadian Hansards are two examples.
  • A general description of a language or linguistic variant is usually based on general corpora.
  • Specialized corpora often pertain to a particular genre (e.g., newspaper text) or domain (e.g., medicine). They may be gathered for specific purposes, such as implementing an application that answers emails.
  • Balanced corpora contain texts from many genres. To be representative of the language or variety being studied, a balanced corpus typically includes a large number of text categories sampled proportionately. The British National Corpus (BNC), which includes both spoken and written data, is regarded as balanced. The Lancaster-Oslo/Bergen (LOB) corpus (British English) and the Brown Corpus (US English) are examples of early balanced corpora (a short NLTK sketch follows this list).
  • Corpus structure ranges from simple text collections to texts organized into categories (genre, source, author, language), sometimes with overlapping categories (subject) or a temporal framework.
  • The majority of corpora are static resources, but a new field of study involves creating user-expandable dynamic treebanks.
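
As a concrete illustration of a balanced, genre-categorized monolingual corpus, the sketch below inspects the Brown Corpus through NLTK. It is a minimal sketch, assuming NLTK is installed and the corpus data has been downloaded.

```python
# A minimal sketch: inspecting the Brown Corpus (a balanced, genre-categorized
# monolingual corpus) with NLTK. Assumes `pip install nltk` and that the corpus
# data is fetched with nltk.download().
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# The genre categories that make the corpus "balanced".
print(brown.categories())          # e.g. ['adventure', 'belles_lettres', ...]

# Compare the amount of material sampled for two genres.
for genre in ["news", "romance"]:
    print(genre, len(brown.words(categories=genre)))
```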

When creating a corpus, important factors to consider are:

  • Size of the Corpus: The research question determines how much material is required. Exploring specific aspects, such as lexical development using saturation approaches, requires large corpora. Because linear corpora, particularly plain or labelled ones, can be extremely large (billions of tokens), optimized search techniques are necessary (a small sketch for measuring corpus size follows this list).
  • Representativeness and Sampling: The degree to which a sample captures the whole spectrum of population diversity is known as representativeness. Balancing genres, domains, and media and selecting text chunks for each genre considerably affect how representative most corpora are. For a generic corpus to be representative, it must include as many text kinds as possible.
  • Balance: Describes the corpus’s diverse text kinds. The planned applications of the corpus dictate the appropriate proportion.
  • Information gathering and copyright.
  • Markup and annotation for corpora.
  • Corpus Annotation: By incorporating linguistic analysis, corpus annotation enhances a corpus and significantly broadens the scope of research issues that may be investigated. In order to extract linguistic information, this is frequently required. The creation of corpora has benefited greatly from NLP research, particularly in the area of annotation.
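
To make the size consideration concrete, the following sketch counts tokens and distinct word forms for one NLTK corpus. The choice of the Brown Corpus and the type/token ratio as a rough measure are illustrative assumptions, not prescriptions from the text.

```python
# A minimal sketch: measuring corpus size in tokens and types with NLTK.
# The Brown Corpus is used purely as an example; real "big data" corpora run to
# billions of tokens and need optimized indexing rather than this in-memory pass.
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

tokens = brown.words()                      # every running word (token)
types = set(w.lower() for w in tokens)      # distinct word forms (types)

print("tokens:", len(tokens))
print("types:", len(types))
print("type/token ratio:", len(types) / len(tokens))
```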

Annotation layers

Morphosyntactic annotation: Such as lemma, inflection information, and Part-of-Speech (POS) labelling. Two of the earliest corpora to be morphosyntactically labelled were the Brown and LOB corpora.
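
As an illustration, the sketch below reads existing POS annotation from the Brown Corpus and produces new tags for a fresh sentence with NLTK's default tagger; the specific tagger and tagset are assumptions made for the example, not part of the original annotation projects.

```python
# A minimal sketch of morphosyntactic (POS) annotation with NLTK.
import nltk
from nltk.corpus import brown

nltk.download(["brown", "universal_tagset", "averaged_perceptron_tagger"], quiet=True)

# Reading existing POS annotation from the Brown Corpus.
print(brown.tagged_words(tagset="universal")[:10])   # [('The', 'DET'), ...]

# Producing new POS annotation for raw text (simple whitespace tokenization).
tokens = "Corpora provide evidence of word usage".split()
print(nltk.pos_tag(tokens))
```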

Syntactic annotation: Incorporating functions and phrase structure (typically in Treebanks).
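
A brief sketch of syntactic annotation, using the small sample of the Penn Treebank shipped with NLTK (an assumption made for convenience; the full treebank is licensed separately):

```python
# A minimal sketch: reading phrase-structure annotation from the Penn Treebank
# sample distributed with NLTK.
import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)

tree = treebank.parsed_sents()[0]    # first annotated sentence as an nltk.Tree
print(tree)                          # bracketed phrase-structure tree
print(tree.leaves())                 # the words themselves
```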

Semantic annotation: Such as anaphora and word sense. OntoNotes is a large sense-annotated corpus. In English, word senses are generally annotated separately for each part of speech; in Chinese, this is done per lemma.
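
For a small, freely available example of sense annotation, NLTK ships a sample of SemCor, whose words are linked to WordNet senses. The reader methods below reflect NLTK's SemCor corpus reader as the author understands it and should be treated as an assumption.

```python
# A minimal sketch: inspecting sense (semantic) annotation in the SemCor sample
# bundled with NLTK. Sense-tagged chunks carry WordNet lemma labels.
import nltk
from nltk.corpus import semcor

nltk.download(["semcor", "wordnet"], quiet=True)

# Each sentence is a list of chunks; tagged chunks are small trees whose labels
# are WordNet lemmas.
first_sentence = semcor.tagged_sents(tag="sem")[0]
for chunk in first_sentence[:10]:
    print(chunk)
```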

Discourse annotation: Including temporal relations and dialogue acts.
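
The NPS Chat corpus in NLTK is one freely available example of dialogue-act annotation; the sketch below is an assumed illustration of reading those labels, not a description of any particular annotation project named above.

```python
# A minimal sketch: reading dialogue-act labels from the NPS Chat corpus.
import nltk
from nltk.corpus import nps_chat

nltk.download("nps_chat", quiet=True)

posts = nps_chat.xml_posts()
for post in posts[:5]:
    # Each post carries its dialogue-act label in the 'class' attribute.
    print(post.get("class"), "->", post.text)
```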

External annotation: Including details such as italics.

Annotated text corpora are those that have linguistic annotations that reflect these levels. Automatic annotation methods, occasionally combined with post-editing for error correction, are used in many published corpora.

Searching Corpora: For computer scientists and linguists, the ability to search corpora is essential. Search techniques for large linear corpora must be tuned for very big datasets. Regular expressions and Boolean expressions may be used for searching, and the results can be processed into frequency lists and statistical data using tools such as Manatee/Bonito and the Stuttgart Corpus Workbench. Treebank query languages make it possible to search for linguistic phenomena encoded in treebanks.
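
As a toy illustration of regular-expression searching and frequency lists, the sketch below uses plain Python and NLTK rather than the dedicated engines named above; the corpus and the pattern are arbitrary choices for the example.

```python
# A minimal sketch: regular-expression search over a corpus plus a frequency
# list, using Python/NLTK instead of engines like Manatee/Bonito or the CWB.
import re
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg", quiet=True)

text = gutenberg.raw("austen-emma.txt")

# Find all words ending in "-ness" (a simple regular-expression query).
hits = re.findall(r"\b\w+ness\b", text.lower())

# Turn the hits into a frequency list.
freq = nltk.FreqDist(hits)
print(freq.most_common(10))
```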

Other Language Resources: In addition to corpora, NLP requires a variety of lexical and knowledge resources. These range from simple sorted word lists to resources with complex structure.

Examples of Corpora and Other Language Resources

Lexicons: An umbrella term for resources that provide linguistic information about words or phrases. The TIMIT Corpus includes a lexicon. A lexicon connects an ontology with a natural language.

Dictionaries: Provide word meanings, usage examples, and other linguistic details. Machine-readable dictionaries have served as information sources for word sense disambiguation. One well-known resource is the Longman Dictionary of Contemporary English (LDOCE).
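
The simplified Lesk algorithm in NLTK illustrates dictionary-style glosses being used for word sense disambiguation; it draws its glosses from WordNet rather than LDOCE, an assumption made here only because WordNet is freely available.

```python
# A minimal sketch: gloss-based word sense disambiguation with NLTK's
# simplified Lesk implementation (glosses come from WordNet, not LDOCE).
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

sentence = "I went to the bank to deposit my money".split()
sense = lesk(sentence, "bank", "n")    # pick the WordNet sense best matching the context
print(sense, "->", sense.definition())
```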

Thesauri: Group words according to shared meaning. Roget’s Thesaurus is a well-known resource that has been used for word sense disambiguation.

Ontologies: Formal depictions of knowledge as a collection of ideas in a field and the connections among them. Semantic research is increasingly emphasising ontology-based methods, which are pertinent to fields like anaphora resolution and lexical semantics. Two examples of lexical resources with complicated relations or ontology-like structures are the Gene Ontology and the Unified Medical Language System (UMLS). An ontology is a source of static knowledge.

Fact Databases: Sources of static knowledge.

Onomasticons: Name lexicons.

Comparative wordlists: Used as lexical resources and in lexicography. The Swadesh wordlists are one example.
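
NLTK distributes the Swadesh wordlists, which can be paired across languages as a simple comparative lexical resource; the sketch below assumes that corpus package is available for download.

```python
# A minimal sketch: using the Swadesh comparative wordlists shipped with NLTK.
import nltk
from nltk.corpus import swadesh

nltk.download("swadesh", quiet=True)

print(swadesh.fileids()[:10])              # language codes, e.g. 'en', 'de', 'fr'
print(swadesh.words("en")[:10])            # the English wordlist

# Pair two languages to get a simple bilingual glossary.
fr2en = dict(swadesh.entries(["fr", "en"]))
print(fr2en["chien"])                      # -> 'dog'
```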

WordNet: A computerized English dictionary that arranges words into a hierarchy of synsets (groups of words with the same or nearly the same meaning) and defines further relations between them. It can be downloaded for free and makes extensive use of example sentences. OntoNotes uses SuperSense tagging, which assigns concepts from WordNet’s inventory of 41 supersense types.
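
A short sketch of querying WordNet through NLTK, showing synsets, definitions, example sentences, and the hypernym relation:

```python
# A minimal sketch: exploring WordNet synsets and relations via NLTK.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

for syn in wn.synsets("car")[:3]:
    print(syn.name(), "-", syn.definition())
    print("  lemmas:", syn.lemma_names())
    print("  hypernyms:", [h.name() for h in syn.hypernyms()])

# Example sentences are part of many synset entries.
print(wn.synset("car.n.01").examples())
```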

VerbNet: Contains verbs arranged hierarchically and linked to WordNet.
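
NLTK also ships a copy of VerbNet; the sketch below uses what the author believes to be its reader interface and should be read as an assumption rather than a definitive API reference.

```python
# A minimal sketch: looking up VerbNet classes for a verb through NLTK.
import nltk
from nltk.corpus import verbnet

nltk.download("verbnet", quiet=True)

# VerbNet classes that the lemma 'give' belongs to.
print(verbnet.classids(lemma="give"))   # e.g. ['give-13.1']
```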

Pronouncing dictionaries: Lexical resources such as the CMU Pronouncing Dictionary.
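
The CMU Pronouncing Dictionary is available as an NLTK corpus; a brief sketch of looking up pronunciations:

```python
# A minimal sketch: looking up pronunciations in the CMU Pronouncing Dictionary.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)

pron = cmudict.dict()                  # word -> list of phoneme sequences (ARPAbet)
print(pron["corpus"])                  # e.g. [['K', 'AO1', 'R', 'P', 'AH0', 'S']]
print(pron["tomato"])                  # some words have multiple pronunciations
```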

Accessing and Using Resources

The Natural Language Toolkit (NLTK) is one tool for accessing text corpora and lexical resources from Python. Some NLTK corpus readers give rapid access through methods such as words(), raw(), and sents(), while others offer richer linguistic material such as POS tags, dialogue tags, and syntax trees.
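
A brief sketch of the basic reader methods named above, applied to the Project Gutenberg sample corpus (the particular corpus is chosen only for illustration):

```python
# A minimal sketch: the basic NLTK corpus reader methods words(), raw(), sents().
import nltk
from nltk.corpus import gutenberg

nltk.download(["gutenberg", "punkt"], quiet=True)

print(gutenberg.fileids()[:3])                      # available documents
print(gutenberg.raw("austen-emma.txt")[:60])        # the text as one string
print(gutenberg.words("austen-emma.txt")[:10])      # tokenized words
print(gutenberg.sents("austen-emma.txt")[:2])       # sentences as word lists
```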

Due to a lack of established writing systems or inadequate support, significant corpora are currently unavailable for many languages. The European Language Resources Agency (ELRA) and the Linguistic Data Consortium (LDC) are important sources of published corpora. They provide hundreds of annotated corpora in many languages that are available under either commercial or non-commercial licenses.

OLAC Metadata, a standard that extends Dublin Core with particular descriptors for language resources, is used by the Open Language Archives Community (OLAC) to offer infrastructure for documenting and finding language resources. Catalogues published by OLAC repositories can be retrieved and searched through the OLAC website.

Data collection, annotation, quality assurance, and publishing are all stages in the life cycle of managing linguistic data, which includes creating corpora. It can be necessary to transform existing data into forms that are appropriate for analytic tools. One practical format for exchanging and storing linguistic data is XML. Language documentation projects frequently employ the Toolbox format, and tools to handle and convert these files to XML may be created.
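
NLTK includes a reader for Toolbox files along with a small sample lexicon; the sketch below assumes that reader's entries() method and the bundled 'rotokas.dic' sample, and shows one way such data could be pulled into Python and re-serialized as XML.

```python
# A minimal sketch: reading a Toolbox lexicon with NLTK and re-serializing one
# entry as XML. Assumes the 'toolbox' sample data (rotokas.dic) is available.
import nltk
from xml.etree.ElementTree import Element, SubElement, tostring
from nltk.corpus import toolbox

nltk.download("toolbox", quiet=True)

lexicon = toolbox.entries("rotokas.dic")   # list of (headword, field list) entries
headword, fields = lexicon[0]

# Build a small XML record from the field markers and values.
record = Element("record", lex=headword)
for marker, value in fields:
    SubElement(record, marker).text = value

print(tostring(record, encoding="unicode"))
```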
