
What Is Full-Text Search in PostgreSQL? (With Examples)

Full-Text Search in PostgreSQL

To overcome the shortcomings of conventional SQL pattern-matching operators such as LIKE and ILIKE for natural-language documents, PostgreSQL provides a robust Full-Text Search (FTS) feature. FTS finds documents that match a query and sorts them by relevance, and it also offers language support, search-result ranking, and efficient indexing.

Introduction to Full-Text Search

Traditional operators like LIKE and ILIKE are inadequate for modern information systems: they lack linguistic support (e.g., treating derived words like “satisfies” and “satisfy” as the same word), provide no inherent ranking of results, and can be slow without index support, often scanning every document. PostgreSQL FTS gets around these problems by preprocessing documents and queries. This preprocessing includes:

Parsing documents: Breaking raw text into tokens, such as words, numbers, and email addresses.

Converting tokens to lexemes: Normalizing tokens (e.g., folding uppercase to lowercase, deleting stop words, and removing suffixes) so that different forms of a word match. Lexemes are normalized strings that are useful for searching and indexing.

Storing preprocessed documents: Representing documents in a form optimized for searching, usually as sorted arrays of normalized lexemes with positional data for proximity ranking.
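All three preprocessing steps can be observed directly in the output of to_tsvector, which parses, normalizes, and stores a document in one call (the sample sentence here is only illustrative):

```sql
-- Tokens are parsed, folded to lowercase, stemmed, and stop words removed;
-- each lexeme is stored with its position(s) in the original text.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
-- Typical result: 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```

Note how “rats” is stemmed to “rat”, “fat” records both of its positions, and stop words like “a”, “on”, and “it” disappear entirely.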

FTS uses two special data types:

tsvector: Stores distinct words (lexemes) after variants have been reduced to a common normal form, representing a document in a format ideal for text search. It can also hold positions and integer weights (A, B, C, and D) for lexemes to support ranking and to reflect document structure.

tsquery: Defines a full-text query by combining lexemes with Boolean operators (& for AND, | for OR, ! for NOT) and phrase-search operators (<-> for FOLLOWED BY, or <N> for FOLLOWED BY at a given distance N). Lexemes in a tsquery can also be labeled with weights, or with an asterisk (*) for prefix matching.
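A quick sketch of these operators in action, using the @@ match operator (the sample texts are illustrative):

```sql
-- Boolean AND: both lexemes must appear somewhere in the document
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat');          -- true

-- FOLLOWED BY: 'fatal' must be immediately followed by 'error'
SELECT to_tsvector('english', 'fatal error occurred')
       @@ to_tsquery('english', 'fatal <-> error');    -- true

-- Prefix matching with *: matches 'supernova', 'superb', etc.
SELECT to_tsvector('english', 'supernova')
       @@ to_tsquery('english', 'super:*');            -- true
```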

Tables and Indexes for Full-Text Search

Although FTS can be carried out without an index, doing so is usually too slow for real-world applications, so creating an index is typically necessary. To speed up full-text searches, PostgreSQL provides several index types:

GIN (Generalized Inverted Index): GIN is the preferred index type for text search. A GIN index stores a compact list of matching locations for every word (lexeme). GIN indexes are somewhat slower to update and take longer to build, but lookups are roughly three times faster than with GiST indexes. Because GIN indexes store only lexemes, not their weights, queries involving weights may require a recheck of the table row.

GiST (Generalized Search Tree): The tsvector and tsquery types can also be indexed with GiST. A GiST index is lossy, so it can produce false matches, and a check of the actual table row is needed to eliminate them. GiST indexes perform well when there are fewer than about 100,000 distinct lexemes, and they are faster to update for dynamic data.

BRIN (Block Range Index): BRIN is helpful when huge datasets are naturally clustered on particular columns, since it offers a trade-off between index size and search efficiency.

RUM: The RUM index method, available as a PostgreSQL extension, builds on the GIN (Generalized Inverted Index) approach. Its main advantages are faster phrase search and relevance-sorted results.

Text search indexes are usually created with CREATE INDEX using GIN or GiST on a tsvector expression or column. For example: CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));. It is often advisable to store the tsvector representation in a separate column and keep it updated with a trigger whenever the text changes. By avoiding the need to explicitly provide the text search configuration in queries, this speeds up searches and enables index-only scans.

Code Example:

-- Table with a precomputed tsvector column
DROP TABLE IF EXISTS articles;
CREATE TABLE articles(
    id SERIAL PRIMARY KEY,
    body TEXT,
    tsv tsvector
);

-- Populate the tsvector alongside the raw text
INSERT INTO articles(body, tsv) VALUES
('PostgreSQL supports full text search with GIN index', to_tsvector('english', 'PostgreSQL supports full text search with GIN index')),
('GiST index can also be used for text search', to_tsvector('english', 'GiST index can also be used for text search')),
('BRIN index is useful for large clustered data', to_tsvector('english', 'BRIN index is useful for large clustered data'));

-- GIN index on the tsvector column
CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

-- Find documents containing both 'search' and 'index'
SELECT id, body
FROM articles
WHERE tsv @@ to_tsquery('english', 'search & index');

Output:

DROP TABLE
CREATE TABLE
INSERT 0 3
CREATE INDEX
 id |                        body                         
----+-----------------------------------------------------
  1 | PostgreSQL supports full text search with GIN index
  2 | GiST index can also be used for text search
(2 rows)
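Rather than filling in tsv by hand as above, the column can be kept in sync automatically. A sketch of two common approaches, reusing the articles table from the example (the column and trigger names are illustrative):

```sql
-- Option 1 (PostgreSQL 12+): a stored generated column that
-- PostgreSQL recomputes whenever body changes
ALTER TABLE articles
    ADD COLUMN tsv_auto tsvector
    GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;

-- Option 2 (older versions): the built-in trigger function
-- (use EXECUTE PROCEDURE instead of EXECUTE FUNCTION before PostgreSQL 11)
CREATE TRIGGER articles_tsv_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW
    EXECUTE FUNCTION tsvector_update_trigger(tsv, 'pg_catalog.english', body);
```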

Controlling Text Search

Parsing Documents: To create a tsvector from raw text, use the to_tsvector function. It parses the document into tokens, reduces the tokens to lexemes using dictionaries, and returns the lexemes together with their positions. The process requires a text search configuration that specifies the parser and a set of dictionaries for the various token types. For example, “rats” becomes “rat” because a dictionary recognizes it as a plural form. Stop words (such as “a,” “on,” and “it”) are discarded because they are too common to be useful in searches.
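Stop-word removal can be seen directly; note that the positions of the removed stop words are still counted, which is why gaps appear in the numbering:

```sql
SELECT to_tsvector('english', 'in the list of stop words');
-- Typical result: 'list':3 'stop':5 'word':6
```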

PostgreSQL offers a number of functions to transform user-written text into a query:

  • to_tsquery: Generates a tsquery from text that already contains tsquery operators (&, |, !, <->). It normalizes tokens to lexemes and eliminates stop words according to the configuration.
  • plainto_tsquery: Transforms unformatted text, using parsing and normalization similar to to_tsvector, then inserts & (AND) operators between the remaining words. It ignores Boolean operators, weights, and prefix-match labels in its input.
  • phraseto_tsquery: Combines words using the <-> (FOLLOWED BY) operator and handles stop words inside phrases, making it useful for searching exact lexeme sequences.
  • websearch_to_tsquery: A simplified form of to_tsquery with syntax similar to web search engines, supporting quotes for phrases and hyphens for negation (e.g., “supernovae stars” -crab).
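The differences between the four functions show up clearly when they are applied to similar input (the results shown as comments are typical for the built-in english configuration):

```sql
SELECT to_tsquery('english', 'Fat | Rats:AB');
-- 'fat' | 'rat':AB

SELECT plainto_tsquery('english', 'The Fat Rats');
-- 'fat' & 'rat'

SELECT phraseto_tsquery('english', 'The Fat Rats');
-- 'fat' <-> 'rat'

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
-- 'supernova' <-> 'star' & !'crab'
```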

Ranking Search Results: Ranking measures how relevant a document is to a query, so that the most relevant matches can be shown first. PostgreSQL provides two predefined ranking functions:

ts_rank: Ranks documents primarily by the frequency of matching lexemes, returning a relevance score so the most relevant matches can be displayed first.

ts_rank_cd: Computes a cover density ranking, which also takes the proximity of matching lexemes into account when scoring documents.

Both functions take into account lexical, proximity, and structural information: term frequency, term proximity, and the importance of the document part where terms occur. Weights assigned to lexemes (A, B, C, and D) typically indicate the importance of words from specific document regions (title vs. body, for example). Normalization options can adjust ranks for document length, since longer documents are more likely to contain query terms. Ranking can be a costly operation, because the tsvector of every matching document must be examined.
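A minimal ranking query, reusing the articles table from the earlier example; ts_rank_cd could be substituted for ts_rank:

```sql
-- Rank matching articles by relevance, best matches first
SELECT id, body, ts_rank(tsv, query) AS rank
FROM articles, to_tsquery('english', 'search & index') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```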

Highlighting Results: The ts_headline function is designed to display search results attractively by producing an excerpt of the document with the query terms highlighted. It supports options such as:

  • StartSel and StopSel: Strings used to delimit highlighted query words (for example, <b> and </b> for HTML output).
  • MaxWords and MinWords: The longest and shortest headlines to produce.
  • ShortWord: Words of this length or less are dropped at the start or end of a headline.
  • HighlightAll: A Boolean flag; when true, the whole document is used as the headline.
  • MaxFragments: The maximum number of text excerpts to display. A value greater than zero selects fragment-based headline generation, which picks and stretches fragments around query words.
  • FragmentDelimiter: The string used to separate fragments (default: “ ... ”).

Because ts_headline works on the original document text rather than the tsvector, it can be slow and should be used with care.
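A sketch of ts_headline with a few of the options above; the <b>/</b> markers are chosen here for HTML output:

```sql
SELECT ts_headline('english',
    'PostgreSQL supports full text search with GIN index',
    to_tsquery('english', 'search & index'),
    'StartSel=<b>, StopSel=</b>, MaxWords=15, MinWords=5');
-- e.g. PostgreSQL supports full text <b>search</b> with GIN <b>index</b>
```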

psql Support for Text Search

The following commands can be used to examine text search objects in the psql interactive terminal:

\dF[+] [PATTERN]: Lists text search configurations. Without the +, configurations are simply listed; adding + shows extra details, such as the parser and the dictionary mappings for each token type.

\dFd[+] [PATTERN]: Lists text search dictionaries. Adding + shows additional details for each dictionary.

\dFp[+] [PATTERN]: Lists text search parsers. Adding + shows more information about each parser, including its functions, such as “Start parse”, “Get next token”, “End parse”, “Get headline”, and “Get token types”.

\dFt[+] [PATTERN]: Lists text search templates, which provide the functions that dictionaries are built on. Adding + shows more details for each template.

You can test and debug custom text search setups with these commands.
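A short inspection session in psql might look like this (output omitted, since it depends on the configurations installed in your cluster):

```
\dF english
\dF+ english
\dFp+ default
```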

DBeaver: Database full text Search

In addition to basic SQL tasks, DBeaver is a database administration tool that offers features for managing different databases, including a full-text search facility.

In DBeaver, to conduct a full-text search:

  • In the main toolbar, click the arrow next to the Search icon and choose Database full text Search from the dropdown menu, or click the Search button on the main menu and open the DB Full-Text tab in the Search window.
  • In the Databases column, expand the tree and tick checkboxes to select the database connection(s) or individual database objects to search. Once at least one checkbox is selected, the Search button becomes active.
  • Available options include a case-sensitive search, a fast search, and searching in numbers and LOBs (Large Objects).
  • Search results are displayed in a separate Search view; double-clicking a row in this view opens the matching object in a dedicated Database Object editor.

In addition, DBeaver offers other search types, including File Search for file contents and Database metadata Search for database metadata (such as object names and types). The search functionality supports wildcards, exact names, regular expressions, and incremental matching.

Kowsalya
Hi, I'm Kowsalya, a B.Com graduate currently working as an Author at Govindhtech Solutions. I'm deeply passionate about publishing the latest tech news and tutorials, bringing insightful updates to readers. I enjoy creating step-by-step guides and making complex topics easier to understand for everyone.