What Is the GROUP BY Clause in PostgreSQL, With Example

GROUP BY Clause

In PostgreSQL, the GROUP BY clause is used to group a table’s rows according to the values of one or more given expressions. This enables each group to be subject to aggregate functions. When GROUP BY is used, a single group is created from all rows that have the same set of values for the grouping expressions. Because it enables computations like COUNT, SUM, MAX, MIN, and AVG to be applied to each unique group, yielding a single value for the entire group rather than for individual rows, this clause is essential for data summarisation.

Placed after the WHERE clause, the GROUP BY clause accepts a list of expressions, which may be column names or more complex expressions. A crucial rule applies to the SELECT list: PostgreSQL raises an error if a non-aggregate expression in the SELECT list is not also included in the GROUP BY list, because such a column could hold several different values within one group. (The exception is a column that is functionally dependent on the grouped columns, for example when the table’s primary key is in the GROUP BY list.) For grouping purposes, NULL values are regarded as equal and are gathered into a single group.
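The two rules above (NULLs form one group, and ungrouped columns are rejected) can be seen in a minimal sketch. The staff data and its column names are hypothetical and are supplied inline through a CTE so that nothing needs to be created first.

```sql
-- Hypothetical sample data: two rows have no department.
WITH staff(name, department) AS (
    VALUES ('Alice', 'HR'),
           ('Bob',   'HR'),
           ('Carol', NULL),
           ('Dave',  NULL)
)
SELECT department, COUNT(*) AS emp_count
FROM staff
GROUP BY department;
-- Two groups come back: HR with 2 rows, and a single NULL group with 2 rows.
-- Row order is not guaranteed without ORDER BY.

-- The following variant would be rejected, because "name" is neither
-- aggregated nor listed in GROUP BY:
--   SELECT name, department, COUNT(*) FROM staff GROUP BY department;
```

Wrapping the column in an aggregate such as MIN(name), or adding it to the GROUP BY list, makes the rejected variant valid.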

Core Functionality and Syntax

The GROUP BY clause divides the full input set of records into groups. Records are placed in the same group if they have the same set of values for the specified grouping expressions, so each unique combination of these values identifies a different group.

General form of the GROUP BY clause: SELECT select_list FROM table_expression [WHERE condition] GROUP BY grouping_expression1 [, grouping_expression2, …].
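As an illustration of that general form, here is a small sketch with two grouping expressions and an optional WHERE condition. The orders data is hypothetical and defined inline.

```sql
-- Hypothetical order rows used only for this illustration.
WITH orders(region, product, amount) AS (
    VALUES ('North', 'Widget', 100),
           ('North', 'Widget', 150),
           ('North', 'Gadget', 200),
           ('South', 'Widget',  50)
)
SELECT region, product, SUM(amount) AS total_amount
FROM orders
WHERE amount >= 100          -- optional row filter, applied before grouping
GROUP BY region, product     -- two grouping expressions
ORDER BY region, product;
-- The South row is removed by WHERE, leaving two groups:
-- (North, Gadget) with 200 and (North, Widget) with 250.
```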

Interaction with Aggregate Functions

Aggregate functions, which compute a single value from a set of records, are nearly always used in conjunction with GROUP BY. Typical aggregate functions include:

COUNT(*): The COUNT(*) aggregate function returns the total number of rows in a group or in the complete result set. It counts every row, including rows that contain NULL values, whereas COUNT(column_name) counts only the rows in which that column is not NULL.

SUM(expression): PostgreSQL’s SUM(expression) aggregate function totals a numeric expression for a set of input values. With a GROUP BY clause, it calculates this sum for each designated row group and returns a single total. SUM(expression) returns one sum for the entire table or the rows selected by a WHERE clause if no GROUP BY clause is present.

AVG(expression): The AVG(expression) function calculates the average (arithmetic mean) of all non-null input values in a given set of rows. It is one of the most widely used aggregate functions.

MAX(expression): PostgreSQL’s MAX(expression) aggregate function returns the greatest non-NULL value of the expression across the input rows. It works with any numeric, text, date/time, or enumerated type, as well as inet, interval, money, oid, pg_lsn, tid, xid8, and arrays of these types.

MIN(expression): A core aggregate function in PostgreSQL, MIN(expression) finds the least non-NULL value from a set of input values. MIN() aggregates many input rows to deliver one output. It works with any numeric, text, date/time, or enumerated type, as well as inet, interval, money, oid, pg_lsn, tid, xid8, and arrays of these types. The function always returns the same data type as its input expression.

STRING_AGG(expression, delimiter): PostgreSQL’s STRING_AGG(expression, delimiter) aggregate function concatenates non-null input values from a set of rows into a single string with a delimiter. This function is great for text summarisation and presentation. It takes two arguments: the expression to concatenate and the delimiter string to separate the values in the output.

REGR_SLOPE(Y, X): PostgreSQL’s specialised aggregate function REGR_SLOPE(Y, X) calculates the slope of the least-squares-fit linear equation determined by the (X, Y) pairs. It is one of PostgreSQL’s statistical aggregate functions. The function takes two double precision arguments, Y and X, and returns the slope as a double precision value.
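The aggregate functions listed above can be combined in one grouped query. The sketch below assumes a hypothetical staff data set with a years column (years of service) that is not part of the article’s own example table.

```sql
-- Hypothetical data: salary and years of service per employee.
WITH staff(name, department, salary, years) AS (
    VALUES ('Alice', 'HR', 50000, 2),
           ('Bob',   'HR', 60000, 4),
           ('Carol', 'IT', 70000, 3),
           ('Dave',  'IT', 90000, 7)
)
SELECT department,
       COUNT(*)                             AS emp_count,
       SUM(salary)                          AS total_salary,
       AVG(salary)                          AS avg_salary,
       MIN(salary)                          AS min_salary,
       MAX(salary)                          AS max_salary,
       STRING_AGG(name, ', ' ORDER BY name) AS names,
       REGR_SLOPE(salary, years)            AS salary_per_year  -- Y first, then X
FROM staff
GROUP BY department
ORDER BY department;
```

Each department yields exactly one output row. The ORDER BY inside STRING_AGG is optional and only fixes the order in which names are concatenated.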

GROUP BY X without any aggregate functions and SELECT DISTINCT X are functionally similar, because both return the unique values of X. The difference is that GROUP BY forms groups whose output can be processed further with aggregates, while DISTINCT only removes duplicate rows.
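A short comparison makes the similarity concrete. The temporary table staff_demo is a hypothetical scratch table created only for this sketch.

```sql
-- Hypothetical scratch table, dropped automatically at the end of the session.
CREATE TEMP TABLE staff_demo (name TEXT, department TEXT);
INSERT INTO staff_demo VALUES ('Alice', 'HR'), ('Bob', 'HR'), ('Carol', 'IT');

-- Both queries return the unique departments: HR and IT.
SELECT DISTINCT department FROM staff_demo;
SELECT department FROM staff_demo GROUP BY department;

-- GROUP BY can additionally compute a value for each group.
SELECT department, COUNT(*) AS emp_count
FROM staff_demo
GROUP BY department;
```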

Code Example:

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name TEXT,
    department TEXT,
    salary NUMERIC,
    hire_date DATE
);

INSERT INTO employees (name, department, salary, hire_date) VALUES
('Alice','HR',50000,'2020-01-10');

SELECT COUNT(*) AS total_employees FROM employees;
SELECT department, COUNT(*) AS emp_count FROM employees GROUP BY department;
SELECT SUM(salary) AS total_salary FROM employees;

Output:

DROP TABLE
CREATE TABLE
INSERT 0 1
 total_employees 
-----------------
               1
(1 row)

 department | emp_count 
------------+-----------
 HR         |         1
(1 row)

 total_salary 
--------------
        50000
(1 row)

List Restrictions

When GROUP BY is present, the SELECT list may contain only the grouping expressions, aggregate functions, and expressions derived from the grouping expressions. A non-aggregated column that is not in the GROUP BY clause would be ambiguous, because it can hold multiple values inside a group. The HAVING clause filters groups by a condition in the same way that the WHERE clause filters rows. The point at which each is applied in the logical processing of the query is the primary distinction:

WHERE clause: Before any grouping and aggregation takes place, the WHERE clause filters the input rows. Aggregate functions are not allowed in it.

HAVING clause: After GROUP BY has created the groups and aggregate functions have calculated their values, the HAVING clause filters the groupings. Consequently, aggregate functions are frequently included in the conditions of HAVING clauses.
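The difference between the two filters can be sketched on a small, hypothetical staff data set defined inline:

```sql
-- Hypothetical data used only for this illustration.
WITH staff(name, department, salary) AS (
    VALUES ('Alice', 'HR',  50000),
           ('Bob',   'HR',  60000),
           ('Carol', 'IT',  70000),
           ('Dave',  'IT',  90000),
           ('Erin',  'Ops', 40000)
)
SELECT department, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
FROM staff
WHERE salary >= 50000         -- row filter: Erin is dropped before grouping
GROUP BY department
HAVING AVG(salary) > 60000    -- group filter: HR (average 55000) is dropped
ORDER BY department;
-- Only the IT group remains, with 2 employees and an average of 80000.
```

Moving the AVG(salary) condition into the WHERE clause would be rejected, since aggregates are not allowed there.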

PostgreSQL handles the entire table as a single group if a query invokes aggregate functions without including a GROUP BY clause. In these situations the query returns exactly one row, even when the WHERE clause filters out every input row; COUNT(*) then returns zero and most other aggregates return NULL. The only exception is a HAVING condition that rejects the single group, in which case no row is returned.
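The single-group behaviour can be checked with a WHERE condition that matches nothing. The inline staff data is hypothetical.

```sql
-- No row satisfies the WHERE condition, yet one result row is produced.
WITH staff(name, salary) AS (
    VALUES ('Alice', 50000), ('Bob', 60000)
)
SELECT COUNT(*) AS emp_count, SUM(salary) AS total_salary
FROM staff
WHERE salary > 1000000;
-- One row: emp_count is 0 and total_salary is NULL.

-- Adding a HAVING condition that the single group fails removes that row.
WITH staff(name, salary) AS (
    VALUES ('Alice', 50000), ('Bob', 60000)
)
SELECT COUNT(*) AS emp_count
FROM staff
WHERE salary > 1000000
HAVING COUNT(*) > 0;
-- Zero rows.
```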

Logical Query Processing Order

Writing correct and efficient queries requires an understanding of the order of operations. For a SELECT query, the simplified logical order of steps is usually:

FROM clause: In PostgreSQL, the FROM clause specifies the source of the rows the SELECT query will act on. Sources can be tables, views, set-returning functions, subqueries, or VALUES lists. Multiple sources can be separated by commas to create a cross join (Cartesian product) of their rows, or combined with explicit JOIN clauses (e.g., INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN).

WHERE clause: A key part of the SELECT, UPDATE, and DELETE statements in PostgreSQL, the WHERE clause is used mainly to filter input rows according to a given criterion. In a SELECT statement, it usually comes right after the FROM clause.

GROUP BY clause: The rows that pass the WHERE filter are divided into groups according to the grouping expressions, and aggregate functions are computed for each group.

HAVING clause: PostgreSQL’s HAVING clause filters groups by post-aggregation conditions. The WHERE clause filters input rows before grouping or aggregate computations, while the HAVING clause filters the groups after they have been formed and their aggregates computed.

SELECT list evaluation: In PostgreSQL, a SELECT query’s select list determines the number, names, and types of the columns in the query result. It is a comma-separated list of columns or SQL expressions supplied after the SELECT keyword.

Window functions: PostgreSQL window functions are aggregate-like functions that perform a calculation across a set of table rows related to the current row, called a “window,” without collapsing them into a single output record. Unlike normal aggregate functions (e.g., SUM, AVG, COUNT), window functions allow each input row to preserve its identity in the output while still accessing data from other rows in its window. They are evaluated after grouping, aggregation, and HAVING filtering, as part of computing the select list.

DISTINCT: PostgreSQL’s DISTINCT keyword removes duplicate rows from the result of a SELECT statement. When DISTINCT is used, PostgreSQL returns only unique combinations of values across all SELECT columns. The opposite, ALL, is the default and returns all rows, including duplicates.

ORDER BY clause: The PostgreSQL ORDER BY clause sorts query results in a defined order. SELECT statements without an ORDER BY clause may return rows in any order, depending on factors such as physical placement on disk or the join algorithms used.

OFFSET and LIMIT: PostgreSQL’s OFFSET and LIMIT clauses retrieve a subset of SELECT rows. The LIMIT clause sets the maximum number of rows to return, while the OFFSET clause specifies how many rows to skip from the beginning of the result set before returning results.
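The logical steps described above can be seen together in one query. This is a sketch over hypothetical inline data; the comments describe the logical order, which is not necessarily the physical order the planner chooses.

```sql
-- Hypothetical data used only for this illustration.
WITH staff(name, department, salary) AS (
    VALUES ('Alice', 'HR',  50000),
           ('Bob',   'HR',  60000),
           ('Carol', 'IT',  70000),
           ('Dave',  'IT',  90000),
           ('Erin',  'Ops', 40000)
)
SELECT department,                                  -- select list, computed per group
       SUM(salary) AS total_salary,
       RANK() OVER (ORDER BY SUM(salary) DESC)
           AS salary_rank                           -- window function over the grouped rows
FROM staff                                          -- source rows
WHERE salary > 0                                    -- row filter, before grouping
GROUP BY department                                 -- form the groups
HAVING COUNT(*) >= 1                                -- group filter, after aggregation
ORDER BY total_salary DESC                          -- sort the final rows
LIMIT 2 OFFSET 0;                                   -- keep the first two rows
-- Returns IT (160000, rank 1) and HR (110000, rank 2).
```

Because ORDER BY runs after the select list is computed, it can refer to the output alias total_salary, whereas WHERE and HAVING cannot.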

Advanced Grouping Operations

To enable more intricate aggregations, PostgreSQL adds grouping sets to the fundamental GROUP BY feature. These consist of:

GROUPING SETS: In a single query, specify several grouping criteria. Aggregates are calculated for each of the distinct grouping sets that are used to group the data.

ROLLUP: For hierarchical data analysis, ROLLUP is a shorthand for GROUPING SETS that creates groupings for a specified list of expressions and all of their prefixes (e.g., totals by subcategory, category, and grand total).

CUBE: Another shorthand for GROUPING SETS that produces cross-tabulations by grouping on all possible subsets of a given list of expressions.

To help differentiate output rows from the various grouping levels, the GROUPING() function can be used in the SELECT list to report which GROUP BY expressions are excluded from the current grouping set.
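A minimal ROLLUP sketch shows how these extensions and GROUPING() fit together. The sales data and the grouping_mask alias are hypothetical.

```sql
-- Hypothetical sales rows used only for this illustration.
WITH sales(region, product, amount) AS (
    VALUES ('North', 'Widget', 100),
           ('North', 'Gadget', 200),
           ('South', 'Widget',  50)
)
SELECT region,
       product,
       SUM(amount)               AS total,
       GROUPING(region, product) AS grouping_mask
FROM sales
GROUP BY ROLLUP (region, product)
ORDER BY region NULLS LAST, product NULLS LAST;
-- ROLLUP (region, product) is shorthand for
--   GROUPING SETS ((region, product), (region), ()).
-- grouping_mask is 0 on detail rows, 1 on per-region subtotals
-- (product excluded), and 3 on the grand total (both excluded).
```

Replacing ROLLUP with CUBE in the same query would add the (product) grouping set, giving per-product totals across all regions as well.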

Performance Considerations

To determine the least expensive execution plan, PostgreSQL’s query planner relies on table statistics, which are gathered by ANALYZE and autovacuum. GROUP BY operations can be carried out with sort-based or hash-based aggregation. Hash-based aggregation tends to be chosen when the estimated number of distinct groups is small enough for the hash table to fit in memory, while sort-based aggregation is typically utilised for very high numbers of distinct groups or when the input is already sorted. Additionally, PostgreSQL supports parallel aggregation, which can greatly improve efficiency for big data sets by having multiple worker processes carry out partial aggregations that are subsequently combined by a leader process.
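Which strategy the planner picks can be inspected with EXPLAIN. The sketch below builds a hypothetical scratch table, group_demo; the plan actually shown depends on the PostgreSQL version, configuration, and statistics, so no specific plan is promised here.

```sql
-- Hypothetical scratch table with 100000 rows spread over 10 buckets.
CREATE TEMP TABLE group_demo AS
SELECT (g % 10) AS bucket, g AS val
FROM generate_series(1, 100000) AS g;

-- Refresh the statistics the planner relies on.
ANALYZE group_demo;

-- Look for a HashAggregate node (hash-based) or a GroupAggregate node
-- fed by a Sort (sort-based) in the output.
EXPLAIN
SELECT bucket, COUNT(*) AS row_count
FROM group_demo
GROUP BY bucket;

-- For experimentation only: discourage hash aggregation in this session
-- and compare the plan.
SET enable_hashagg = off;
EXPLAIN
SELECT bucket, COUNT(*) AS row_count
FROM group_demo
GROUP BY bucket;
RESET enable_hashagg;
```

Temporary tables are not scanned by parallel workers, so a parallel plan (with Partial and Finalize aggregate nodes under a Gather) would only be expected on a sufficiently large regular table.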

GROUP BY queries are easy to execute with DBeaver’s Grouping panel. Select aggregation functions like COUNT, SUM, AVG, MIN, and MAX and drag and drop columns to group. It also allows custom aliases and functions.

Conclusion

The GROUP BY clause in PostgreSQL collects rows with equal values in the specified columns into groups, so that data can be summarised with aggregate functions like COUNT, SUM, AVG, MIN, MAX, and STRING_AGG, and with statistical aggregates like REGR_SLOPE. The HAVING clause filters the aggregated groups, while the WHERE clause filters rows before grouping. Without GROUP BY, aggregate functions treat the entire table as a single group.

GROUPING SETS, ROLLUP, and CUBE support complex multidimensional analysis, and PostgreSQL’s optimiser automatically chooses between execution strategies such as hash or sort aggregation, with parallel processing available for huge datasets. Developers can construct precise, high-performance queries by understanding the logical query order: FROM, WHERE, GROUP BY, HAVING, SELECT list evaluation (including window functions), DISTINCT, ORDER BY, OFFSET, and LIMIT. Mastering GROUP BY and its extensions is therefore vital for informative summaries, hierarchical analysis, and efficient data processing.
