Which is the best Collation For MySQL? With Code Example

Collation For MySQL

In MySQL, a collation is a set of rules that defines how character strings are sorted and compared. It specifies the order of characters, their equivalence, and how they should be treated in comparisons, such as whether uppercase and lowercase letters are considered the same.

Relationship with Character Sets

Character sets and collations are inherently connected. A character set specifies the collection of characters that can be stored, while a collation specifies the rules for comparing and sorting the characters in that character set. Every character set has at least one collation, and many have several.

For instance, utf8mb4, which supports all Unicode characters, including emojis and text in many languages, has several collations, such as utf8mb4_unicode_ci and utf8mb4_bin. Each collation sorts and compares utf8mb4 strings according to its own rules. Older MySQL versions commonly used utf8 (an alias for utf8mb3), a subset of utf8mb4 that stores at most three bytes per character and therefore cannot hold every Unicode character. utf8mb4 is recommended for TEXT or VARCHAR columns that hold multilingual text.
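To see which collations a given server actually offers for utf8mb4, the following statements can be run as a quick check. The output differs between MySQL versions, so no particular result is shown here.

```sql
-- List every collation available for the utf8mb4 character set
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Show the character set's default collation and its maximum bytes per character
SHOW CHARACTER SET LIKE 'utf8mb4';
```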

Properties of Collations

Collations define several characteristics that matter when working with strings:

Case Sensitivity (CI/CS): Case-insensitive collations, such as utf8mb4_unicode_ci, treat ‘a’ and ‘A’ as identical for sorting and comparison. Case-sensitive collations, normally ending in _cs or _bin, distinguish ‘a’ from ‘A’.
Accent Sensitivity (AI/AS): Accent-insensitive collations (_ai) treat ‘a’ and ‘á’ as equal, while accent-sensitive collations (_as) treat them as different characters.
Binary Collations (BIN): Collations that end in _bin, such as utf8mb4_bin, base comparisons on the characters’ numeric byte values. The byte values of ‘A’ and ‘a’ are different, and accented letters differ from their unaccented counterparts by byte value, so binary collations are both case-sensitive and accent-sensitive. You can also force a case-sensitive sort on a column that has a case-insensitive collation by using BINARY with ORDER BY.
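The properties above can be checked directly by comparing string literals under an explicit collation. This is a minimal sketch that assumes the client sends UTF-8 text (hence the SET NAMES statement); the last query also assumes MySQL 8.0 or later, where the utf8mb4_0900_as_ci collation is available.

```sql
-- Assumption: the client terminal or driver sends UTF-8 encoded text
SET NAMES utf8mb4;

SELECT
    'a' COLLATE utf8mb4_unicode_ci = 'A' AS ci_case,    -- 1: case is ignored
    'a' COLLATE utf8mb4_unicode_ci = 'á' AS ci_accent,  -- 1: accent is ignored
    'a' COLLATE utf8mb4_bin = 'A'        AS bin_case,   -- 0: different code values
    'a' COLLATE utf8mb4_bin = 'á'        AS bin_accent; -- 0: different code values

-- Assumption: MySQL 8.0+ (accent-sensitive, case-insensitive collation)
SELECT
    'a' COLLATE utf8mb4_0900_as_ci = 'A' AS as_ci_case,   -- 1: case is ignored
    'a' COLLATE utf8mb4_0900_as_ci = 'á' AS as_ci_accent; -- 0: accent is significant
```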

Levels of Definition

Multiple levels of character sets and collations can be established in MySQL, and unless specifically overridden, higher-level settings become defaults for lower levels.

Server Level: The default collation and character set for the whole MySQL server.
Database Level: Unless otherwise specified, a new database inherits the server’s default collation and character set. By default, all tables created in this database then inherit these settings.
Table Level: Upon creation, a table takes on the default collation and character set of its database. These defaults can also be overridden for specific tables.
Column Level: Table columns can have their own character set and collation. This level gives the most precise control.

Sorting and comparisons (e.g., using ORDER BY, GROUP BY, and WHERE clauses like LIKE and REGEXP) are carried out on data in a column according to the particular collation set for that column.
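The inheritance chain described above might look like the following sketch. The database, table, and column names (levels_demo, accounts, display_name, api_token) are made up for illustration.

```sql
-- Server level: inspect the current defaults
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- Database level: override the server defaults for one database
CREATE DATABASE IF NOT EXISTS levels_demo
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE levels_demo;

-- Table level: override the database default; one column overrides the table default
CREATE TABLE IF NOT EXISTS accounts (
    id INT AUTO_INCREMENT PRIMARY KEY,
    display_name VARCHAR(100) NOT NULL,                 -- inherits the table default
    api_token VARCHAR(64) COLLATE utf8mb4_bin NOT NULL  -- column-level override
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- Verify which collation each column ended up with
-- (display_name: utf8mb4_general_ci, api_token: utf8mb4_bin, id: NULL)
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'levels_demo' AND TABLE_NAME = 'accounts';
```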

Impact on Operations

The selected collation directly affects the results of SQL queries:

ORDER BY Clause: Determines the order of sorted results. For example, latin1_german1_ci and latin1_german2_ci may order ‘Müller’ differently.
WHERE Clause and Comparisons: Determines how conditions such as LIKE, REGEXP, <, >, and = are evaluated. WHERE name = ‘john’ matches ‘John’, ‘JOHN’, or ‘john’ in a case-insensitive collation. A case-sensitive collation will only match ‘john’. The BINARY keyword in a WHERE clause forces a case-sensitive, byte-by-byte comparison regardless of the column collation.
GROUP BY Clause: Affects how distinct groups are formed. Both ‘Apple’ and ‘apple’ belong to the same group if a column is grouped using a case-insensitive collation.
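A small sketch of these effects, including a per-query override with the COLLATE clause, is shown here. It assumes a default database is already selected, and the temporary table people_demo is a made-up name used only for this illustration.

```sql
-- Hypothetical scratch table with a case-insensitive column
CREATE TEMPORARY TABLE people_demo (
    name VARCHAR(50) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

INSERT INTO people_demo (name) VALUES ('John'), ('JOHN'), ('john');

-- Column collation is case-insensitive: all three rows match
SELECT name FROM people_demo WHERE name = 'john';

-- Per-query override with COLLATE: only 'john' matches
SELECT name FROM people_demo WHERE name COLLATE utf8mb4_bin = 'john';

-- BINARY forces a byte-by-byte comparison: only 'john' matches
-- (the BINARY operator is deprecated in recent MySQL 8.0 releases; CAST(name AS BINARY) is the alternative)
SELECT name FROM people_demo WHERE BINARY name = 'john';

-- Grouping with the column's collation: one group containing 3 rows
SELECT COUNT(*) AS rows_in_group FROM people_demo GROUP BY name;

-- Grouping with a binary collation: three groups of 1 row each
SELECT COUNT(*) AS rows_in_group FROM people_demo GROUP BY name COLLATE utf8mb4_bin;
```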

Choosing a Collation

Due to its complete character support and widely accepted sorting rules, utf8mb4 with a Unicode-based collation like utf8mb4_unicode_ci or utf8mb4_general_ci is a good default for generic use, especially in multilingual applications. Use utf8mb4_bin for exact byte-level comparisons, such as hash values or codes. The best option depends on your data and the languages it contains.
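Applying such a choice to an existing table could look like the sketch below. The table legacy_notes and its columns are hypothetical; on real data it is worth trying the conversion on a copy first, since it rewrites the table.

```sql
-- Hypothetical legacy table that still uses latin1
CREATE TABLE IF NOT EXISTS legacy_notes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    body VARCHAR(255) NOT NULL,
    checksum CHAR(64) NOT NULL
) DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

-- Convert the table default and all character columns to utf8mb4
ALTER TABLE legacy_notes
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Give the hash column an exact, byte-level comparison
ALTER TABLE legacy_notes
    MODIFY checksum CHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

-- Inspect the resulting per-column collations (see the Collation column)
SHOW FULL COLUMNS FROM legacy_notes;
```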

Code Example: Demonstrating Collation Impact

To demonstrate the impact of collations, in particular case sensitivity, let’s look at a straightforward example.

First, create a database and a table with two VARCHAR columns, each with a different collation:

-- Create a database with default UTF8MB4 character set and a common case-insensitive collation
CREATE DATABASE IF NOT EXISTS Collation_Demo
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

USE Collation_Demo;

-- Create a table with two columns, each having a different collation
CREATE TABLE IF NOT EXISTS Products (
    ProductID INT AUTO_INCREMENT PRIMARY KEY,
    ProductName_CI VARCHAR(50) COLLATE utf8mb4_unicode_ci NOT NULL, -- Case-insensitive collation
    ProductName_CS VARCHAR(50) COLLATE utf8mb4_bin NOT NULL        -- Case-sensitive (binary) collation
);

-- Insert some sample data
INSERT INTO Products (ProductName_CI, ProductName_CS) VALUES
('Apple', 'Apple'),
('apple', 'apple'),
('Banana', 'Banana'),
('banana', 'banana'),
('Zebra', 'Zebra');

Now, let’s query the Products table and observe the results when sorting and filtering using the different collations.

-- 1. Sorting (ORDER BY) with case-insensitive collation (ProductName_CI)
SELECT ProductID, ProductName_CI
FROM Products
ORDER BY ProductName_CI;
-- Expected output: 'Apple' and 'apple' sort next to each other because they compare as equal; their relative order is not guaranteed.
-- Example result:
-- ProductID | ProductName_CI
-- ----------|--------------
-- 1         | Apple
-- 2         | apple
-- 3         | Banana
-- 4         | banana
-- 5         | Zebra

-- 2. Sorting (ORDER BY) with case-sensitive collation (ProductName_CS)
SELECT ProductID, ProductName_CS
FROM Products
ORDER BY ProductName_CS;
-- Expected output: names starting with an uppercase letter sort before names starting with a lowercase letter, because uppercase letters have lower code values in the binary collation.
-- Example result:
-- ProductID | ProductName_CS
-- ----------|--------------
-- 1         | Apple
-- 3         | Banana
-- 5         | Zebra
-- 2         | apple
-- 4         | banana

-- 3. Filtering (WHERE) with case-insensitive collation (ProductName_CI)
SELECT ProductID, ProductName_CI
FROM Products
WHERE ProductName_CI = 'Apple';
-- Expected output: Both 'Apple' and 'apple' rows are returned because the comparison is case-insensitive.
-- Example result:
-- ProductID | ProductName_CI
-- ----------|--------------
-- 1         | Apple
-- 2         | apple

-- 4. Filtering (WHERE) with case-sensitive collation (ProductName_CS)
SELECT ProductID, ProductName_CS
FROM Products
WHERE ProductName_CS = 'Apple';
-- Expected output: Only the 'Apple' row is returned because the comparison is case-sensitive.
-- Example result:
-- ProductID | ProductName_CS
-- ----------|--------------
-- 1         | Apple
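
The same Products table can also illustrate the effect on GROUP BY. This sketch only counts the rows per group, since which spelling represents a case-insensitive group is not guaranteed.

```sql
-- 5. Grouping (GROUP BY) with case-insensitive collation (ProductName_CI)
SELECT COUNT(*) AS RowsInGroup
FROM Products
GROUP BY ProductName_CI;
-- Three groups: the apple rows (2), the banana rows (2), and 'Zebra' (1)

-- 6. Grouping (GROUP BY) with case-sensitive collation (ProductName_CS)
SELECT COUNT(*) AS RowsInGroup
FROM Products
GROUP BY ProductName_CS;
-- Five groups of one row each, because 'Apple' and 'apple' are distinct values
```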

This example shows how MySQL collations affect string operations like sorting and comparison, emphasizing the necessity of data-driven collation selection.
