Text and Pattern matching in Postgres

Text and Pattern matching in Postgres

Text matching is a pretty common problem while working with SQL. LIKE, SIMILAR TO and POSIX are pretty common querying use cases. But, taking advantage of indexes when using these is something to pay attention to. A simple b-tree index on a text/varchar column doesn't utilize the index as evident below.

testdb=#  \d test_table_2
                                        Table "public.test_table_2"
   Column    |          Type          | Collation | Nullable |                   Default
-------------+------------------------+-----------+----------+---------------------------------------------
 id          | integer                |           | not null | nextval('your_table_name_id_seq'::regclass)
 domain_name | character varying(255) |           |          |
Indexes:
    "your_table_name_pkey" PRIMARY KEY, btree (id)
    "test_index" btree (domain_name)
testdb=#
testdb=#
testdb=#  explain analyze select * from test_table_2 where domain_name like 'abcd%';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..13576.05 rows=100 width=25) (actual time=78.204..80.452 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test_table_2  (cost=0.00..12566.05 rows=42 width=25) (actual time=50.570..50.571 rows=0 loops=3)
         Filter: ((domain_name)::text ~~ 'abcd%'::text)
         Rows Removed by Filter: 333443
 Planning Time: 2.578 ms
 Execution Time: 80.777 ms
(8 rows)

This is because most databases are not initialized with locale C and are unable to utilize the default operator class for pattern matching.

If you do not already know, you need to be clear about 2 terms - Locale and Operator class.

Locale:
Locale refers to the specific regional or cultural settings used by a computer system, including language, date and time formats, and collation rules. Each locale has its own set of rules for sorting and comparing characters.
You can check the locale that the database uses via the below.

testdb=# SHOW LC_COLLATE;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

Operator class:
An operator class in PostgreSQL is a way to define a custom sorting and comparison behaviour for a specific data type, such as text, integer, or double. Operator classes are used to create custom functions that can be used as operators in SQL queries, allowing you to extend the functionality of the default operators.

If you wish to read more, here is a very beautiful explanation on operator classes. So, basically if your DB is initialized with locale C, you can leverage the default operator classes for ordinary comparisons as well as pattern matching expressions. But, if your DB is initialized with some other locale, you can leverage the default operator classes for ordinary comparisons but not for pattern matching expressions. For pattern matching expressions, you can use some inbuilt operator classes beside the default operator class. Here is an excerpt from Postgres documentation:

*The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard “C” locale. As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);

Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, however.) It is possible to create multiple indexes on the same column with different operator classes. If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.*

So, now when you create an index with the specific operator class like below, it is evident that index scan kicks in.

testdb=# CREATE INDEX test_index ON test_table_2 (domain_name varchar_pattern_ops);
CREATE INDEX
testdb=#
testdb=# \d test_table_2
                                        Table "public.test_table_2"
   Column    |          Type          | Collation | Nullable |                   Default
-------------+------------------------+-----------+----------+---------------------------------------------
 id          | integer                |           | not null | nextval('your_table_name_id_seq'::regclass)
 domain_name | character varying(255) |           |          |
Indexes:
    "your_table_name_pkey" PRIMARY KEY, btree (id)
    "test_index" btree (domain_name varchar_pattern_ops)

testdb=#
testdb=#  explain analyze select * from test_table_2 where domain_name like 'abcd%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_index on test_table_2  (cost=0.42..8.45 rows=100 width=25) (actual time=0.488..0.489 rows=0 loops=1)
   Index Cond: (((domain_name)::text ~>=~ 'abcd'::text) AND ((domain_name)::text ~<~ 'abce'::text))
   Filter: ((domain_name)::text ~~ 'abcd%'::text)
 Planning Time: 86.148 ms
 Execution Time: 0.517 ms
(5 rows)

So here, in the context of text_pattern_ops, the operator class is used to define custom comparison and sorting behavior for the text data type when performing pattern-matching queries using the LIKE operator or POSIX regular expressions.

So far so good. When the query is of the form LIKE 'abcd%', there is no problem since index utilization happens without issues as evident above.

But, what if the query is of the form LIKE '%abcd' for the same index as above?

testdb=#  explain analyze select * from test_table_2 where domain_name like '%abcd';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..13576.05 rows=100 width=25) (actual time=69.966..71.298 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test_table_2  (cost=0.00..12566.05 rows=42 width=25) (actual time=46.064..46.064 rows=0 loops=3)
         Filter: ((domain_name)::text ~~ '%abcd'::text)
         Rows Removed by Filter: 333443
 Planning Time: 0.207 ms
 Execution Time: 71.753 ms
(8 rows)

Clearly, index scan is not done and we see a sequential scan happening.

Why is this?

This is because

B-tree indexes are well-suited for handling ordered data and range queries, but they have limitations when it comes to wildcard matching at the beginning of the search pattern. In B-tree indexes, the index entries are sorted in a specific order, allowing efficient range queries by traversing the tree structure. However, wildcard matching at the beginning of the pattern requires scanning the entire index because the index entries are ordered based on their values from left to right.

When you use a LIKE query with a wildcard character at the end, such as "LIKE 'abcd%'", the B-tree index can efficiently determine the range of values that satisfy the query condition, starting from the specified prefix. It can traverse the index entries in the sorted order and efficiently identify the matching values.

On the other hand, when you use a LIKE query with a wildcard character at the beginning, such as "LIKE '%abcd'", the B-tree index cannot effectively leverage its sorting order to narrow down the search space. As a result, it needs to scan the entire index to find the matching values, which can be less efficient compared to a sequential scan of the table itself.

To put it simply, if the wildcard character is at the end, I know the starting characters so I can traverse the tree from the root till the point where the wildcard character appears so I can use the index.

But, if the wildcard character is at the beginning, how do I traverse the tree since I do not know which characters might come in the beginning and how many characters might come, so a tree traversal is not possible and hence index is not used.

To deal with such cases, we can use another type of indexes called trigram indexes.

Trigram indexes work by breaking up text in trigrams or 3-letter sequences and the actual indexes themselves must be a GIN or GIST index.

GIST helps with more cases than GIN, takes less index building time but more querying time.

GIN takes more index building time but less querying time.

So, which one to use is completely dependent on your use case. For most simple cases, GIN outperforms GIST by a large margin.

Also, before you use them, you will need to enable pg_trgm extension in your DB.

Here is a very good explanation on Trigram Indexes. Postgres' documentation is pretty lucid as well.

testdb=# create index test_index on test_table_2 using gin (domain_name gin_trgm_ops);
CREATE INDEX
testdb=#
testdb=#
testdb=# \d test_table_2
                                        Table "public.test_table_2"
   Column    |          Type          | Collation | Nullable |                   Default
-------------+------------------------+-----------+----------+---------------------------------------------
 id          | integer                |           | not null | nextval('your_table_name_id_seq'::regclass)
 domain_name | character varying(255) |           |          |
Indexes:
    "your_table_name_pkey" PRIMARY KEY, btree (id)
    "test_index" gin (domain_name gin_trgm_ops)

testdb=#
testdb=#
testdb=# explain analyze select * from test_table_2 where domain_name like 'abcd%';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_table_2  (cost=68.78..435.05 rows=100 width=25) (actual time=2.136..2.139 rows=0 loops=1)
   Recheck Cond: ((domain_name)::text ~~ 'abcd%'::text)
   Rows Removed by Index Recheck: 4
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on test_index  (cost=0.00..68.75 rows=100 width=0) (actual time=1.679..1.681 rows=4 loops=1)
         Index Cond: ((domain_name)::text ~~ 'abcd%'::text)
 Planning Time: 3.563 ms
 Execution Time: 2.262 ms
(8 rows)

testdb=#
testdb=#
testdb=# explain analyze select * from test_table_2 where domain_name like '%abcd';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_table_2  (cost=52.78..419.05 rows=100 width=25) (actual time=1.013..1.014 rows=0 loops=1)
   Recheck Cond: ((domain_name)::text ~~ '%abcd'::text)
   Rows Removed by Index Recheck: 2
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on test_index  (cost=0.00..52.75 rows=100 width=0) (actual time=0.839..0.840 rows=2 loops=1)
         Index Cond: ((domain_name)::text ~~ '%abcd'::text)
 Planning Time: 0.171 ms
 Execution Time: 1.056 ms
(8 rows)

After adding trigram index, it is clear that index is utilized for scanning in both the cases - wildcard at the beginning and wildcard towards the end. In comparison with a normal B-Tree index with xxx_pattern_ops operator class, for LIKE 'abcd%' type queries, the former does perform better but then the former has limited use cases.Trigram indexes give you an index advantage in a wide variety of cases on the other hand and come in handy in some really tricky situations. So, choose wisely and see if you can leverage both for your use case.

Happy indexing!