Обсуждение: Re: Allow ILIKE forward matching to use btree index

Поиск
Список
Период
Сортировка

Re: Allow ILIKE forward matching to use btree index

От
Jeff Davis
Дата:
On Wed, 2025-01-15 at 01:40 +0900, Yugo NAGATA wrote:
> > > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND
> > > t < '123fop')"
> > > and index scan can be used for this condition. On the other hand,
> > > "t ~~* '123foo'"
> > > cannot be converted and sequential scan is used.
> > >
> > > Even in this case, we can use a bitmap index scan for the
> > > condition
> > > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')"
> > > followed by
> > > recheck by the original condition "t ~~* '123foo'", and this
> > > could be faster
> > > than seqscan.

In theory, there could be many OR clauses:

   (t >= '123foo' AND t < '123fop') OR
   (t >= '123foo' AND t < '123fop') OR
   (t >= '123foo' AND t < '123fop') OR
   (t >= '123foo' AND t < '123fop') OR
   (t >= '123foo' AND t < '123fop') OR
   (t >= '123foo' AND t < '123fop') OR
> > > The patch 0001 enables get_index_clause_from_support to receive
> > > OR clauses
> > > from support functions and use them to build bitmap index scan
> > > later. OR clauses
> > > returned by support functions are collected in the new argument
> > > 'orclauses" of
> > > match_restriction_clauses_to_index(), and they are passed to
> > > generate_bitmap_or_paths() later to build bitmap scan paths.

If I understand correctly, this

> > > The patch 0002 modifies the support function to return OR clauses
> > > as an example
> > > above when ILIKE's pattern has case-varying characters in forward
> > > matching. The
> > > OR clauses contains two index conditions for the upper and the
> > > lower letter of
> > > the first case-varying character in the pattern respectively.
> > > Although the
> > > subsequent characters are not considered in the index scans, it
> > > could enough be
> > > faster then sequent scan.
> > >
> > > Here is an example.
> > >
> > > 1. Create a table with random text records
> > >
> > > =# CREATE TABLE tbl (t text);
> > > =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x
> > > END
> > >                   FROM (SELECT i, md5(i::text) x FROM
> > > generate_series(1,5000000) i);
> > >
> > > 2. Create an index
> > > =# CREATE INDEX ON tbl (t text_pattern_ops);
> > >
> > > 3. Before applying patch: Seq Scan was selected. It takes about 4
> > > sec.
> > >
> > > =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
> >
> > I am sorry, the example in my previous main was wrong. I showed the
> > plan
> > with enable_index_scan = off. Indead, if the pattern starts with
> > case-insensitive characters like '12', an index scan can be used
> > even with
> > ILIKE.
> >
> > postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> >                                                        QUERY
> > PLAN                                                       
> > -------------------------------------------------------------------
> > ------------------------------------------------------
> >  Gather  (cost=1000.00..69129.61 rows=500 width=33) (actual
> > time=36.317..4034.770 rows=1188 loops=1)
> >    Workers Planned: 2
> >    Workers Launched: 2
> >    Buffers: shared hit=99 read=41939
> >    ->  Parallel Seq Scan on tbl  (cost=0.00..68079.61 rows=208
> > width=33) (actual time=19.908..4023.668 rows=396 loops=3)
> >          Filter: (t ~~* 'abc%'::text)
> >          Rows Removed by Filter: 1666271
> >          Buffers: shared hit=99 read=41939
> >  Planning Time: 0.726 ms
> >  Execution Time: 4035.101 ms
> > (10 rows)
> >
> > 4. After applying patch: Bitmap Index Scan was selected.
> >
> > postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> >                                                                    
> >        QUERY
> > PLAN                                                               
> >           
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ------------------------
> >  Bitmap Heap Scan on tbl  (cost=12563.66..58314.53 rows=500
> > width=33) (actual time=156.485..1266.789 rows=1188 loops=1)
> >    Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t
> > ~~* 'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text) AND
> > (t ~~* 'abc%'::text)))
> >    Filter: (t ~~* 'abc%'::text)
> >    Rows Removed by Filter: 311473
> >    Heap Blocks: exact=42010
> >    Buffers: shared hit=1 read=44600
> >    ->  BitmapOr  (cost=12563.66..12563.66 rows=297029 width=0)
> > (actual time=136.979..136.980 rows=0 loops=1)
> >          Buffers: shared hit=1 read=2590
> >          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71
> > rows=148515 width=0) (actual time=80.548..80.549 rows=158375
> > loops=1)
> >                Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~
> > 'B'::text))
> >                Buffers: shared read=1272
> >          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71
> > rows=148515 width=0) (actual time=56.427..56.427 rows=157042
> > loops=1)
> >                Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~
> > 'b'::text))
> >                Buffers: shared hit=1 read=1318
> >  Planning Time: 0.592 ms
> >  Execution Time: 1267.162 ms
> > (16 rows)
> >
> >  
> > > What do you think about it?
> > >
> > > I think another application of OR-clause returning support
> > > function might be
> > > allowing to use an index for NOT LIKE (!~~) expression  because,
> > > for example,
> > > "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >=
> > > '123fooz')".
> > > (The second condition is a bit loose but this would be safe and
> > > not problem
> > > since tuples are filtered by the original condition after the
> > > index scan.)
> > > However, it would not very useful unless the distribution is much
> > > skew so that
> > > NOT LIKE expression's selectivity is enough small.
>
> I added a new patch 0003 that enables ILIKE forward matching to use a
> SP-Gist index.
> Similar to 0002, this generates BitmapOr Index Scan for two index
> clauses that use
> "^@" operator for upper letter and lower letter pattern respectively.
>
> * Before applying the patch:
>
> postgres=# explain analyze select* from tbl where t ilike 'abc%';
>                                                       QUERY
> PLAN                                                     
> ---------------------------------------------------------------------
> -------------------------------------------------
>  Gather  (cost=1000.00..18899.52 rows=101 width=33) (actual
> time=41.799..930.382 rows=253 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=96 read=12533
>    ->  Parallel Seq Scan on tbl  (cost=0.00..17889.42 rows=42
> width=33) (actual time=26.041..917.570 rows=84 loops=3)
>          Filter: (t ~~* 'abc%'::text)
>          Rows Removed by Filter: 336582
>          Buffers: shared hit=96 read=12533
>  Planning Time: 0.690 ms
>  Execution Time: 930.545 ms
> (10 rows)
>
>
> * After applying the patch:
>
> postgres=# explain analyze select* from tbl where t ilike 'abc%';
>                                                              QUERY
> PLAN                                                             
> ---------------------------------------------------------------------
> ----------------------------------------------------------------
>  Bitmap Heap Scan on tbl  (cost=3307.96..16702.11 rows=101 width=33)
> (actual time=69.449..215.636 rows=253 loops=1)
>    Recheck Cond: (((t ^@ 'A'::text) AND (t ~~* 'abc%'::text)) OR ((t
> ^@ 'a'::text) AND (t ~~* 'abc%'::text)))
>    Filter: (t ~~* 'abc%'::text)
>    Rows Removed by Filter: 62992
>    Heap Blocks: exact=12437
>    Buffers: shared hit=18529
>    ->  BitmapOr  (cost=3307.96..3307.96 rows=61212 width=0) (actual
> time=62.567..62.568 rows=0 loops=1)
>          Buffers: shared hit=6092
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96
> rows=30606 width=0) (actual time=41.893..41.893 rows=31536 loops=1)
>                Index Cond: (t ^@ 'A'::text)
>                Buffers: shared hit=2461
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96
> rows=30606 width=0) (actual time=20.671..20.671 rows=31709 loops=1)
>                Index Cond: (t ^@ 'a'::text)
>                Buffers: shared hit=3631
>  Planning Time: 1.414 ms
>  Execution Time: 215.903 ms
> (16 rows)
>
>


Re: Allow ILIKE forward matching to use btree index

От
Jeff Davis
Дата:
My apologies, I sent the previous email prematurely. Let me try again:

On Wed, 2025-01-15 at 14:34 -0800, Jeff Davis wrote:
> On Wed, 2025-01-15 at 01:40 +0900, Yugo NAGATA wrote:
> > > > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo'
> > > > AND
> > > > t < '123fop')"
> > > > and index scan can be used for this condition. On the other
> > > > hand,
> > > > "t ~~* '123foo'"
> > > > cannot be converted and sequential scan is used.
> > > >
> > > > Even in this case, we can use a bitmap index scan for the
> > > > condition
> > > > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')"
> > > > followed by
> > > > recheck by the original condition "t ~~* '123foo'", and this
> > > > could be faster
> > > > than seqscan.

In theory, there could be many OR clauses:

  (t >= '123foo' AND t < '123fop') OR
  (t >= '123Foo' AND t < '123Fop') OR
  (t >= '123fOo' AND t < '123fOp') OR
  (t >= '123FOo' AND t < '123FOp') OR
  ...

How should that be limited?

> > >
Regards,
    Jeff Davis




Re: Allow ILIKE forward matching to use btree index

От
Yugo NAGATA
Дата:
On Wed, 15 Jan 2025 14:40:19 -0800
Jeff Davis <pgsql@j-davis.com> wrote:

> My apologies, I sent the previous email prematurely. Let me try again:
> 
> On Wed, 2025-01-15 at 14:34 -0800, Jeff Davis wrote:
> > On Wed, 2025-01-15 at 01:40 +0900, Yugo NAGATA wrote:
> > > > > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo'
> > > > > AND
> > > > > t < '123fop')" 
> > > > > and index scan can be used for this condition. On the other
> > > > > hand,
> > > > > "t ~~* '123foo'"
> > > > > cannot be converted and sequential scan is used. 
> > > > > 
> > > > > Even in this case, we can use a bitmap index scan for the
> > > > > condition 
> > > > > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')"
> > > > > followed by
> > > > > recheck by the original condition "t ~~* '123foo'", and this
> > > > > could be faster
> > > > > than seqscan. 
> 
> In theory, there could be many OR clauses:
> 
>   (t >= '123foo' AND t < '123fop') OR
>   (t >= '123Foo' AND t < '123Fop') OR
>   (t >= '123fOo' AND t < '123fOp') OR
>   (t >= '123FOo' AND t < '123FOp') OR
>   ...
> 
> How should that be limited?

Instead of generating complete patterns considering every case-varying characters,
two clauses considering only the first case-varying character are generated.

For example, for the condition "t ILIKE '123foo%'", the generated condition is
"(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')".

Rows meeting "(t >= '123f' AND t < '123g')" includes those whose "t" start with
'123f', that is meeting the following;

>   (t >= '123foo' AND t < '123fop') OR
>   (t >= '123fOo' AND t < '123fOp') OR
>   ...

, and rows meeting "(t >= '123F' AND t < '123G')" includes those whose "t" start
with '123F', that is meeting the following;

>   (t >= '123Foo' AND t < '123Fop') OR
>   (t >= '123FOo' AND t < '123FOp') OR
>   ...

It is required that the second case-varying character and later are checked after
the index scan, and some rows are filtered, but it will be still faster than full
sequential scan.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Allow ILIKE forward matching to use btree index

От
Jeff Davis
Дата:
On Thu, 2025-01-16 at 14:53 +0900, Yugo NAGATA wrote:
> Instead of generating complete patterns considering every case-
> varying characters,
> two clauses considering only the first case-varying character are
> generated.

Did you consider other approaches that integrate more deeply into the
indexing infrastructure? This feels almost like a skip scan, which
Petter Geoghegan is working on. If you model the disjunctions as skips,
and provide the right API that the index AM can use, then there would
be no limit.

For example:

    start scanning at '123FOO'
    when you encounter '123FOP' skip to '123FOo'
    continue scanning
    when you encounter '123FOp' skip to '123FoO'
    ...

Also, I'm working on better Unicode support to handle multiple case
variants in patterns. For instance, the Greek letter Sigma has three
case variants (one upper and two lower). What's the right API so that
the index AM knows which case variants will sort first, so that the
skips don't go backward?

Regards,
    Jeff Davis