postgresql/src/test/regress/expected/gin.out

--
-- Test GIN indexes.
--
Avoid full scan of GIN indexes when possible

The strategy of GIN index scan is driven by opclass-specific extract_query method. This method that needed search mode is GIN_SEARCH_MODE_ALL. This mode means that matching tuple may contain none of extracted entries. Simple example is '!term' tsquery, which doesn't need any term to exist in matching tsvector.

In order to handle such scan key GIN calculates virtual entry, which contains all TIDs of all entries of attribute. In fact this is full scan of index attribute. And typically this is very slow, but allows to handle some queries correctly in GIN. However, current algorithm calculate such virtual entry for each GIN_SEARCH_MODE_ALL scan key even if they are multiple for the same attribute. This is clearly not optimal.

This commit improves the situation by introduction of "exclude only" scan keys. Such scan keys are not capable to return set of matching TIDs. Instead, they are capable only to filter TIDs produced by normal scan keys. Therefore, each attribute should contain at least one normal scan key, while rest of them may be "exclude only" if search mode is GIN_SEARCH_MODE_ALL.

The same optimization might be applied to the whole scan, not per-attribute. But that leads to NULL values elimination problem. There is trade-off between multiple possible ways to do this. We probably want to do this later using some cost-based decision algorithm.

Discussion: https://postgr.es/m/CAOBaU_YGP5-BEt5Cc0%3DzMve92vocPzD%2BXiZgiZs1kjY0cj%3DXBg%40mail.gmail.com
Author: Nikita Glukhov, Alexander Korotkov, Tom Lane, Julien Rouhaud
Reviewed-by: Julien Rouhaud, Tomas Vondra, Tom Lane
2020-01-17 23:11:39 +01:00
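The "exclude only" idea described in the commit message above can be illustrated with a small conceptual model. This is only a sketch in Python, not PostgreSQL's C implementation: the function `scan_attribute`, the `TID` alias, and the sample data are all hypothetical names invented for illustration. It assumes normal scan keys can each produce a set of matching TIDs, while exclude-only keys can only say whether a given TID should be kept.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical stand-in for a heap tuple identifier: (block, offset).
TID = Tuple[int, int]


def scan_attribute(
    normal_keys: Iterable[Set[TID]],
    exclude_only_keys: Iterable[Callable[[TID], bool]],
) -> Set[TID]:
    """Intersect the TID sets of normal keys, then filter with exclude-only keys."""
    normal = list(normal_keys)
    if not normal:
        # Mirrors the rule that each attribute needs at least one normal key.
        raise ValueError("each attribute needs at least one normal scan key")
    filters = list(exclude_only_keys)
    candidates = set.intersection(*normal)
    # Exclude-only keys never generate TIDs; they only reject candidates.
    return {tid for tid in candidates if all(keep(tid) for keep in filters)}


# Assumed sample data: two normal keys and one exclude-only key.
has_term_a = {(0, 1), (0, 2), (1, 1)}
has_term_b = {(0, 2), (1, 1), (2, 5)}
contains_term_c = {(1, 1)}
result = scan_attribute(
    [has_term_a, has_term_b],
    [lambda tid: tid not in contains_term_c],
)
```

In this model no full scan of the attribute is needed for the exclude-only key, because it is evaluated only against TIDs that the normal keys already produced.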
-- There are other tests to test different GIN opclasses. This is for testing
-- GIN itself.
-- Create and populate a test table with a GIN index.
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i)
with (fastupdate = on, gin_pending_list_limit = 4096);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g;
select gin_clean_pending_list('gin_test_idx')>10 as many; -- flush the fastupdate buffers
many
------
t
(1 row)
insert into gin_test_tbl select array[3, 1, g] from generate_series(1, 1000) g;
vacuum gin_test_tbl; -- flush the fastupdate buffers
select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
gin_clean_pending_list
------------------------
0
(1 row)
-- Test vacuuming
delete from gin_test_tbl where i @> array[2];
vacuum gin_test_tbl;
-- Disable fastupdate, and do more insertions. With fastupdate enabled, most
-- insertions (by flushing the list pages) cause page splits. Without
-- fastupdate, we get more churn in the GIN data leaf pages, and exercise the
-- recompression codepaths.
alter index gin_test_idx set (fastupdate = off);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 1000) g;
insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g;
delete from gin_test_tbl where i @> array[2];
vacuum gin_test_tbl;
Fix bugs in gin_fuzzy_search_limit processing.

entryGetItem()'s three code paths each contained bugs associated with filtering the entries for gin_fuzzy_search_limit.

The posting-tree path failed to advance "advancePast" after having decided to filter an item. If we ran out of items on the current page and needed to advance to the next, what would actually happen is that entryLoadMoreItems() would re-load the same page. Eventually, the random dropItem() test would accept one of the same items it'd previously rejected, and we'd move on --- but it could take awhile with small gin_fuzzy_search_limit. To add insult to injury, this case would inevitably cause entryLoadMoreItems() to decide it needed to re-descend from the root, making things even slower.

The posting-list path failed to implement gin_fuzzy_search_limit filtering at all, so that all entries in the posting list would be returned.

The bitmap-result path used a "gotitem" variable that it failed to update in the one place where it'd actually make a difference, ie at the one "continue" statement. I think this was unreachable in practice, because if we'd looped around then it shouldn't be the case that the entries on the new page are before advancePast. Still, the "gotitem" variable was contributing nothing to either clarity or correctness, so get rid of it.

Refactor all three loops so that the termination conditions are more alike and less unreadable.

The code coverage report showed that we had no coverage at all for the re-descend-from-root code path in entryLoadMoreItems(), which seems like a very bad thing, so add a test case that exercises it. We also had exactly no coverage for gin_fuzzy_search_limit, so add a simplistic test case that at least hits those code paths a little bit.

Back-patch to all supported branches.

Adé Heyward and Tom Lane

Discussion: https://postgr.es/m/CAEknJCdS-dE1Heddptm7ay2xTbSeADbkaQ8bU2AXRCVC2LdtKQ@mail.gmail.com
2020-04-03 19:15:30 +02:00
-- Test for "rare && frequent" searches
explain (costs off)
select count(*) from gin_test_tbl where i @> array[1, 999];
QUERY PLAN
-------------------------------------------------------
Aggregate
-> Bitmap Heap Scan on gin_test_tbl
Recheck Cond: (i @> '{1,999}'::integer[])
-> Bitmap Index Scan on gin_test_idx
Index Cond: (i @> '{1,999}'::integer[])
(5 rows)
select count(*) from gin_test_tbl where i @> array[1, 999];
count
-------
3
(1 row)
-- Very weak test for gin_fuzzy_search_limit
set gin_fuzzy_search_limit = 1000;
explain (costs off)
select count(*) > 0 as ok from gin_test_tbl where i @> array[1];
QUERY PLAN
---------------------------------------------------
Aggregate
-> Bitmap Heap Scan on gin_test_tbl
Recheck Cond: (i @> '{1}'::integer[])
-> Bitmap Index Scan on gin_test_idx
Index Cond: (i @> '{1}'::integer[])
(5 rows)
select count(*) > 0 as ok from gin_test_tbl where i @> array[1];
ok
----
t
(1 row)
reset gin_fuzzy_search_limit;
-- Test optimization of empty queries
create temp table t_gin_test_tbl(i int4[], j int4[]);
create index on t_gin_test_tbl using gin (i, j);
insert into t_gin_test_tbl
values
(null, null),
('{}', null),
('{1}', null),
('{1,2}', null),
(null, '{}'),
(null, '{10}'),
('{1,2}', '{10}'),
('{2}', '{10}'),
('{1,3}', '{}'),
('{1,1}', '{10}');
set enable_seqscan = off;
explain (costs off)
select * from t_gin_test_tbl where array[0] <@ i;
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on t_gin_test_tbl
Recheck Cond: ('{0}'::integer[] <@ i)
-> Bitmap Index Scan on t_gin_test_tbl_i_j_idx
Index Cond: (i @> '{0}'::integer[])
(4 rows)
select * from t_gin_test_tbl where array[0] <@ i;
i | j
---+---
(0 rows)
select * from t_gin_test_tbl where array[0] <@ i and '{}'::int4[] <@ j;
i | j
---+---
(0 rows)
explain (costs off)
select * from t_gin_test_tbl where i @> '{}';
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on t_gin_test_tbl
Recheck Cond: (i @> '{}'::integer[])
-> Bitmap Index Scan on t_gin_test_tbl_i_j_idx
Index Cond: (i @> '{}'::integer[])
(4 rows)
select * from t_gin_test_tbl where i @> '{}';
i | j
-------+------
{} |
{1} |
{1,2} |
{1,2} | {10}
{2} | {10}
{1,3} | {}
{1,1} | {10}
(7 rows)
create function explain_query_json(query_sql text)
returns table (explain_line json)
language plpgsql as
$$
begin
set enable_seqscan = off;
set enable_bitmapscan = on;
return query execute 'EXPLAIN (ANALYZE, FORMAT json) ' || query_sql;
end;
$$;
create function execute_text_query_index(query_sql text)
returns setof text
language plpgsql
as
$$
begin
set enable_seqscan = off;
set enable_bitmapscan = on;
return query execute query_sql;
end;
$$;
create function execute_text_query_heap(query_sql text)
returns setof text
language plpgsql
as
$$
begin
set enable_seqscan = on;
set enable_bitmapscan = off;
return query execute query_sql;
end;
$$;
-- check number of rows returned by index and removed by recheck
select
query,
js->0->'Plan'->'Plans'->0->'Actual Rows' as "return by index",
js->0->'Plan'->'Rows Removed by Index Recheck' as "removed by recheck",
(res_index = res_heap) as "match"
from
(values
($$ i @> '{}' $$),
($$ j @> '{}' $$),
($$ i @> '{}' and j @> '{}' $$),
($$ i @> '{1}' $$),
($$ i @> '{1}' and j @> '{}' $$),
($$ i @> '{1}' and i @> '{}' and j @> '{}' $$),
($$ j @> '{10}' $$),
($$ j @> '{10}' and i @> '{}' $$),
($$ j @> '{10}' and j @> '{}' and i @> '{}' $$),
($$ i @> '{1}' and j @> '{10}' $$)
) q(query),
lateral explain_query_json($$select * from t_gin_test_tbl where $$ || query) js,
lateral execute_text_query_index($$select string_agg((i, j)::text, ' ') from t_gin_test_tbl where $$ || query) res_index,
lateral execute_text_query_heap($$select string_agg((i, j)::text, ' ') from t_gin_test_tbl where $$ || query) res_heap;
query | return by index | removed by recheck | match
-------------------------------------------+-----------------+--------------------+-------
i @> '{}' | 7 | 0 | t
j @> '{}' | 6 | 0 | t
i @> '{}' and j @> '{}' | 4 | 0 | t
i @> '{1}' | 5 | 0 | t
i @> '{1}' and j @> '{}' | 3 | 0 | t
i @> '{1}' and i @> '{}' and j @> '{}' | 3 | 0 | t
j @> '{10}' | 4 | 0 | t
j @> '{10}' and i @> '{}' | 3 | 0 | t
j @> '{10}' and j @> '{}' and i @> '{}' | 3 | 0 | t
i @> '{1}' and j @> '{10}' | 2 | 0 | t
(10 rows)
Fix code for re-finding scan position in a multicolumn GIN index.

collectMatchBitmap() needs to re-find the index tuple it was previously looking at, after transiently dropping lock on the index page it's on. The tuple should still exist and be at its prior position or somewhere to the right of that, since ginvacuum never removes tuples but concurrent insertions could add one. However, there was a thinko in that logic, to the effect of expecting any inserted tuples to have the same index "attnum" as what we'd been scanning. Since there's no physical separation of tuples with different attnums, it's not terribly hard to devise scenarios where this fails, leading to transient "lost saved point in index" errors. (While I've duplicated this with manual testing, it seems impossible to make a reproducible test case with our available testing technology.)

Fix by just continuing the scan when the attnum doesn't match.

While here, improve the error message used if we do fail, so that it matches the wording used in btree for a similar case.

collectMatchBitmap()'s posting-tree code path was previously not exercised at all by our regression tests. While I can't make a regression test that exhibits the bug, I can at least improve the code coverage here, so do that. The test case I made for this is an extension of one added by 4b754d6c1, so it only works in HEAD and v13; didn't seem worth trying hard to back-patch it.

Per bug #16595 from Jesse Kinkead. This has been broken since multicolumn capability was added to GIN (commit 27cb66fdf), so back-patch to all supported branches.

Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org
2020-08-27 23:36:13 +02:00
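The re-find logic described in the commit message above can be modelled roughly as follows. This is a simplified Python sketch, not the actual collectMatchBitmap() code: it assumes a single index page represented as a list of (attnum, key) pairs, and the names `refind_offset`, `IndexTuple`, and the error text are hypothetical. It shows only the fix itself: when a tuple with a different attnum is encountered, the scan continues to the right instead of failing.

```python
from typing import List, Tuple

# Hypothetical model of an index tuple: (attnum, key).
IndexTuple = Tuple[int, str]


def refind_offset(page: List[IndexTuple], saved_offset: int, saved: IndexTuple) -> int:
    """Return the current offset of `saved`, searching rightwards from saved_offset."""
    saved_attnum, saved_key = saved
    for offset in range(saved_offset, len(page)):
        attnum, key = page[offset]
        if attnum != saved_attnum:
            # A concurrently inserted tuple may belong to another column:
            # keep scanning rather than reporting an error.
            continue
        if key == saved_key:
            return offset
    # Illustrative message only; not the wording used by PostgreSQL.
    raise LookupError("could not re-find saved index tuple")


# Assumed scenario: the scan was positioned on (2, 'b') at offset 2, then a
# tuple for column 1 was inserted to its left while the page lock was released.
page_before = [(1, "a"), (1, "m"), (2, "b"), (2, "c")]
page_after = [(1, "a"), (1, "m"), (1, "z"), (2, "b"), (2, "c")]
found = refind_offset(page_after, 2, (2, "b"))
```

With the pre-fix behaviour described in the commit message, meeting `(1, 'z')` at the saved offset would have been treated as a lost scan position; in this sketch it is simply skipped.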
reset enable_seqscan;
reset enable_bitmapscan;
-- re-purpose t_gin_test_tbl to test scans involving posting trees
insert into t_gin_test_tbl select array[1, g, g/10], array[2, g, g/10]
from generate_series(1, 20000) g;
select gin_clean_pending_list('t_gin_test_tbl_i_j_idx') is not null;
?column?
----------
t
(1 row)
analyze t_gin_test_tbl;
set enable_seqscan = off;
set enable_bitmapscan = on;
explain (costs off)
select count(*) from t_gin_test_tbl where j @> array[50];
QUERY PLAN
---------------------------------------------------------
Aggregate
-> Bitmap Heap Scan on t_gin_test_tbl
Recheck Cond: (j @> '{50}'::integer[])
-> Bitmap Index Scan on t_gin_test_tbl_i_j_idx
Index Cond: (j @> '{50}'::integer[])
(5 rows)
select count(*) from t_gin_test_tbl where j @> array[50];
count
-------
11
(1 row)
explain (costs off)
select count(*) from t_gin_test_tbl where j @> array[2];
QUERY PLAN
---------------------------------------------------------
Aggregate
-> Bitmap Heap Scan on t_gin_test_tbl
Recheck Cond: (j @> '{2}'::integer[])
-> Bitmap Index Scan on t_gin_test_tbl_i_j_idx
Index Cond: (j @> '{2}'::integer[])
(5 rows)
select count(*) from t_gin_test_tbl where j @> array[2];
count
-------
20000
(1 row)
explain (costs off)
select count(*) from t_gin_test_tbl where j @> '{}'::int[];
QUERY PLAN
---------------------------------------------------------
Aggregate
-> Bitmap Heap Scan on t_gin_test_tbl
Recheck Cond: (j @> '{}'::integer[])
-> Bitmap Index Scan on t_gin_test_tbl_i_j_idx
Index Cond: (j @> '{}'::integer[])
(5 rows)
select count(*) from t_gin_test_tbl where j @> '{}'::int[];
count
-------
20006
(1 row)
-- test vacuuming of posting trees
delete from t_gin_test_tbl where j @> array[2];
vacuum t_gin_test_tbl;
select count(*) from t_gin_test_tbl where j @> array[50];
count
-------
0
(1 row)
select count(*) from t_gin_test_tbl where j @> array[2];
count
-------
0
(1 row)
select count(*) from t_gin_test_tbl where j @> '{}'::int[];
count
-------
6
(1 row)
reset enable_seqscan;
reset enable_bitmapscan;
drop table t_gin_test_tbl;
-- test an unlogged table, mostly to get coverage of ginbuildempty
create unlogged table t_gin_test_tbl(i int4[], j int4[]);
create index on t_gin_test_tbl using gin (i, j);
insert into t_gin_test_tbl
values
(null, null),
('{}', null),
('{1}', '{2,3}');
drop table t_gin_test_tbl;