为什么 PostgreSQL 中多维数据集列上的 GIST 索引实际上会使 K-最近邻 (KNN) ORDER BY 查询变得更糟？

如何解决为什么 PostgreSQL 中多维数据集列上的 GIST 索引实际上会使 K-最近邻 (KNN) ORDER BY 查询变得更糟？

添加 GIST 索引实际上似乎使 Postgresql 中 ORDER BY 列的 K-最近邻 (KNN) cube 查询更糟。为什么会这样，我们可以做些什么？

这就是我的意思。在 Postgresql 数据库中，我有一个表，其 DDL 为 create sample (id serial primary key,title text,embedding cube)，其中 embedding 列是使用 Google 语言模型获得的 title 的嵌入向量。 cube 数据类型由我安装的多维数据集扩展提供。顺便说一下，这些是维基百科文章的标题。无论如何，有100万条记录。然后我使用以下查询执行 KNN 查询。此查询使用欧几里得距离运算符 distance 定义 <->，但其他两个指标的结果相似。它执行 ORDER BY 并应用 LIMIT 以查找具有“相似”标题（最相似的是目标标题本身）的 10 篇维基百科文章。一切正常。

select sample.title,sample.embedding <-> cube('(0.18936706,-0.12455666,-0.31581765,0.0192692,-0.07364611,0.07851536,0.0290586,-0.02582532,-0.03378124,-0.10564457,-0.03903799,0.08668878,-0.15357816,-0.17793414,-0.01826405,0.01969068,0.11386908,0.1555583,0.09368557,0.13697313,-0.05610929,-0.06536788,-0.12212707,0.26356605,-0.06004387,-0.01966437,-0.1250324,-0.16645767,-0.13525756,0.22482251,-0.1709727,0.28966117,-0.07927769,-0.02498624,-0.10018375,-0.10923951,0.04770213,0.11573371,0.04619929,0.05216618,0.19176421,0.12948817,0.08719034,-0.16109011,-0.02411379,-0.05638905,-0.37334979,0.31225419,0.0744801,0.27044332)') distance from sample order by distance limit 10;

然而，令我困惑的是，如果我在 embedding 列上放置 GIST 索引，查询性能实际上更糟。添加索引后，查询计划会按预期以预期的方式更改，就它使用索引而言。但是……它变慢了！

这似乎与 cube 的 documentation 相悖，它指出：

此外，多维数据集 GiST 索引可用于在 ORDER BY 子句中使用度量运算符、和来查找最近邻

他们甚至提供了一个示例查询，这与我的非常相似。

SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1

这是删除索引之前的查询计划和时间信息。

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..6.30 rows=10 width=29)
   ->  Index Scan using sample_embedding_idx on sample  (cost=0.41..589360.33 rows=999996 width=29)
         Order By: (embedding <-> '(0.18936706,0.27044332)'::cube)
(3 rows)

        title         |      distance      
----------------------+--------------------
 david petrarca       | 0.5866321762629475
 david adamski        | 0.5866321762629475
 richard ansdell      | 0.6239883862603475
 linda darke          | 0.6392124797481789
 ilias tsiliggiris    | 0.6996660649119893
 watson,jim          | 0.7059481479504834
 sk radni%c4%8dki     |   0.71718948226995
 burnham,pa          | 0.7384858030758069
 arthur (europa-park) | 0.7468462897336924
 ivan kecojevic       | 0.7488206082281348
(10 rows)

Time: 1226.457 ms (00:01.226)

而且，这是删除索引后的查询计划和时间信息。

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=74036.32..74037.48 rows=10 width=29)
   ->  Gather Merge  (cost=74036.32..171264.94 rows=833330 width=29)
         Workers Planned: 2
         ->  Sort  (cost=73036.29..74077.96 rows=416665 width=29)
               Sort Key: ((embedding <-> '(0.18936706,0.27044332)'::cube))
               ->  Parallel Seq Scan on sample  (cost=0.00..64032.31 rows=416665 width=29)
(6 rows)

        title         |      distance      
----------------------+--------------------
 david petrarca       | 0.5866321762629475
 david adamski        | 0.5866321762629475
 richard ansdell      | 0.6239883862603475
 linda darke          | 0.6392124797481789
 ilias tsiliggiris    | 0.6996660649119893
 watson,pa          | 0.7384858030758069
 arthur (europa-park) | 0.7468462897336924
 ivan kecojevic       | 0.7488206082281348
(10 rows)

Time: 381.419 ms

注意：

带索引：1226.457 毫秒
无索引：381.419 毫秒

这种非常令人费解的行为！所有这些都记录在 GitHub repo 中，以便其他人可以尝试。我将添加有关如何生成嵌入向量的文档，但应该不需要，就像在快速入门中一样，我展示了可以从我的 Google Drive 文件夹下载预先计算的嵌入向量.

附录

在下面的评论中要求提供 explain (analyze,buffers) 的输出。就是这里

我重新创建了（覆盖）索引
我使用 explain (analyze,buffers)

查询

我删除了索引
我再次使用 explain (analyze,buffers) 运行查询

pgbench=# create index on sample using gist (embedding) include (title);
CREATE INDEX
Time: 51966.315 ms (00:51.966)
pgbench=# 
                                                                                                                                                                                                                                                                                                                                       QUERY PLAN                                                                                                                                                                                                                                                                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..4.15 rows=10 width=29) (actual time=3215.956..3216.667 rows=10 loops=1)
   Buffers: shared hit=1439 read=87004 written=7789
   ->  Index Only Scan using sample_embedding_title_idx on sample  (cost=0.41..373768.39 rows=999999 width=29) (actual time=3215.932..3216.441 rows=10 loops=1)
         Order By: (embedding <-> '(0.18936706,0.27044332)'::cube)
         Heap Fetches: 0
         Buffers: shared hit=1439 read=87004 written=7789
 Planning:
   Buffers: shared hit=14 read=6 dirtied=2
 Planning Time: 0.432 ms
 Execution Time: 3316.266 ms
(10 rows)

Time: 3318.333 ms (00:03.318)
pgbench=# drop index sample_embedding_title_idx;
DROP INDEX
Time: 182.324 ms
pgbench=# 
                                                                                                                                                                                                                                                                                                                                           QUERY PLAN                                                                                                                                                                                                                                                                                                                                            
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=74036.35..74037.52 rows=10 width=29) (actual time=6052.845..6057.210 rows=10 loops=1)
   Buffers: shared hit=70 read=58830
   ->  Gather Merge  (cost=74036.35..171265.21 rows=833332 width=29) (actual time=6052.825..6057.021 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=70 read=58830
         ->  Sort  (cost=73036.33..74077.99 rows=416666 width=29) (actual time=6002.928..6003.019 rows=8 loops=3)
               Sort Key: ((embedding <-> '(0.18936706,0.27044332)'::cube))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=70 read=58830
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on sample  (cost=0.00..64032.33 rows=416666 width=29) (actual time=0.024..3090.103 rows=333333 loops=3)
                     Buffers: shared read=58824
 Planning:
   Buffers: shared hit=3 read=3 dirtied=1
 Planning Time: 0.129 ms
 Execution Time: 6057.388 ms
(18 rows)

Time: 6053.284 ms (00:06.053)