How to improve the performance of a Postgres jsonb query combined with a relational query
I have a SELECT query over a regular postgres table with a jsonb column. When I select the entire jsonb column, the query is fast (574 ms). However, when I instead select a top-level path of that same jsonb column, the query slows down 6x (3241 ms). My final query needs the string-array values from 4 of these top-level jsonb paths, which slows the query to 5 seconds.
I have about 50K records in the cfiles table, and the jsonb column cfiles.property_values is structured like this:
{
  "Sample Names": ["up to 200 short strings..."],
  "Project IDs": ["up to 10 short strings..."],
  "Run IDs": ["up to 10 short strings..."],
  "Data Type": ["up to 10 short strings..."]
}
Following this answer, I tried adding the GIN index below, but it had little effect (runtimes in the comments below), I assume because my query is not a pure json query using the @> operator, and is combined with a relational query.
CREATE INDEX ON cfiles USING GIN (property_values jsonb_path_ops);
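For reference, a `jsonb_path_ops` GIN index can only accelerate containment-style predicates such as `@>` in the WHERE clause; it cannot help with `#>>` extraction in the SELECT list. A hypothetical query shape that this index could serve (the `"FASTQ"` value is invented for illustration) would look like:

```sql
-- Hypothetical containment query: this WHERE-clause shape is what a
-- jsonb_path_ops GIN index supports; the key comes from the JSON structure
-- shown above, the value is a made-up example.
SELECT id
FROM cfiles
WHERE property_values @> '{"Data Type": ["FASTQ"]}';
```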
I'm surprised by the huge difference between fetching the whole column versus querying even a single top-level json key. At this point it seems more performant to fetch the entire jsonb column as a string, split it on commas, and trim the quotes, which is the kind of hack I'd like to avoid.
My goal is ... Update: using PostgreSQL version 12
SELECT
-- FAST OPTION: getting all of json: no GIN=579ms; with GIN=574ms
cfiles.property_values as "1907",
-- == vs ==
-- SLOW OPTION: getting a json path: no GIN=3273ms; with GIN=3241ms
cfiles.property_values #>> '{"Sample Names"}' as "1907", -- adding another path: with GIN=4028ms
cfiles.property_values #>> '{"Project IDs"}' as "1908", -- adding yet another path: with GIN=4774ms
cfiles.property_values #>> '{"Run IDs"}' as "1909", -- adding yet another path: with GIN=5558ms
cfiles.property_values #>> '{"Data Type"}' as "1910",
-- ==== rest of query below I can't change ====
user_permissions.notified_at::text as "111", group_permissions.notified_at::text as "112", user_permissions.task_id::text as "113", group_permissions.task_id::text as "114",
datasets.id as "151", datasets.name as "154", datasets.path as "155", datasets.last_modified as "156", datasets.file_count as "157", datasets.locked as "158", datasets.content_types as "159",
cfiles.name as "105", cfiles.last_modified as "107", pg_size_pretty(cfiles.size::bigint) as "106", cfiles.id as "101", cfiles.tid as "102", cfiles.uuid as "103", cfiles.path as "104", cfiles.content_type as "108", cfiles.locked as "109", cfiles.checksum as "110"
FROM cfiles
JOIN datasets ON datasets.id=cfiles.dataset_id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cfiles.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cfiles.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
LEFT JOIN picklist_cfiles ON picklist_cfiles.cfile_id=cfiles.id
WHERE
cfiles.tid=5
ORDER BY "107" desc
LIMIT 20
OFFSET 0
Table "public.cfiles"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+------------------------------------
id | bigint | | not null | nextval('cfiles_id_seq'::regclass)
tid | bigint | | not null |
uuid | uuid | | not null | gen_random_uuid()
dataset_id | bigint | | not null |
path | character varying | | not null |
name | character varying | | |
checksum | character varying | | |
size | bigint | | |
last_modified | timestamp without time zone | | |
content_type | character varying | | |
locked | boolean | | not null | false
property_values | jsonb | | |
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
Indexes:
"cfiles_pkey" PRIMARY KEY, btree (id)
"cfiles_property_values_idx" gin (property_values jsonb_path_ops)
"index_cfiles_dataset_id_path" UNIQUE, btree (dataset_id, path)
"index_cfiles_name" btree (name)
"index_cfiles_tid" btree (tid)
"index_cfiles_uuid_id_path" UNIQUE, btree (uuid)
Foreign-key constraints:
"cfiles_datasets_fk" FOREIGN KEY (dataset_id) REFERENCES datasets(id)
"cfiles_tenants_fk" FOREIGN KEY (tid) REFERENCES tenants(id)
Referenced by:
TABLE "group_permissions" CONSTRAINT "group_permissions_cfiles_fk" FOREIGN KEY (cfile_id) REFERENCES cfiles(id)
TABLE "picklist_cfiles" CONSTRAINT "picklist_cfiles_cfiles_fk" FOREIGN KEY (cfile_id) REFERENCES cfiles(id)
TABLE "user_permissions" CONSTRAINT "user_permissions_cfiles_fk" FOREIGN KEY (cfile_id) REFERENCES cfiles(id)
The slow query plan:
Limit (cost=13700.06..13700.11 rows=20 width=662) (actual time=5702.511..5702.521 rows=20 loops=1)
Output: ((cfiles.property_values #>> '{"Sample Names"}'::text[])),((cfiles.property_values #>> '{"Project IDs"}'::text[])),((cfiles.property_values #>> '{"Run IDs"}'::text[])),((cfiles.property_values #>> '{"Data Type"}'::text[])),((user_permissions.notified_at)::text),((group_permissions.notified_at)::text),((user_permissions.task_id)::text),((group_permissions.task_id)::text),datasets.id,datasets.name,datasets.path,datasets.last_modified,datasets.file_count,datasets.locked,datasets.content_types,cfiles.name,cfiles.last_modified,(pg_size_pretty(cfiles.size)),cfiles.id,cfiles.tid,cfiles.uuid,cfiles.path,cfiles.content_type,cfiles.locked,cfiles.checksum
-> Sort (cost=13700.06..13810.61 rows=44219 width=662) (actual time=5702.508..5702.512 rows=20 loops=1)
Output: ((cfiles.property_values #>> '{"Sample Names"}'::text[])),cfiles.checksum
Sort Key: cfiles.last_modified DESC
Sort Method: top-N heapsort Memory: 344kB
-> Hash Left Join (cost=39.53..12523.41 rows=44219 width=662) (actual time=2.535..5526.409 rows=44255 loops=1)
Output: (cfiles.property_values #>> '{"Sample Names"}'::text[]),(cfiles.property_values #>> '{"Project IDs"}'::text[]),(cfiles.property_values #>> '{"Run IDs"}'::text[]),(cfiles.property_values #>> '{"Data Type"}'::text[]),(user_permissions.notified_at)::text,(group_permissions.notified_at)::text,(user_permissions.task_id)::text,(group_permissions.task_id)::text,pg_size_pretty(cfiles.size),cfiles.checksum
Hash Cond: (cfiles.id = picklist_cfiles.cfile_id)
-> Nested Loop Left Join (cost=38.19..10918.99 rows=44219 width=867) (actual time=1.639..632.739 rows=44255 loops=1)
Output: cfiles.property_values,cfiles.size,cfiles.checksum,user_permissions.notified_at,user_permissions.task_id,group_permissions.notified_at,group_permissions.task_id
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 177020
-> Nested Loop Left Join (cost=38.19..7822.61 rows=44219 width=851) (actual time=1.591..464.449 rows=44255 loops=1)
Output: cfiles.property_values,group_permissions.task_id
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 354040
-> Hash Join (cost=35.75..4723.32 rows=44219 width=835) (actual time=1.301..163.411 rows=44255 loops=1)
Output: cfiles.property_values,datasets.content_types
Inner Unique: true
Hash Cond: (cfiles.dataset_id = datasets.id)
-> Seq Scan on public.cfiles (cost=0.00..4570.70 rows=44219 width=644) (actual time=0.044..49.425 rows=44255 loops=1)
Output: cfiles.id,cfiles.dataset_id,cfiles.property_values,cfiles.created_at,cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1561
-> Hash (cost=28.11..28.11 rows=611 width=199) (actual time=1.234..1.235 rows=611 loops=1)
Output: datasets.id,datasets.content_types
Buckets: 1024 Batches: 1 Memory Usage: 149kB
-> Seq Scan on public.datasets (cost=0.00..28.11 rows=611 width=199) (actual time=0.012..0.571 rows=611 loops=1)
Output: datasets.id,datasets.content_types
-> Materialize (cost=2.44..3.97 rows=4 width=32) (actual time=0.000..0.002 rows=8 loops=44255)
Output: group_permissions.notified_at,group_permissions.task_id,group_permissions.cfile_id,group_permissions.dataset_id
-> Hash Right Join (cost=2.44..3.95 rows=4 width=32) (actual time=0.170..0.248 rows=8 loops=1)
Output: group_permissions.notified_at,group_permissions.dataset_id
Hash Cond: (user_groups.group_id = groups.id)
-> Seq Scan on public.user_groups (cost=0.00..1.34 rows=34 width=8) (actual time=0.022..0.056 rows=34 loops=1)
Output: user_groups.id,user_groups.tid,user_groups.user_id,user_groups.group_id,user_groups.created_at,user_groups.updated_at
-> Hash (cost=2.39..2.39 rows=4 width=40) (actual time=0.121..0.121 rows=4 loops=1)
Output: group_permissions.notified_at,group_permissions.dataset_id,groups.id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Hash Right Join (cost=1.09..2.39 rows=4 width=40) (actual time=0.063..0.092 rows=4 loops=1)
Output: group_permissions.notified_at,groups.id
Hash Cond: (groups.id = group_permissions.group_id)
-> Seq Scan on public.groups (cost=0.00..1.19 rows=19 width=8) (actual time=0.010..0.017 rows=19 loops=1)
Output: groups.id,groups.tid,groups.name,groups.description,groups.default_uview,groups.created_at,groups.updated_at
-> Hash (cost=1.04..1.04 rows=4 width=40) (actual time=0.032..0.033 rows=4 loops=1)
Output: group_permissions.notified_at,group_permissions.group_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.group_permissions (cost=0.00..1.04 rows=4 width=40) (actual time=0.017..0.022 rows=4 loops=1)
Output: group_permissions.notified_at,group_permissions.group_id
-> Materialize (cost=0.00..1.06 rows=4 width=40) (actual time=0.000..0.001 rows=4 loops=44255)
Output: user_permissions.notified_at,user_permissions.cfile_id,user_permissions.dataset_id,user_permissions.user_id
-> Seq Scan on public.user_permissions (cost=0.00..1.04 rows=4 width=40) (actual time=0.021..0.025 rows=4 loops=1)
Output: user_permissions.notified_at,user_permissions.user_id
-> Hash (cost=1.15..1.15 rows=15 width=8) (actual time=0.040..0.040 rows=15 loops=1)
Output: picklist_cfiles.cfile_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.picklist_cfiles (cost=0.00..1.15 rows=15 width=8) (actual time=0.010..0.017 rows=15 loops=1)
Output: picklist_cfiles.cfile_id
Planning Time: 3.141 ms
Execution Time: 5702.799 ms
(61 rows)
Update: refactoring into a CTE pattern got me down to 20 ms
WITH T as (
select cfiles.property_values as prop_vals, user_permissions.notified_at::text as "111", cfiles.checksum as "110"
FROM cfiles
JOIN datasets ON datasets.id=cfiles.dataset_id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cfiles.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cfiles.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
LEFT JOIN picklist_cfiles ON picklist_cfiles.cfile_id=cfiles.id
WHERE
cfiles.tid=5
LIMIT 20
)
SELECT
prop_vals ->> 'Sample Names' as "1907", prop_vals ->> 'Project IDs' as "1908", prop_vals ->> 'Run IDs' as "1909", prop_vals ->> 'Data Type' as "1910",
"111", "112", "113", "114", "151", "154", "155", "156", "157", "158", "159", "105", "107", "106", "101", "102", "103", "104", "108", "109", "110"
FROM T
ORDER BY "107" desc;
The CTE query plan:
Sort (cost=16.18..16.23 rows=20 width=662) (actual time=18.771..18.779 rows=20 loops=1)
Output: ((t.prop_vals ->> 'Sample Names'::text)),((t.prop_vals ->> 'Project IDs'::text)),((t.prop_vals ->> 'Run IDs'::text)),((t.prop_vals ->> 'Data Type'::text)),t."111",t."112",t."113",t."114",t."151",t."154",t."155",t."156",t."157",t."158",t."159",t."105",t."107",t."106",t."101",t."102",t."103",t."104",t."108",t."109",t."110"
Sort Key: t."107" DESC
Sort Method: quicksort Memory: 368kB
-> Subquery Scan on t (cost=4.05..15.74 rows=20 width=662) (actual time=1.091..18.412 rows=20 loops=1)
Output: (t.prop_vals ->> 'Sample Names'::text),(t.prop_vals ->> 'Project IDs'::text),(t.prop_vals ->> 'Run IDs'::text),(t.prop_vals ->> 'Data Type'::text),t."110"
-> Limit (cost=4.05..15.34 rows=20 width=987) (actual time=0.320..1.241 rows=20 loops=1)
Output: cfiles.property_values,cfiles.checksum
-> Nested Loop Left Join (cost=4.05..24965.23 rows=44219 width=987) (actual time=0.318..1.224 rows=20 loops=1)
Output: cfiles.property_values,cfiles.checksum
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 80
-> Nested Loop Left Join (cost=4.05..20873.92 rows=44219 width=851) (actual time=0.273..1.056 rows=20 loops=1)
Output: cfiles.property_values,group_permissions.task_id
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 160
-> Nested Loop (cost=1.61..17774.63 rows=44219 width=835) (actual time=0.125..0.745 rows=20 loops=1)
Output: cfiles.property_values,datasets.content_types
Inner Unique: true
-> Hash Left Join (cost=1.34..4738.00 rows=44219 width=644) (actual time=0.094..0.475 rows=20 loops=1)
Output: cfiles.property_values,cfiles.dataset_id
Hash Cond: (cfiles.id = picklist_cfiles.cfile_id)
-> Seq Scan on public.cfiles (cost=0.00..4570.70 rows=44219 width=644) (actual time=0.046..0.360 rows=20 loops=1)
Output: cfiles.id,cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 629
-> Hash (cost=1.15..1.15 rows=15 width=8) (actual time=0.034..0.035 rows=15 loops=1)
Output: picklist_cfiles.cfile_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.picklist_cfiles (cost=0.00..1.15 rows=15 width=8) (actual time=0.010..0.018 rows=15 loops=1)
Output: picklist_cfiles.cfile_id
-> Index Scan using datasets_pkey on public.datasets (cost=0.28..0.29 rows=1 width=199) (actual time=0.008..0.008 rows=1 loops=20)
Output: datasets.id,datasets.tid,datasets.bucket_path_id,datasets.size,datasets.created_at,datasets.updated_at
Index Cond: (datasets.id = cfiles.dataset_id)
-> Materialize (cost=2.44..3.97 rows=4 width=32) (actual time=0.005..0.009 rows=8 loops=20)
Output: group_permissions.notified_at,group_permissions.dataset_id
-> Hash Right Join (cost=2.44..3.95 rows=4 width=32) (actual time=0.088..0.122 rows=8 loops=1)
Output: group_permissions.notified_at,group_permissions.dataset_id
Hash Cond: (user_groups.group_id = groups.id)
-> Seq Scan on public.user_groups (cost=0.00..1.34 rows=34 width=8) (actual time=0.007..0.016 rows=34 loops=1)
Output: user_groups.id,user_groups.updated_at
-> Hash (cost=2.39..2.39 rows=4 width=40) (actual time=0.069..0.069 rows=4 loops=1)
Output: group_permissions.notified_at,groups.id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Hash Right Join (cost=1.09..2.39 rows=4 width=40) (actual time=0.043..0.064 rows=4 loops=1)
Output: group_permissions.notified_at,groups.id
Hash Cond: (groups.id = group_permissions.group_id)
-> Seq Scan on public.groups (cost=0.00..1.19 rows=19 width=8) (actual time=0.006..0.011 rows=19 loops=1)
Output: groups.id,groups.updated_at
-> Hash (cost=1.04..1.04 rows=4 width=40) (actual time=0.022..0.022 rows=4 loops=1)
Output: group_permissions.notified_at,group_permissions.group_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.group_permissions (cost=0.00..1.04 rows=4 width=40) (actual time=0.009..0.014 rows=4 loops=1)
Output: group_permissions.notified_at,group_permissions.group_id
-> Materialize (cost=0.00..1.06 rows=4 width=40) (actual time=0.001..0.003 rows=4 loops=20)
Output: user_permissions.notified_at,user_permissions.user_id
-> Seq Scan on public.user_permissions (cost=0.00..1.04 rows=4 width=40) (actual time=0.018..0.022 rows=4 loops=1)
Output: user_permissions.notified_at,user_permissions.user_id
Planning Time: 4.049 ms
Execution Time: 19.128 ms
(60 rows)
Solution
Your slow query is deTOASTing the large jsonb data for all 44255 rows, and then carrying the parsed-out values through the sort in order to pick out the top 20 rows. (I don't know why it deTOASTs so eagerly.) So 44235 deTOASTed JSONBs get thrown away.
Your fast query is (presumably) returning TOAST pointers from the hash join, sorting the rows using those small pointers, and then deTOASTing only the 20 survivors. In the case of EXPLAIN ANALYZE, it doesn't even deTOAST the survivors, it just throws the pointers away.
That's the "why"; as for what to do about it, if you truly can't change any of the query below the very top, then I doubt there is anything you can do about it server-side.
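As an aside to the explanation above (my addition, not part of the answer): since deTOASTing a compressed value also means decompressing it, one mitigation that is sometimes suggested is to store the column out-of-line but uncompressed, so that repeated detoasting is cheaper. Whether this helps for this workload is untested; `pg_column_size()` can first confirm how large the stored values actually are.

```sql
-- Hypothetical mitigation, untested on this schema: EXTERNAL storage keeps
-- property_values out-of-line in TOAST but skips compression, making each
-- deTOAST cheaper. Only affects rows written after the change.
ALTER TABLE cfiles ALTER COLUMN property_values SET STORAGE EXTERNAL;

-- Check the on-disk (possibly compressed) size of the stored values, to
-- judge whether TOAST traffic is really the bottleneck:
SELECT avg(pg_column_size(property_values)) FROM cfiles WHERE tid = 5;
```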
If you can modify the query more substantially, then you can improve the run time with a CTE. Have the CTE select the whole jsonb, and the select on the CTE pull the values out of it.
WITH T as (select cfiles.property_values as "1907",<rest of query>)
SELECT "1907"->>'name1',"1907"->>'name2',<rest of select list> from T;
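One caveat to the sketch above (my addition, not part of the answer): PostgreSQL 12 inlines non-recursive, side-effect-free CTEs into the outer query by default, so if the planner ever flattens `T` and pushes the path extraction back down under the sort, the optimization fence can be made explicit with `AS MATERIALIZED` (new syntax in version 12):

```sql
-- Sketch under the same elided-query assumption as above: MATERIALIZED
-- forces the CTE to be evaluated on its own, guaranteeing the jsonb paths
-- are extracted only from the rows the CTE actually returns.
WITH T AS MATERIALIZED (
  SELECT cfiles.property_values AS "1907" -- <rest of query>
)
SELECT "1907" ->> 'name1', "1907" ->> 'name2' -- <rest of select list>
FROM T;
```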
In addition to what @jjanes already said, you could first limit the number of records to 20, and only then do the rest of the work. Like this:
WITH i(id) AS (
-- core piece of SQL to select the records you're looking for
SELECT
cfiles.ID
FROM
cfiles
JOIN datasets ON datasets.ID = cfiles.dataset_id
WHERE
cfiles.tid = 5
ORDER BY
cfiles.last_modified DESC
LIMIT 20 OFFSET 0
)
SELECT
-- FAST OPTION: getting all of json: no GIN=579ms; with GIN=574ms
cfiles.property_values AS "1907",
-- == vs ==
-- SLOW OPTION: getting a json path: no GIN=3273ms; with GIN=3241ms
cfiles.property_values #>> '{"Sample Names"}' AS "1907", -- adding another path: with GIN=4028ms
cfiles.property_values #>> '{"Project IDs"}' AS "1908", -- adding yet another path: with GIN=4774ms
cfiles.property_values #>> '{"Run IDs"}' AS "1909", -- adding yet another path: with GIN=5558ms
cfiles.property_values #>> '{"Data Type"}' AS "1910",
-- ==== rest of query below I can't change ====
user_permissions.notified_at::TEXT AS "111", group_permissions.notified_at::TEXT AS "112", user_permissions.task_id::TEXT AS "113", group_permissions.task_id::TEXT AS "114",
datasets.id AS "151", datasets.name AS "154", datasets.path AS "155", datasets.last_modified AS "156", datasets.file_count AS "157", datasets.locked AS "158", datasets.content_types AS "159",
cfiles.name AS "105", cfiles.last_modified AS "107", pg_size_pretty(cfiles.size::BIGINT) AS "106", cfiles.id AS "101", cfiles.tid AS "102", cfiles.uuid AS "103", cfiles.path AS "104", cfiles.content_type AS "108", cfiles.locked AS "109", cfiles.checksum AS "110"
FROM
cfiles
JOIN i USING(id) -- should match just 20 records
JOIN datasets ON datasets.ID = cfiles.dataset_id
LEFT JOIN user_permissions ON ( user_permissions.cfile_id = cfiles.ID OR user_permissions.dataset_id = datasets.ID )
LEFT JOIN users ON users.ID = user_permissions.user_id
LEFT JOIN group_permissions ON ( group_permissions.cfile_id = cfiles.ID OR group_permissions.dataset_id = datasets.ID )
LEFT JOIN groups ON groups.ID = group_permissions.group_id
LEFT JOIN user_groups ON groups.ID = user_groups.group_id
LEFT JOIN picklist_cfiles ON picklist_cfiles.cfile_id = cfiles.ID
ORDER BY
"107" DESC;
You may also want to rewrite the two LEFT JOINs that have OR conditions; you could use subqueries with UNION ALL. That might speed things up.
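A sketch of that rewrite (my interpretation, untested against this schema): each side of the OR'ed join condition becomes its own branch of a UNION ALL inside a lateral subquery, so every branch has a single equality condition that an index can serve. Shown for the user_permissions join only; the group_permissions join would be rewritten the same way.

```sql
-- Hypothetical rewrite of one OR'ed LEFT JOIN as a lateral UNION ALL.
-- The guard in the second branch avoids returning a permission row twice
-- when both the cfile_id and dataset_id conditions match.
SELECT cfiles.id, up.notified_at, up.task_id
FROM cfiles
JOIN datasets ON datasets.id = cfiles.dataset_id
LEFT JOIN LATERAL (
  SELECT u.notified_at, u.task_id
  FROM user_permissions u
  WHERE u.cfile_id = cfiles.id
  UNION ALL
  SELECT u.notified_at, u.task_id
  FROM user_permissions u
  WHERE u.dataset_id = datasets.id
    AND (u.cfile_id IS NULL OR u.cfile_id <> cfiles.id)
) up ON true
WHERE cfiles.tid = 5;
```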