如何解决如何过滤具有多次通过关系的 SQL 结果
我很好奇。众所周知,好奇心以杀死猫而闻名。
那么,给cat-skinning最快的方法是什么?
本次测试的cat-skinning环境:
- Debian Squeeze 上的具有不错的 RAM 和设置。
- 6.000 名学生,24.000 名俱乐部会员(从具有现实生活数据的类似数据库中复制的数据。)
- 稍微偏离了问题中的命名模式:
student.id
isstudent.stud_id
和club.id
isclub.club_id
here。 - 我在这个线程中以其作者的名字命名了这些查询。
- 我运行了几次所有查询来填充缓存,然后我选择了 5 个中最好的
EXPLAIN ANALYZE
. - 相关指标(应该是最佳的——只要我们不知道哪些俱乐部会被查询):
ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY(stud_id );
ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY(stud_id, club_id);
ALTER TABLE club ADD CONSTRAINT club_pkey PRIMARY KEY(club_id );
CREATE INDEX sc_club_id_idx ON student_club (club_id);
club_pkey
此处的大多数查询不需要。
主键在 Postgresql 中自动实现唯一索引。
最后一个索引是为了弥补Postgresql上多列索引的这个众所周知的缺点:
多列 B 树索引可与涉及索引列的任何子集的查询条件一起使用,但当对前导(最左侧)列有约束时,索引最有效。
结果
总运行时间来自EXPLAIN ANALYZE
.
1) Martin 2: 44.594 ms
SELECT s.stud_id, s.name
FROM student s
JOIN student_club sc USING (stud_id)
WHERE sc.club_id IN (30, 50)
GROUP BY 1,2
HAVING COUNT(*) > 1;
2) Erwin 1: 33.217 ms
SELECT s.stud_id, s.name
FROM student s
JOIN (
SELECT stud_id
FROM student_club
WHERE club_id IN (30, 50)
GROUP BY 1
HAVING COUNT(*) > 1
) sc USING (stud_id);
3) Martin 1: 31.735 ms
SELECT s.stud_id, s.name
FROM student s
WHERE student_id IN (
SELECT student_id
FROM student_club
WHERE club_id = 30
INTERSECT
SELECT stud_id
FROM student_club
WHERE club_id = 50
);
4) Derek: 2.287 ms
SELECT s.stud_id, s.name
FROM student s
WHERE s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
AND s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);
5) Erwin 2: 2.181 ms
SELECT s.stud_id, s.name
FROM student s
WHERE EXISTS (SELECT * FROM student_club
WHERE stud_id = s.stud_id AND club_id = 30)
AND EXISTS (SELECT * FROM student_club
WHERE stud_id = s.stud_id AND club_id = 50);
6) Sean: 2.043 ms
SELECT s.stud_id, s.name
FROM student s
JOIN student_club x ON s.stud_id = x.stud_id
JOIN student_club y ON s.stud_id = y.stud_id
WHERE x.club_id = 30
AND y.club_id = 50;
最后三个表现几乎相同。4) 和 5) 导致相同的查询计划。
后期添加
7) ypercube 1: 148.649 ms
SELECT s.stud_id, s.name
FROM student AS s
WHERE NOT EXISTS (
SELECT *
FROM club AS c
WHERE c.club_id IN (30, 50)
AND NOT EXISTS (
SELECT *
FROM student_club AS sc
WHERE sc.stud_id = s.stud_id
AND sc.club_id = c.club_id
)
);
8) ypercube 2: 147.497 ms
SELECT s.stud_id, s.name
FROM student AS s
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT 30 AS club_id
UNION ALL
SELECT 50
) AS c
WHERE NOT EXISTS (
SELECT *
FROM student_club AS sc
WHERE sc.stud_id = s.stud_id
AND sc.club_id = c.club_id
)
);
正如预期的那样,这两者的表现几乎相同。查询计划导致表扫描,计划器在这里找不到使用索引的方法。
9) Wildplasser 1: 49.849 ms
WITH RECURSIVE two AS (
SELECT 1::int AS level
, stud_id
FROM student_club sc1
WHERE sc1.club_id = 30
UNION
SELECT two.level + 1 AS level
, sc2.stud_id
FROM student_club sc2
JOIN two USING (stud_id)
WHERE sc2.club_id = 50
AND two.level = 1
)
SELECT s.stud_id, s.student
FROM student s
JOIN two USING (studid)
WHERE two.level > 1;
10) Wildplasser 2: 36.986 ms
WITH sc AS (
SELECT stud_id
FROM student_club
WHERE club_id IN (30,50)
GROUP BY stud_id
HAVING COUNT(*) > 1
)
SELECT s.*
FROM student s
JOIN sc USING (stud_id);
查询 2) 的 CTE 变体。令人惊讶的是,对于完全相同的数据,它可能会导致略有不同的查询计划。我在 上找到了一个顺序扫描student
,其中子查询变体使用了索引。
11) ypercube 3: 101.482 ms
另一个后期添加 ypercube。真是太神奇了,有多少种方法。
SELECT s.stud_id, s.student
FROM student s
JOIN student_club sc USING (stud_id)
WHERE sc.club_id = 10 -- member in 1st club ...
AND NOT EXISTS (
SELECT *
FROM (SELECT 14 AS club_id) AS c -- can't be excluded for missing the 2nd
WHERE NOT EXISTS (
SELECT *
FROM student_club AS d
WHERE d.stud_id = sc.stud_id
AND d.club_id = c.club_id
)
);
12) erwin 3: 2.377 ms
ypercube 的 11) 实际上只是这个更简单变体的令人费解的反向方法,它也仍然缺失。执行速度几乎与顶级猫科动物一样快。
SELECT s.*
FROM student s
JOIN student_club x USING (stud_id)
WHERE sc.club_id = 10 -- member in 1st club ...
AND EXISTS ( -- ... and membership in 2nd exists
SELECT *
FROM student_club AS y
WHERE y.stud_id = s.stud_id
AND y.club_id = 14
);
13) erwin 4: 2.375 ms
难以置信,但这是另一个真正的新变体。我看到了两个以上会员资格的潜力,但它也仅凭两个就跻身顶级among 之列。
SELECT s.*
FROM student AS s
WHERE EXISTS (
SELECT *
FROM student_club AS x
JOIN student_club AS y USING (stud_id)
WHERE x.stud_id = s.stud_id
AND x.club_id = 14
AND y.club_id = 10
);
俱乐部会员动态数量
换句话说:不同数量的过滤器。这个问题要求恰好有俱乐部会员资格。但是许多用例必须为不同的数量做准备。
解决方法
假设我有表student
,club
,和student_club
:
student {
id
name
}
club {
id
name
}
student_club {
student_id
club_id
}
我想知道如何找到足球 (30) 和棒球 (50) 俱乐部的所有学生。
虽然这个查询不起作用,但它是我迄今为止最接近的事情:
SELECT student.*
FROM student
INNER JOIN student_club sc ON student.id = sc.student_id
LEFT JOIN club c ON c.id = sc.club_id
WHERE c.id = 30 AND c.id = 50
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。