Optimizing a "repost" MySQL query

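I have a Post table, a Repost table, and a table that records which users follow which.

I want to do something similar to Twitter, where I show all posts or reposts from the users the viewer follows.

I want each post to show up only at its first appearance, so that if several followed users repost the same post, only the earliest of those reposts is shown.

To speed up the query, every time a post is created I also insert a row into the Repost table, so that each post has a corresponding repost from its author.

My schema looks like this: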

Table Post
id: INT
userId: INT
time: INT

Table Repost
id: INT
postId: INT
userId: INT
time: INT

Table users_following
userId: INT
followerId: INT

My query looks like this:

SELECT sr.* FROM Repost sr
INNER JOIN (
    SELECT MIN(ir.time) min_time,ir.postId FROM Repost ir
    WHERE ir.userId IN (
        SELECT uf.userId FROM users_following uf WHERE
        ir.userId = uf.userId AND uf.followerId = 1
    )
    OR ir.userId = 1
    GROUP BY ir.postId
) rr ON rr.postId = sr.postId AND sr.time = rr.min_time

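The idea is:

SELECT FROM users_following uf: select the IDs of all users the viewer follows.
SELECT FROM Repost ir: for each post, select the earliest repost time where the reposter is a followed user or the viewer.
SELECT FROM Repost sr: use an inner join to select the repost that has that earliest time for each post.

This works, but stage 3 (the outer SELECT FROM Repost sr) is slow. I believe this is because once we have a large list of min_time values, no index can be used to select from that subquery, so everything has to be scanned. Is there a way to structure this query so that it performs better?

Here are the full EXPLAIN and SHOW CREATE TABLE outputs for hardcore readers.

EXPLAIN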

+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+
| id | select_type        | table      | partitions | type   | possible_keys                                               | key                  | key_len | ref                             | rows   | filtered | Extra                    |
+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+
|  1 | PRIMARY            | <derived2> | NULL       | ALL    | NULL                                                        | NULL                 | NULL    | NULL                            | 797455 |   100.00 | Using where              |
|  1 | PRIMARY            | sr         | NULL       | ref    | IDX_DA9843F3E094D20D,repost_time_idx,repost_stream_idx      | repost_time_idx      | 4       | rr.min_time                     |      1 |     4.92 | Using where              |
|  2 | DERIVED            | ir         | NULL       | index  | IDX_DA9843F364B64DCC,IDX_DA9843F3E094D20D,repost_stream_idx | IDX_DA9843F3E094D20D | 4       | NULL                            | 797456 |   100.00 | Using where              |
|  3 | DEPENDENT SUBQUERY | uf         | NULL       | eq_ref | PRIMARY,IDX_17C2F70264B64DCC,IDX_17C2F702F542AA03           | PRIMARY              | 8       | prose_2_24_2021.ir.userId,const |      1 |   100.00 | Using where; Using index |
+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+

SHOW CREATE TABLE Repost

CREATE TABLE `Repost` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `postId` int(11) NOT NULL,
  `userId` int(11) NOT NULL,
  `time` int(11) NOT NULL,
  `isRepost` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_DA9843F364B64DCC` (`userId`),
  KEY `IDX_DA9843F3E094D20D` (`postId`),
  KEY `repost_time_idx` (`time`),
  KEY `repost_stream_idx` (`time`,`userId`,`postId`),
  CONSTRAINT `FK_DA9843F364B64DCC` FOREIGN KEY (`userId`) REFERENCES `ProseUser` (`id`),
  CONSTRAINT `FK_DA9843F3E094D20D` FOREIGN KEY (`postId`) REFERENCES `Post` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=809018 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

SHOW CREATE TABLE users_following

CREATE TABLE `users_following` (
  `userId` int(11) NOT NULL,
  `followerId` int(11) NOT NULL,
  PRIMARY KEY (`userId`,`followerId`),
  KEY `IDX_17C2F70264B64DCC` (`userId`),
  KEY `IDX_17C2F702F542AA03` (`followerId`),
  CONSTRAINT `FK_17C2F70264B64DCC` FOREIGN KEY (`userId`) REFERENCES `ProseUser` (`id`),
  CONSTRAINT `FK_17C2F702F542AA03` FOREIGN KEY (`followerId`) REFERENCES `ProseUser` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Edit

Adjusting the query like this gives much faster results, although adding ORDER BY slows it down again. Without the ORDER BY, this query is fine.

SELECT sr.* FROM Repost sr
INNER JOIN (
    SELECT MIN(ir.time) min_time,ir.postId FROM Repost ir
    INNER JOIN users_following uf ON ir.userId = uf.userId AND uf.followerId = 1
    GROUP BY ir.postId
) rr ON rr.postId = sr.postId AND sr.time = rr.min_time
ORDER BY sr.time desc
LIMIT 10

Here is the EXPLAIN output for this query:

+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+
| id | select_type | table      | partitions | type   | possible_keys                                                                  | key                  | key_len | ref                       | rows | filtered | Extra                                        |
+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL                                                                           | NULL                 | NULL    | NULL                      |  691 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | sr         | NULL       | ref    | IDX_DA9843F3E094D20D,repost_stream_idx,repost_stream2_idx                      | repost_stream2_idx   | 8       | rr.min_time,rr.postId     |    1 |   100.00 | NULL                                         |
|  2 | DERIVED     | uf         | NULL       | ref    | PRIMARY,IDX_17C2F702F542AA03                                                   | IDX_17C2F702F542AA03 | 4       | const                     |  145 |   100.00 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | ir         | NULL       | ref    | IDX_DA9843F364B64DCC,repost_stream2_idx                                        | IDX_DA9843F364B64DCC | 4       | prose_2_24_2021.uf.userId |    9 |   100.00 | NULL                                         |
|  2 | DERIVED     | rp         | NULL       | eq_ref | PRIMARY,post_spotlight_idx,post_time_idx,post_trending_idx                     | PRIMARY              | 4       | prose_2_24_2021.ir.postId |    1 |    50.00 | Using where                                  |
+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+

Solution

The way I typically write this kind of ranking query is:

select id,postid,userid,time
from
(
  select rp.*,min(time) over (partition by postid) as first_time
  from repost rp
  where userid = 1 
  or userid in (select userid from users_following where followerid = 1)
) numbered
where time = first_time;
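
To check that this query returns each post exactly once, at its earliest repost among the viewer and the followed users, here is a minimal sketch on made-up data. It uses Python's built-in sqlite3 module, with SQLite standing in for MySQL 8 purely so the example is self-contained (window functions need SQLite 3.25 or later). The sample rows and the name FEED_SQL are hypothetical, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repost (id INTEGER PRIMARY KEY, postid INT, userid INT, time INT);
CREATE TABLE users_following (userid INT, followerid INT,
                              PRIMARY KEY (userid, followerid));
-- Assumption: viewer 1 follows users 2 and 3; user 4 is not followed.
INSERT INTO users_following VALUES (2, 1), (3, 1);
INSERT INTO repost VALUES
  (1, 100, 4, 10),  -- post 100 created by user 4 (not followed)
  (2, 100, 2, 20),  -- earliest repost of post 100 by a followed user
  (3, 100, 3, 30),  -- later repost of post 100
  (4, 200, 1, 15),  -- the viewer's own post
  (5, 200, 3, 25);  -- later repost of the viewer's post
""")

FEED_SQL = """
select id, postid, userid, time
from (
  select rp.*, min(time) over (partition by postid) as first_time
  from repost rp
  where userid = :viewer
     or userid in (select userid from users_following where followerid = :viewer)
) numbered
where time = first_time
order by time desc
"""

feed = conn.execute(FEED_SQL, {"viewer": 1}).fetchall()
print(feed)  # [(2, 100, 2, 20), (4, 200, 1, 15)]
```

Post 100 appears at time 20 rather than 10, because the row at time 10 belongs to a user the viewer does not follow and is filtered out before the window function runs.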

Sometimes optimizers have trouble with OR: they fail to see that they could read the table twice, even when that would be faster. In that case we can hint at it with UNION ALL:

select id,postid,userid,time
from
(
  select rp.*,min(time) over (partition by postid) as first_time
  from
  (
    select *
    from repost
    where userid = 1 
    union all
    select *
    from repost
    where userid in (select userid from users_following where followerid = 1)
  ) rp
) numbered
where time = first_time;

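MySQL used to be notorious for handling IN clauses poorly. I don't think that is the case any more. If your DBMS does have trouble with it, you can use EXISTS instead: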

from repost rp
where exists 
(
  select null
  from users_following uf
  where uf.userid = rp.userid 
  and uf.followerid = 1
)

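In MySQL versions before 8, analytic functions such as MIN OVER are not available. In those versions you have to find the minimum time per post and then read the table a second time. A straightforward way: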

select *
from repost
where (postid,time) in
(
  select postid,min(time)
  from repost
  where userid = 1 
  or userid in (select userid from users_following where followerid = 1)
  group by postid
);
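
The GROUP BY form can be checked the same way. The sketch below again uses Python's sqlite3 module as a stand-in for an older MySQL (an assumption for the sake of a runnable example; row-value IN needs SQLite 3.15 or later), with hypothetical sample rows. An ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repost (id INTEGER PRIMARY KEY, postid INT, userid INT, time INT);
CREATE TABLE users_following (userid INT, followerid INT,
                              PRIMARY KEY (userid, followerid));
-- Assumption: viewer 1 follows users 2 and 3; user 4 is not followed.
INSERT INTO users_following VALUES (2, 1), (3, 1);
INSERT INTO repost VALUES
  (1, 100, 4, 10),
  (2, 100, 2, 20),
  (3, 100, 3, 30),
  (4, 200, 1, 15),
  (5, 200, 3, 25);
""")

FIRST_REPOSTS_SQL = """
select *
from repost
where (postid, time) in
(
  select postid, min(time)
  from repost
  where userid = 1
     or userid in (select userid from users_following where followerid = 1)
  group by postid
)
order by id
"""

first_reposts = conn.execute(FIRST_REPOSTS_SQL).fetchall()
print(first_reposts)  # [(2, 100, 2, 20), (4, 200, 1, 15)]
```

One thing to keep in mind: the outer SELECT is not restricted by user, so a repost by an unfollowed user with exactly the same postid and time as the minimum would also match. Whether that can happen depends on your data.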

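In any case, you want an index that finds the followed users quickly. The DBMS is free either to take the reposting users and check whether user #1 follows them, or to take user #1 and find all the users they follow. So I would offer two indexes: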

create index idx1 on users_following (userid,followerid);
create index idx2 on users_following (followerid,userid);

Then you want to find their reposts quickly, group them by post ID, and order them by time. An index for that:

create index idx3 on repost (userid,time);

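Another way to look at it: if we read through the whole table and keep only the rows of the desired users, it would be nice if the rows were already sorted by postid, time. So, just in case, also offer

create index idx4 on repost (postid,time);

for a full index scan.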

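Indexes are an offer to the DBMS. The DBMS may accept the offer and use an index, or not. What I usually do:

Think about the orders in which the DBMS might access the tables.
Provide indexes for these routes.
Use EXPLAIN to see which of my indexes are actually used.
Drop the others.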
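
The workflow above can be sketched as follows. This is only an illustration: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN in place of MySQL's EXPLAIN, the tables are empty, and the index names are hypothetical. Which indexes a real MySQL server picks depends on its version, its statistics, and your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repost (id INTEGER PRIMARY KEY, postid INT, userid INT, time INT);
CREATE TABLE users_following (userid INT, followerid INT,
                              PRIMARY KEY (userid, followerid));
-- Offer candidate indexes for the access routes we can think of.
CREATE INDEX idx2 ON users_following (followerid, userid);
CREATE INDEX idx3 ON repost (userid, time);
CREATE INDEX idx4 ON repost (postid, time);
""")

QUERY = """
select postid, min(time)
from repost
where userid = 1
   or userid in (select userid from users_following where followerid = 1)
group by postid
"""

# Ask the planner which indexes it would actually use.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
details = [row[-1] for row in plan]  # the last column holds the plan text

candidates = {"idx2", "idx3", "idx4"}
used = {name for name in candidates if any(name in d for d in details)}

for d in details:
    print(d)
# Indexes the planner ignored are candidates for dropping.
print("used:", sorted(used))
print("unused:", sorted(candidates - used))
```

Before dropping anything in production, it is worth repeating the check against realistic data volumes, since the planner's choice can change as the tables grow.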

The Repost table needs an overhaul of its indexes. Change

  PRIMARY KEY (`id`),
  KEY `IDX_DA9843F364B64DCC` (`userId`),
  KEY `IDX_DA9843F3E094D20D` (`postId`),
  KEY `repost_time_idx` (`time`),
  KEY `repost_stream_idx` (`time`,`userId`,`postId`),

to

  PRIMARY KEY(postId,userId,time,id),-- `id` is for uniqueness
  INDEX(id)  -- to keep AUTO_INCREMENT happy

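(I don't know whether any of the other indexes are in use.)

Change IN ( SELECT ... ) to EXISTS ( SELECT 1 ... ).

OR is a performance killer. Time the query with only one side of the OR, then with only the other side. If the sum of those two times is lower than the time of your current query, UNION the two queries together. If that works, simplify each of them. Show me the results; I may have further index suggestions.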
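
A minimal sketch of that experiment, assuming Python's sqlite3 module as a stand-in for MySQL and a small set of generated, hypothetical rows. The timings it prints say nothing about your server; the point is the procedure: time each side, time the OR, and confirm that the UNION rewrite returns the same rows before comparing speeds.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repost (id INTEGER PRIMARY KEY, postid INT, userid INT, time INT);
CREATE TABLE users_following (userid INT, followerid INT,
                              PRIMARY KEY (userid, followerid));
-- Assumption: viewer 1 follows users 2 and 3.
INSERT INTO users_following VALUES (2, 1), (3, 1);
""")
# Hypothetical data: users 1..5 reposting 50 posts at increasing times.
rows = [(i, i % 50, i % 5 + 1, 1000 + i) for i in range(1, 501)]
conn.executemany("INSERT INTO repost VALUES (?, ?, ?, ?)", rows)

OWN = "select * from repost where userid = 1"
FOLLOWED = ("select * from repost where userid in "
            "(select userid from users_following where followerid = 1)")
WITH_OR = ("select * from repost where userid = 1 or userid in "
           "(select userid from users_following where followerid = 1)")
WITH_UNION = OWN + " union all " + FOLLOWED


def timed(sql):
    """Run a query once; return (elapsed seconds, sorted result rows)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return time.perf_counter() - start, sorted(result)


t_own, own_rows = timed(OWN)
t_followed, followed_rows = timed(FOLLOWED)
t_or, or_rows = timed(WITH_OR)
t_union, union_rows = timed(WITH_UNION)

print(f"own={t_own:.6f}s followed={t_followed:.6f}s "
      f"or={t_or:.6f}s union={t_union:.6f}s")
print("same rows:", or_rows == union_rows)
```

UNION ALL is used here because the two sides cannot overlap unless a user follows themselves. If that can happen in your data, plain UNION removes the duplicates at the cost of an extra deduplication step.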
