postgresql – 在Postgres上多个字段上的松散索引扫描？

我在Postgres 9.2(数百万行)中有几个大表,我需要根据两个字段’source'(varchar)和’id'(int)的组合生成一个唯一的代码.我可以通过生成row_numbers来执行此操作：

SELECT source,id FROM tablename GROUP BY source,id

但结果可能需要一段时间才能完成.有人建议,如果字段被索引,并且索引值的比例很小(这是我的情况),那么松散的索引扫描可能是更好的选择：http://wiki.postgresql.org/wiki/Loose_indexscan

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

该示例在单个字段上运行.尝试返回多个字段会产生错误：子查询必须只返回一列.一种可能性是尝试检索整个行 – 例如SELECT ROW(min(source),min(id)…,但是我不确定WHERE语句的语法需要看起来像处理单个行元素.

问题是：可以修改基于递归的代码以使用多个列,如果是,如何？我致力于使用Postgres,但看起来MysqL已经为多个列实现了松散的索引扫描：http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html

按照建议,我附上了EXPLAIN ANALYZE结果.

对于我的情况 – 我使用GROUP BY为2列选择不同的值,它是以下内容：

HashAggregate  (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
 Total runtime: 36127.789 ms
(3 rows)

我不知道如何进行2列索引扫描(这是问题),但为了进行比较,在一列上使用GROUP BY,我得到：

HashAggregate  (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
 Total runtime: 32350.899 ms
(3 rows)

但是对于一列上的松散索引扫描,我得到：

Result  (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
   CTE t
     ->  Recursive Union  (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
           ->  Result  (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                         ->  Index Only Scan using tablename_id on tablename  (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
                               Index Cond: (id IS NOT NULL)
                               Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
                 Filter: (id IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 3
                   ->  Result  (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
                         InitPlan 2 (returns $3)
                           ->  Limit  (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
                                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
                                       Index Cond: ((id IS NOT NULL) AND (id > t.id))
                                       Heap Fetches: 0
   ->  Append  (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
         ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
               Filter: (id IS NOT NULL)
               Rows Removed by Filter: 1
         ->  Result  (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               One-Time Filter: $5
               InitPlan 5 (returns $5)
                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                       Index Cond: (id IS NULL)
                       Heap Fetches: 0
 Total runtime: 2.040 ms

完整的表定义如下所示：

CREATE TABLE tablename
(
  source character(25),id bigint NOT NULL,time_ timestamp without time zone,height numeric,lon numeric,lat numeric,distance numeric,status character(3),geom geometry(PointZ,4326),relid bigint
)
WITH (
  OIDS=FALSE
);

CREATE INDEX tablename_height
  ON public.tablename
  USING btree
  (height);

CREATE INDEX tablename_geom
  ON public.tablename
  USING gist
  (geom);


CREATE INDEX tablename_id
  ON public.tablename
  USING btree
  (id);

CREATE INDEX tablename_lat
  ON public.tablename
  USING btree
  (lat);

CREATE INDEX tablename_lon
  ON public.tablename
  USING btree
  (lon);

CREATE INDEX tablename_relid
  ON public.tablename
  USING btree
  (relid);

CREATE INDEX tablename_sid
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default",id);

CREATE INDEX tablename_source
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default");

CREATE INDEX tablename_time
  ON public.tablename
  USING btree
  (time_);

答案选择：

我花了一些时间来比较所提供的方法.有时这样我希望不止一个答案可以被接受,但在这种情况下,我正在给@jjanes打勾.这样做的原因是他的解决方案与最初提出的问题更紧密地匹配,并且我能够获得关于所需WHERE语句的形式的一些见解.最后,HashAggregate实际上是最快的方法(对我来说),但这是由于我的数据的性质,而不是算法的任何问题.我已经为下面的不同方法添加了EXPLAIN ANALYZE,并且将为jjanes和joop提供1.

HashAggregate：

HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
 Total runtime: 24787.292 ms

松散索引扫描修改

CTE Scan on t  (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
   Filter: (source IS NOT NULL)
   Rows Removed by Filter: 1
   CTE t
     ->  Recursive Union  (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
           ->  Limit  (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
                 ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
                       Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
                 Filter: (source IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                               Index Cond: (ROW(source,id) > ROW(t.source,t.id))
                               Heap Fetches: 0
                 SubPlan 2
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
                               Index Cond: (ROW(source,t.id))
                               Heap Fetches: 0
 Total runtime: 250491.559 ms

合并反加入

Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 363464177
   ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
         Heap Fetches: 0
   ->  Materialize  (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
         ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
               Heap Fetches: 0
 Total runtime: 541750.026 ms

相当可怕,但这似乎工作：

WITH RECURSIVE
     t AS (
  select a,b from (select a,b from foo order by a,b limit 1) asdf union all 
  select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),(select b from foo where (a,b limit 1) 
     from t where t.a is not null)  
select * from t where t.a is not null;

我真的不明白为什么需要“不是空”,因为空值首先出现在哪里？

原文地址：https://www.jb51.cc/postgresql/191892.html

postgresql – 在Postgres上多个字段上的松散索引扫描？

相关推荐