How can wordcount performance in PostgreSQL be brought close to Flink?

drop table if exists words;
create unlogged table words (w text);

Update 2 (insert):

postgres=# explain analyze insert into words (w) select 'The oppressor''s wrong, the proud man''s contumely,' || gs from generate_series(1,1000000,1) gs;
... 
Time: 779.863 ms

postgres=# explain analyze select w,count(*) from (select regexp_split_to_table(w,'\W+') w from words) t group by w;

Finalize GroupAggregate  (cost=2977788.77..2977814.77 rows=200 width=40) (actual time=10456.530..11693.285 rows=1000008 loops=1)
   Group Key: (regexp_split_to_table(words.w, '\W+'::text))
   ->  Gather Merge  (cost=2977788.77..2977811.77 rows=200 width=40) (actual time=10456.511..11467.083 rows=1000016 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=2976788.76..2976789.26 rows=200 width=40) (actual time=10426.911..10779.982 rows=500008 loops=2)
               Sort Key: (regexp_split_to_table(words.w, '\W+'::text))
               Sort Method: external merge  Disk: 12888kB
               Worker 0:  Sort Method: external merge  Disk: 12624kB
               ->  Partial HashAggregate  (cost=2976779.12..2976781.12 rows=200 width=40) (actual time=8501.496..8738.640 rows=500008 loops=2)
                     Group Key: regexp_split_to_table(words.w, '\W+'::text)
                     ->  ProjectSet  (cost=0.00..2961779.12 rows=588235000 width=32) (actual time=186.627..6793.997 rows=5000000 loops=2)
                           ->  Parallel Seq Scan on words  (cost=0.00..16192.35 rows=588235 width=55) (actual time=2.021..110.080 rows=500000 loops=2)
 Planning Time: 0.062 ms
 JIT:
   Functions: 19
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.672 ms, Inlining 119.833 ms, Optimization 154.135 ms, Emission 94.765 ms, Total 370.405 ms
 Execution Time: 11730.620 ms
(19 rows)

cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c

  1  Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz

lsb_release -a

LSB Version:    core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:    20.04
Codename:   focal

pg_config --configure

'--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--libexecdir=/usr/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-icu' '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' 'PYTHON=/usr/bin/python3' '--mandir=/usr/share/postgresql/12/man' '--docdir=/usr/share/doc/postgresql-doc-12' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/12' '--bindir=/usr/lib/postgresql/12/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu 12.5-0ubuntu0.20.04.1)' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' 'CLANG=/usr/bin/clang-10' '--with-systemd' '--with-selinux' 'MKDIR_P=/bin/mkdir -p' 'TAR=/bin/tar' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' '--with-gssapi' '--with-ldap' '--with-includes=/usr/include/mit-krb5' '--with-libs=/usr/lib/mit-krb5' '--with-libs=/usr/lib/x86_64-linux-gnu/mit-krb5' 'build_alias=x86_64-linux-gnu' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'

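Both the regexp_split step and the group step are very slow. What can be done about it? Or is this simply the difference between a database and something like Flink?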

Update 1:

postgres=# explain analyze select trim(w),count(*) from (select unnest(string_to_array(w,' '))  as w from words) t group by 1;
                                                                         
Finalize GroupAggregate  (cost=227496.61..227523.11 rows=200 width=40) (actual time=8585.183..10923.381 rows=1000006 loops=1)
   Group Key: (btrim(t.w))
   ->  Gather Merge  (cost=227496.61..227519.61 rows=200 width=40) (actual time=8585.160..10674.082 rows=1000012 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=226496.60..226497.10 rows=200 width=40) (actual time=8548.068..9358.588 rows=500006 loops=2)
               Sort Key: (btrim(t.w))
               Sort Method: external merge  Disk: 20024kB
               Worker 0:  Sort Method: external merge  Disk: 20376kB
               ->  Partial HashAggregate  (cost=226486.45..226488.95 rows=200 width=40) (actual time=3993.156..4279.348 rows=500006 loops=2)
                     Group Key: btrim(t.w)
                     ->  Subquery Scan on t  (cost=0.00..176486.45 rows=10000000 width=32) (actual time=9.028..2656.705 rows=3500000 loops=2)
                           ->  ProjectSet  (cost=0.00..51486.45 rows=5882350 width=32) (actual time=9.021..1338.301 rows=3500000 loops=2)
                                 ->  Parallel Seq Scan on words  (cost=0.00..16192.35 rows=588235 width=55) (actual time=0.014..90.173 rows=500000 loops=2)
 Planning Time: 0.084 ms
 JIT:
   Functions: 19
   Options: Inlining false, Optimization false, Deforming true
   Timing: Generation 1.754 ms, Inlining 0.000 ms, Optimization 0.650 ms, Emission 17.068 ms, Total 19.472 ms
 Execution Time: 10964.820 ms
(20 rows)

Time: 10965.255 ms (00:10.965)
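
Reading the two plans above: both sorts spill to disk (Sort Method: external merge), only one parallel worker is launched, and JIT compilation takes about 370 ms in the first run. A minimal sketch of session-level settings that could be tried before re-running the query follows. The specific values are illustrative assumptions, not measured recommendations, and whether they help on this machine is untested.

```sql
-- Illustrative values only; adjust to the memory and cores actually available.
SET work_mem = '256MB';                   -- assumption: large enough for the sorts to stay in memory
SET max_parallel_workers_per_gather = 4;  -- assumption: spare cores and worker slots exist
SET jit = off;                            -- skip JIT compilation for this session

-- Re-run the first query and compare the Sort Method and worker counts in the new plan.
EXPLAIN ANALYZE
SELECT w, count(*)
FROM (SELECT regexp_split_to_table(w, '\W+') AS w FROM words) t
GROUP BY w;
```

If the new plan still shows "external merge" or a single worker, the settings did not take effect as hoped and the bottleneck lies elsewhere, for example in the regular-expression split itself.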
