How can I make wordcount performance in PostgreSQL come close to Flink's?
drop table if exists words;
create unlogged table words (w text);
Update 2 (insert):
postgres=# explain analyze insert into words (w) select 'The oppressor''s wrong, the proud man''s contumely,' || gs from generate_series(1, 1000000, 1) gs;
...
Time: 779.863 ms
postgres=# explain analyze select w, count(*) from (select regexp_split_to_table(w, '\W+') w from words) t group by w;
Finalize GroupAggregate  (cost=2977788.77..2977814.77 rows=200 width=40) (actual time=10456.530..11693.285 rows=1000008 loops=1)
  Group Key: (regexp_split_to_table(words.w, '\W+'::text))
  ->  Gather Merge  (cost=2977788.77..2977811.77 rows=200 width=40) (actual time=10456.511..11467.083 rows=1000016 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Sort  (cost=2976788.76..2976789.26 rows=200 width=40) (actual time=10426.911..10779.982 rows=500008 loops=2)
              Sort Key: (regexp_split_to_table(words.w, '\W+'::text))
              Sort Method: external merge  Disk: 12888kB
              Worker 0:  Sort Method: external merge  Disk: 12624kB
              ->  Partial HashAggregate  (cost=2976779.12..2976781.12 rows=200 width=40) (actual time=8501.496..8738.640 rows=500008 loops=2)
                    Group Key: regexp_split_to_table(words.w, '\W+'::text)
                    ->  ProjectSet  (cost=0.00..2961779.12 rows=588235000 width=32) (actual time=186.627..6793.997 rows=5000000 loops=2)
                          ->  Parallel Seq Scan on words  (cost=0.00..16192.35 rows=588235 width=55) (actual time=2.021..110.080 rows=500000 loops=2)
Planning Time: 0.062 ms
JIT:
  Functions: 19
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.672 ms, Inlining 119.833 ms, Optimization 154.135 ms, Emission 94.765 ms, Total 370.405 ms
Execution Time: 11730.620 ms
(19 rows)
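As a sanity check on the row counts in this plan, here is a minimal Python sketch (not part of the original benchmark) that models what the query computes: split each line on \W+ and count the occurrences of each token. The function name wordcount_regex and the small n are illustrative assumptions, and Python's re module stands in for PostgreSQL's regex engine, which should behave the same way on this ASCII input. Each generated line yields 10 tokens, and the data contains 8 fixed words plus one unique number per row, which is consistent with the 2 x 5,000,000 rows coming out of ProjectSet and the 1,000,008 groups reported above.

```python
import re
from collections import Counter

# Text from the INSERT statement; whether or not a space follows "wrong,"
# makes no difference to a \W+ split.
LINE = "The oppressor's wrong, the proud man's contumely,"


def wordcount_regex(n):
    # Rough model of:
    #   select w, count(*)
    #   from (select regexp_split_to_table(w, '\W+') w from words) t
    #   group by w;
    counts = Counter()
    tokens = 0
    for gs in range(1, n + 1):
        words = re.split(r"\W+", LINE + str(gs))
        tokens += len(words)   # rows produced by the set-returning split
        counts.update(words)   # the GROUP BY / count(*) step
    return tokens, counts


tokens, counts = wordcount_regex(1000)
print(tokens, len(counts))  # 10 tokens per row, n + 8 distinct words
```

Scaling n to 1,000,000 gives 10,000,000 tokens and 1,000,008 distinct words, so nearly every group has a count of 1. The aggregation barely reduces the data, and almost all of it still has to be sorted and merged.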
cat /proc/cpuinfo | grep 'model name' | cut -f2 -d: | uniq -c
1 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
lsb_release -a
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
pg_config --configure
'--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--libexecdir=/usr/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-icu' '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' 'PYTHON=/usr/bin/python3' '--mandir=/usr/share/postgresql/12/man' '--docdir=/usr/share/doc/postgresql-doc-12' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/12' '--bindir=/usr/lib/postgresql/12/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu 12.5-0ubuntu0.20.04.1)' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' 'CLANG=/usr/bin/clang-10' '--with-systemd' '--with-selinux' 'MKDIR_P=/bin/mkdir -p' 'TAR=/bin/tar' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' '--with-gssapi' '--with-ldap' '--with-includes=/usr/include/mit-krb5' '--with-libs=/usr/lib/mit-krb5' '--with-libs=/usr/lib/x86_64-linux-gnu/mit-krb5' 'build_alias=x86_64-linux-gnu' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'
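Both the regexp_split step and the group step are this slow. What can be done about it? Or is this simply the difference between a database and something like Flink?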
Update 1:
postgres=# explain analyze select trim(w), count(*) from (select unnest(string_to_array(w, ' ')) as w from words) t group by 1;
Finalize GroupAggregate  (cost=227496.61..227523.11 rows=200 width=40) (actual time=8585.183..10923.381 rows=1000006 loops=1)
  Group Key: (btrim(t.w))
  ->  Gather Merge  (cost=227496.61..227519.61 rows=200 width=40) (actual time=8585.160..10674.082 rows=1000012 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Sort  (cost=226496.60..226497.10 rows=200 width=40) (actual time=8548.068..9358.588 rows=500006 loops=2)
              Sort Key: (btrim(t.w))
              Sort Method: external merge  Disk: 20024kB
              Worker 0:  Sort Method: external merge  Disk: 20376kB
              ->  Partial HashAggregate  (cost=226486.45..226488.95 rows=200 width=40) (actual time=3993.156..4279.348 rows=500006 loops=2)
                    Group Key: btrim(t.w)
                    ->  Subquery Scan on t  (cost=0.00..176486.45 rows=10000000 width=32) (actual time=9.028..2656.705 rows=3500000 loops=2)
                          ->  ProjectSet  (cost=0.00..51486.45 rows=5882350 width=32) (actual time=9.021..1338.301 rows=3500000 loops=2)
                                ->  Parallel Seq Scan on words  (cost=0.00..16192.35 rows=588235 width=55) (actual time=0.014..90.173 rows=500000 loops=2)
Planning Time: 0.084 ms
JIT:
  Functions: 19
  Options: Inlining false, Optimization false, Deforming true
  Timing: Generation 1.754 ms, Inlining 0.000 ms, Optimization 0.650 ms, Emission 17.068 ms, Total 19.472 ms
Execution Time: 10964.820 ms
(20 rows)
Time: 10965.255 ms (00:10.965)
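To make the row counts in the Update 1 plan easier to interpret, here is a small Python sketch (an illustration, not the original benchmark) that models unnest(string_to_array(w, ' ')) followed by trim, using str.split(' ') and str.strip(' '). The function name wordcount_spaces is hypothetical. It assumes the inserted line has a space after "wrong,", which is what the 7 tokens per row (2 x 3,500,000) and the 1,000,006 groups in the plan imply. Splitting only on spaces leaves punctuation attached, so this variant does not count the same "words" as the \W+ query.

```python
from collections import Counter

# Assumed sample text: a space after "wrong," is inferred from the plan's
# 7 tokens per row; it is not confirmed by the statement as shown.
LINE = "The oppressor's wrong, the proud man's contumely,"


def wordcount_spaces(n):
    # Rough model of:
    #   select trim(w), count(*)
    #   from (select unnest(string_to_array(w, ' ')) as w from words) t
    #   group by 1;
    counts = Counter()
    tokens = 0
    for gs in range(1, n + 1):
        parts = (LINE + str(gs)).split(" ")          # string_to_array(w, ' ')
        tokens += len(parts)                         # rows after unnest
        counts.update(p.strip(" ") for p in parts)   # trim(w), then group
    return tokens, counts


tokens, counts = wordcount_spaces(1000)
print(tokens, len(counts))  # 7 tokens per row, n + 6 distinct keys
```

With n = 1,000,000 this gives 7,000,000 tokens and 1,000,006 distinct keys. Even with the cheaper split function, the sort and merge still have about a million single-row groups to handle.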