explain返回行是通过对应表的列的统计信息来评估的,在系统表pg_stats中。
建表,插入测试数据:
postgres=# create table tb15(id integer,name character varying); CREATE TABLE postgres=# insert into tb15 select generate_series(1,10000),'john'; INSERT 0 10000 postgres=# analyze tb15; ANALYZE例1. where条件完全落在柱状图的边界值。
postgres=# explain(analyze,verbose,buffers)select * from tb15 where id<1000; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Seq Scan on public.tb15 (cost=0.00..180.00 rows=1000 width=9) (actual time=0.009..0.821 rows=999 loops=1) Output: id,name Filter: (tb15.id < 1000) Rows Removed by Filter: 9001 Buffers: shared hit=55 Total runtime: 0.901 ms (6 rows)
postgres=# \x Expanded display is on
postgres=# select * from pg_stats where tablename ='tb15' and attname='id'; -[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- schemaname | public tablename | tb15 attname | id inherited | f null_frac | 0 avg_width | 4 <strong>n_distinct | -1</strong> most_common_vals | most_common_freqs | <strong>histogram_bounds | {1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,8200,8300,8400,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}</strong> correlation | 1 most_common_elems | most_common_elem_freqs | elem_count_histogram |当where条件不是等于的时候,且n_distinct为-1的时候(表示该字段的值都是唯一的),可以参考pg_stats表中的histogram_bounds字段的统计值,也就是表字段的柱状图。
histogram_bounds 的值受参数 default_statistics_target 控制。default_statistics_target默认值为100,default_statistics_target参数值越大,那么PG搜集的列上的统计信息就越精确,当然在表做 Analyze 操作时花费的时间也稍长些。
postgres=# show default_statistics_target ; default_statistics_target --------------------------- 100 (1 row)这里提一下也可以修改字段的统计的default_statistics_target值
postgres=# \h alter table Command: ALTER TABLE Description: change the definition of a table Syntax: ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ] action [,... ] ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ] RENAME [ COLUMN ] column_name TO new_column_name ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ] RENAME CONSTRAINT constraint_name TO new_constraint_name ALTER TABLE [ IF EXISTS ] name RENAME TO new_name ALTER TABLE [ IF EXISTS ] name SET SCHEMA new_schema where action is one of: ADD [ COLUMN ] column_name data_type [ COLLATE collation ] [ column_constraint [ ... ] ] DROP [ COLUMN ] [ IF EXISTS ] column_name [ RESTRICT | CASCADE ] ALTER [ COLUMN ] column_name [ SET DATA ] TYPE data_type [ COLLATE collation ] [ USING expression ] ALTER [ COLUMN ] column_name SET DEFAULT expression ALTER [ COLUMN ] column_name DROP DEFAULT ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL <span style="color:#ff0000;">ALTER [ COLUMN ] column_name SET STATISTICS integer --设置字段的default_statistics_target值</span> ALTER [ COLUMN ] column_name SET ( attribute_option = value [,... ] ) ALTER [ COLUMN ] column_name RESET ( attribute_option [,... ] ) ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN } ADD table_constraint [ NOT VALID ] ... .. .分析rows=1000是怎么来的?
观察上面的柱状图,id<1000,完全落在了前面10个区间内(1,1000),那么 rows=(10/100)*10000=1000.
10: 符合条件id<1000的10个区间;
100:总共100个区间;
10000:10000行数据。
例2. where条件落在柱状图的非边界值。
postgres=# explain(analyze,buffers)select * from tb15 where id<920; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on public.tb15 (cost=0.00..180.00 rows=920 width=9) (actual time=0.013..2.352 rows=919 loops=1) Output: id,name Filter: (tb15.id < 920) Rows Removed by Filter: 9081 Buffers: shared hit=55 Total runtime: 2.453 ms (6 rows)现在条件是id<920,观察上面的柱状图,id<900的记录完整落在柱状图的前面9个区间里(1,900),id>=900 and id<920的落在了第10个区间,这个时候rows=920的计算方法是: rows=((9+(920-900)/(1000-900.0))/100)*10000=920.
9: 符合条件id<900的9个区间;
(920-900):符合条件id>=900 and id<920的记录数;
(1000-900.0):第10个区间的总记录;
(920-900)/(1000-900.0): 符合id>=900 and id<920的在该区间的百分比;
100:总共100个区间;
10000:10000行数据。
先添加几个高频值:
postgres=# insert into tb15 select 1,'join' from generate_series(1,500); INSERT 0 500 postgres=# insert into tb15 select 2,300); INSERT 0 300 postgres=# insert into tb15 select 3,200); INSERT 0 200 postgres=# insert into tb15 select 4,100); INSERT 0 100 postgres=# analyze tb15; ANALYZE
<pre name="code" class="sql">postgres=# select relname,relpages,reltuples from pg_class where relname='tb15'; relname | relpages | reltuples ---------+----------+----------- tb15 | 60 | 11100 (1 row)
现在总共11100条记录。
看一个执行计划:postgres=# explain(analyze,buffers)select * from tb15 where id=1; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on public.tb15 (cost=0.00..198.75 rows=501 width=9) (actual time=0.008..0.922 rows=501 loops=1) Output: id,name Filter: (tb15.id = 1) Rows Removed by Filter: 10599 Buffers: shared hit=60 Total runtime: 0.962 ms (6 rows)分析rows=501是怎么来的?
首先查看该表的该字段的统计信息:
postgres=# select * from pg_stats where tablename ='tb15' and attname='id'; -[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- schemaname | public tablename | tb15 attname | id inherited | f null_frac | 0 avg_width | 4 n_distinct | -0.900901 most_common_vals | {1,2,3,4} --高频值 most_common_freqs | {0.0451351,0.0271171,0.0181081,0.0090991} --高频值对应的占比 <span style="color:#ff0000;">histogram_bounds | {5,104,204,304,404,504,604,704,804,904,1004,1104,1204,1304,1404,1504,1604,1704,1804,1904,2004,2103,2203,2303,2403,2503,2603,2703,2803,2903,3003,3103,3203,3303,3403,3503,3603,3703,3803,3903,4003,4102,4202,4302,4402,4502,4602,4702,4802,4902,5002,5102,5202,5302,5402,5502,5602,5702,5802,5902,6002,6101,6201,6301,6401,6501,6601,6701,6801,6901,7001,7101,7201,7301,7401,7501,7601,7701,7801,7901,8001,10000}</span> correlation | 0.464526 most_common_elems | most_common_elem_freqs | elem_count_histogram |可以看到柱状图是从5开始的,柱状图已经把高频值排除在外。
如果where条件是高频值里面的,就直接通过高频值对应的占比来计算返回的行数, rows = 11100*0.0451351=500.9996100=501。
第二个例子
postgres=# explain(analyze,buffers)select * from tb15 where id<3; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on public.tb15 (cost=0.00..198.75 rows=803 width=9) (actual time=0.010..1.174 rows=802 loops=1) Output: id,name Filter: (tb15.id < 3) Rows Removed by Filter: 10298 Buffers: shared hit=60 Total runtime: 1.233 ms (6 rows)
rows=11100*(0.0451351+0.0271171)=801.9994200;
postgres=# explain(analyze,buffers)select * from tb15 where id=1 or id>9900; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on public.tb15 (cost=0.00..226.50 rows=596 width=9) (actual time=0.009..1.673 rows=601 loops=1) Output: id,name Filter: ((tb15.id = 1) OR (tb15.id > 9900)) Rows Removed by Filter: 10499 Buffers: shared hit=60 Total runtime: 1.726 ms (6 rows)rows = 11100*(0.0451351)+1/100.0*(1-0.0451351-0.0271171-0.0181081-0.0090991)*11100=600.9596166
例3:多个列查询条件的选择性相乘评估例子
postgres=# explain select * from tb15 where id=1 and name='john'; QUERY PLAN -------------------------------------------------------- Seq Scan on tb15 (cost=0.00..226.50 rows=451 width=9) Filter: ((id = 1) AND ((name)::text = 'john'::text)) (2 rows
postgres=# select * from pg_stats where tablename ='tb15' and attname='id';--id字段的统计信息 -[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- schemaname | public tablename | tb15 attname | id inherited | f null_frac | 0 avg_width | 4 n_distinct | -0.900901 <span style="color:#ff0000;">most_common_vals | {1,4} most_common_freqs | {0.0451351,0.0090991}</span> histogram_bounds | {5,10000} correlation | 0.464526 most_common_elems | most_common_elem_freqs | elem_count_histogram |
postgres=# select * from pg_stats where tablename ='tb15' and attname='name'; --name字段的统计信息 -[ RECORD 1 ]----------+--------------------- schemaname | public tablename | tb15 attname | name inherited | f null_frac | 0 avg_width | 5 n_distinct | 2 most_common_vals | {john,join} most_common_freqs | {0.900901,0.0990991} histogram_bounds | correlation | 1 most_common_elems | most_common_elem_freqs | elem_count_histogram |
rows = 11100*(0.0451351*0.900901)=451.3510496486100。
原文链接:https://www.f2er.com/postgresql/195343.html