PostgreSQL explain返回行的评估

explain返回行是通过对应表的列的统计信息来评估的，在系统表pg_stats中。

建表，插入测试数据：

postgres=# create table tb15(id integer,name character varying);
CREATE TABLE
postgres=# insert into tb15 select generate_series(1,10000),'john';
INSERT 0 10000
postgres=# analyze tb15;
ANALYZE

例1. where条件完全落在柱状图的边界值。

postgres=# explain(analyze,verbose,buffers)select * from tb15 where id<1000;
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tb15  (cost=0.00..180.00 rows=1000 width=9) (actual time=0.009..0.821 rows=999 loops=1)
   Output: id,name
   Filter: (tb15.id < 1000)
   Rows Removed by Filter: 9001
   Buffers: shared hit=55
 Total runtime: 0.901 ms
(6 rows)

postgres=# \x
Expanded display is on

postgres=# select * from pg_stats where tablename ='tb15' and attname='id';
-[ RECORD 1 ]----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | tb15
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
<strong>n_distinct             | -1</strong>
most_common_vals       | 
most_common_freqs      | 
<strong>histogram_bounds       | {1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,8200,8300,8400,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}</strong>
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

当where条件不是等于的时候，且n_distinct为-1的时候（表示该字段的值都是唯一的），可以参考pg_stats表中的histogram_bounds字段的统计值，也就是表字段的柱状图。
histogram_bounds 的值受参数 default_statistics_target 控制。default_statistics_target默认值为100，default_statistics_target参数值越大，那么PG搜集的列上的统计信息就越精确，当然在表做 Analyze 操作时花费的时间也稍长些。

postgres=# show default_statistics_target ;
 default_statistics_target 
---------------------------
 100
(1 row)

这里提一下也可以修改字段的统计的default_statistics_target值

postgres=# \h alter table 
Command:     ALTER TABLE
Description: change the definition of a table
Syntax:
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
    action [,... ]
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
    RENAME [ COLUMN ] column_name TO new_column_name
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
    RENAME CONSTRAINT constraint_name TO new_constraint_name
ALTER TABLE [ IF EXISTS ] name
    RENAME TO new_name
ALTER TABLE [ IF EXISTS ] name
    SET SCHEMA new_schema


where action is one of:


    ADD [ COLUMN ] column_name data_type [ COLLATE collation ] [ column_constraint [ ... ] ]
    DROP [ COLUMN ] [ IF EXISTS ] column_name [ RESTRICT | CASCADE ]
    ALTER [ COLUMN ] column_name [ SET DATA ] TYPE data_type [ COLLATE collation ] [ USING expression ]
    ALTER [ COLUMN ] column_name SET DEFAULT expression
    ALTER [ COLUMN ] column_name DROP DEFAULT
    ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL
    <span style="color:#ff0000;">ALTER [ COLUMN ] column_name SET STATISTICS integer --设置字段的default_statistics_target值</span>
    ALTER [ COLUMN ] column_name SET ( attribute_option = value [,... ] )
    ALTER [ COLUMN ] column_name RESET ( attribute_option [,... ] )
    ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
    ADD table_constraint [ NOT VALID ]
    ...
    ..
    .

分析rows=1000是怎么来的？
观察上面的柱状图，id<1000,完全落在了前面10个区间内（1,1000），那么 rows=(10/100)*10000=1000.
10: 符合条件id<1000的10个区间；
100：总共100个区间；
10000：10000行数据。

例2. where条件落在柱状图的非边界值。

postgres=# explain(analyze,buffers)select * from tb15 where id<920;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on public.tb15  (cost=0.00..180.00 rows=920 width=9) (actual time=0.013..2.352 rows=919 loops=1)
   Output: id,name
   Filter: (tb15.id < 920)
   Rows Removed by Filter: 9081
   Buffers: shared hit=55
 Total runtime: 2.453 ms
(6 rows)

现在条件是id<920,观察上面的柱状图，id<900的记录完整落在柱状图的前面9个区间里（1,900），id>=900 and id<920的落在了第10个区间，这个时候rows=920的计算方法是： rows=((9+(920-900)/(1000-900.0))/100)*10000=920.
9: 符合条件id<900的9个区间；
(920-900)：符合条件id>=900 and id<920的记录数；
(1000-900.0)：第10个区间的总记录；
(920-900)/(1000-900.0)：符合id>=900 and id<920的在该区间的百分比；
100：总共100个区间；

10000：10000行数据。

例3：有高频值
先添加几个高频值：

postgres=# insert into tb15 select 1,'join' from generate_series(1,500);
INSERT 0 500
postgres=# insert into tb15 select 2,300);
INSERT 0 300
postgres=# insert into tb15 select 3,200);
INSERT 0 200
postgres=# insert into tb15 select 4,100);
INSERT 0 100
postgres=# analyze tb15;
ANALYZE

<pre name="code" class="sql">postgres=# select relname,relpages,reltuples from pg_class  where relname='tb15';
 relname | relpages | reltuples 
---------+----------+-----------
 tb15    |       60 |     11100
(1 row)

现在总共11100条记录。

看一个执行计划：

postgres=# explain(analyze,buffers)select * from tb15 where id=1;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on public.tb15  (cost=0.00..198.75 rows=501 width=9) (actual time=0.008..0.922 rows=501 loops=1)
   Output: id,name
   Filter: (tb15.id = 1)
   Rows Removed by Filter: 10599
   Buffers: shared hit=60
 Total runtime: 0.962 ms
(6 rows)

分析rows=501是怎么来的？
首先查看该表的该字段的统计信息：

postgres=# select * from pg_stats where tablename ='tb15' and attname='id';
-[ RECORD 1 ]----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | tb15
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.900901
most_common_vals       | {1,2,3,4} --高频值
most_common_freqs      | {0.0451351,0.0271171,0.0181081,0.0090991} --高频值对应的占比
<span style="color:#ff0000;">histogram_bounds       | {5,104,204,304,404,504,604,704,804,904,1004,1104,1204,1304,1404,1504,1604,1704,1804,1904,2004,2103,2203,2303,2403,2503,2603,2703,2803,2903,3003,3103,3203,3303,3403,3503,3603,3703,3803,3903,4003,4102,4202,4302,4402,4502,4602,4702,4802,4902,5002,5102,5202,5302,5402,5502,5602,5702,5802,5902,6002,6101,6201,6301,6401,6501,6601,6701,6801,6901,7001,7101,7201,7301,7401,7501,7601,7701,7801,7901,8001,10000}</span>
correlation            | 0.464526
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

可以看到柱状图是从5开始的，柱状图已经把高频值排除在外。
如果where条件是高频值里面的，就直接通过高频值对应的占比来计算返回的行数， rows = 11100*0.0451351=500.9996100=501。

第二个例子

postgres=# explain(analyze,buffers)select * from tb15 where id<3;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on public.tb15  (cost=0.00..198.75 rows=803 width=9) (actual time=0.010..1.174 rows=802 loops=1)
   Output: id,name
   Filter: (tb15.id < 3)
   Rows Removed by Filter: 10298
   Buffers: shared hit=60
 Total runtime: 1.233 ms
(6 rows)

rows=11100*(0.0451351+0.0271171)=801.9994200;

第三个例子：条件中部分值在高频值内

postgres=# explain(analyze,buffers)select * from tb15 where id=1 or id>9900;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on public.tb15  (cost=0.00..226.50 rows=596 width=9) (actual time=0.009..1.673 rows=601 loops=1)
   Output: id,name
   Filter: ((tb15.id = 1) OR (tb15.id > 9900))
   Rows Removed by Filter: 10499
   Buffers: shared hit=60
 Total runtime: 1.726 ms
(6 rows)

rows = 11100*(0.0451351)+1/100.0*(1-0.0451351-0.0271171-0.0181081-0.0090991)*11100=600.9596166

例3：多个列查询条件的选择性相乘评估例子

postgres=# explain select * from tb15 where id=1 and name='john';
                       QUERY PLAN                       
--------------------------------------------------------
 Seq Scan on tb15  (cost=0.00..226.50 rows=451 width=9)
   Filter: ((id = 1) AND ((name)::text = 'john'::text))
(2 rows

postgres=# select * from pg_stats where tablename ='tb15' and attname='id';--id字段的统计信息
-[ RECORD 1 ]----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | tb15
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.900901
<span style="color:#ff0000;">most_common_vals       | {1,4}
most_common_freqs      | {0.0451351,0.0090991}</span>
histogram_bounds       | {5,10000}
correlation            | 0.464526
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

postgres=# select * from pg_stats where tablename ='tb15' and attname='name'; --name字段的统计信息
-[ RECORD 1 ]----------+---------------------
schemaname             | public
tablename              | tb15
attname                | name
inherited              | f
null_frac              | 0
avg_width              | 5
n_distinct             | 2
most_common_vals       | {john,join}
most_common_freqs      | {0.900901,0.0990991}
histogram_bounds       | 
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

rows = 11100*(0.0451351*0.900901)=451.3510496486100。

PostgreSQL explain返回行的评估

猜你在找的Postgre SQL相关文章