这是我的问题的一个简单示例:
CREATE TABLE test1 (id SERIAL,key TEXT UNIQUE,value TEXT); CREATE TABLE test2 (id SERIAL,value TEXT); INSERT INTO test1 (key,value) SELECT i::TEXT,'ABC' || i::TEXT FROM generate_series(0,1000000) AS i; INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT FROM generate_series(0,600000) AS i; INSERT INTO test2 (key,'ABC' || (i+1000)::TEXT FROM generate_series(1000000,1200000) AS i; CREATE INDEX test1_key ON test1 (key); CREATE INDEX test1_value ON test1 (value); CREATE INDEX test2_key ON test2 (key); CREATE INDEX test2_value ON test2 (value); VACUUM FULL VERBOSE ANALYZE test1; VACUUM FULL VERBOSE ANALYZE test2;
这是我目前使用的查询,但需要超过6秒.
EXPLAIN ANALYZE SELECT test1.key AS key1,test1.value AS value1,test2.key AS key2,test2.value AS value2 FROM test1 LEFT OUTER JOIN test2 ON (test1.key = test2.key) WHERE test1.value = 'ABC1234' OR test2.value = 'ABC1234'; key1 | value1 | key2 | value2 ------+---------+------+--------- 234 | ABC234 | 234 | ABC1234 1234 | ABC1234 | 1234 | ABC2234 (2 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Hash Left Join (cost=27344.05..79728.10 rows=2 width=32) (actual time=5428.635..6097.098 rows=2 loops=1) Hash Cond: (test1.key = test2.key) Filter: ((test1.value = 'ABC1234'::text) OR (test2.value = 'ABC1234'::text)) -> Seq Scan on test1 (cost=0.00..16321.01 rows=1000001 width=15) (actual time=0.009..1057.315 rows=1000001 loops=1) -> Hash (cost=13047.02..13047.02 rows=800002 width=17) (actual time=2231.964..2231.964 rows=800002 loops=1) Buckets: 65536 Batches: 2 Memory Usage: 14551kB -> Seq Scan on test2 (cost=0.00..13047.02 rows=800002 width=17) (actual time=0.010..980.232 rows=800002 loops=1) Total runtime: 6109.042 ms (8 rows)
在这两个表中,只有极少数数据集符合要求,但似乎没有观察到这一事实.我可以改为使用这样的查询:
EXPLAIN ANALYZE SELECT coalesce(test1.key,test3.key1) AS key1,coalesce(test1.value,test3.value1) AS value1,coalesce(test2.key,test3.key2) AS key2,coalesce(test2.value,test3.value2) AS value2 FROM (SELECT test1.key AS key1,test2.value AS value2 FROM (SELECT key,value FROM test1 WHERE value = 'ABC1234') AS test1 FULL JOIN (SELECT key,value FROM test2 WHERE value = 'ABC1234') AS test2 ON (test1.key = test2.key)) AS test3 LEFT OUTER JOIN test1 ON (test1.key = test3.key2) LEFT OUTER JOIN test2 ON (test2.key = test3.key1) WHERE test1.key IS NOT NULL; key1 | value1 | key2 | value2 ------+---------+------+--------- 1234 | ABC1234 | 1234 | ABC2234 234 | ABC234 | 234 | ABC1234 (2 rows) QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop Left Join (cost=0.00..33.56 rows=1 width=64) (actual time=0.075..0.083 rows=1 loops=1) -> Nested Loop (cost=0.00..25.19 rows=1 width=47) (actual time=0.066..0.072 rows=1 loops=1) -> Nested Loop Left Join (cost=0.00..16.80 rows=1 width=32) (actual time=0.051..0.054 rows=1 loops=1) -> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.026..0.027 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.020..0.020 rows=0 loops=1) Index Cond: (public.test1.key = public.test2.key) Filter: (public.test1.value = 'ABC1234'::text) -> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.011..0.013 rows=1 loops=1) Index Cond: ((public.test1.key IS NOT NULL) AND (public.test1.key = public.test2.key)) -> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.001..0.001 rows=0 loops=1) Index Cond: (public.test2.key = public.test1.key) Total runtime: 0.139 ms
以下查询更简单,但仍然太慢:
EXPLAIN ANALYZE SELECT test1.key AS key1,test2.value AS value2 FROM test1 LEFT OUTER JOIN test2 ON (test1.key = test2.key) WHERE test1.value = 'ABC1234' OR EXISTS (SELECT 1 FROM test2 t WHERE t.key = test1.key AND t.value = 'ABC1234'); key1 | value1 | key2 | value2 ------+---------+------+--------- 1234 | ABC1234 | 1234 | ABC2234 234 | ABC234 | 234 | ABC1234 (2 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Merge Left Join (cost=0.00..8446826.32 rows=500001 width=32) (actual time=615.706..1651.370 rows=2 loops=1) Merge Cond: (test1.key = test2.key) -> Index Scan using test1_key on test1 (cost=0.00..8398983.25 rows=500001 width=15) (actual time=28.449..734.567 rows=2 loops=1) Filter: ((value = 'ABC1234'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2)) SubPlan 1 -> Index Scan using test2_key on test2 t (cost=0.00..8.36 rows=1 width=0) (never executed) Index Cond: (key = $0) Filter: (value = 'ABC1234'::text) SubPlan 2 -> Index Scan using test2_value on test2 t (cost=0.00..8.37 rows=1 width=7) (actual time=0.376..0.380 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test2_key on test2 (cost=0.00..39593.05 rows=800002 width=17) (actual time=0.019..498.456 rows=348894 loops=1) Total runtime: 1651.453 ms (13 rows)
所以我的问题是:是否存在一个简单的查询,它将导致类似于第二个查询的类似快速执行计划,或者可能是规划者的索引或某种提示.
(我知道在这个例子中,只有一个表中包含两个值是合理的.但实际上表格更复杂,表格方案不能轻易改变.)
Postgresql Version: 9.0.3 shared_buffers = 64MB effective_cache_size = 32MB work_mem = 16MB maintenance_work_mem = 32MB temp_buffers = 8MB wal_buffers= 1MB
编辑:根据Kipotlov的建议,这里是UNION版本.为什么正常的OR查询没有选择这么好的计划?
EXPLAIN ANALYZE SELECT test1.key AS key1,test2.value AS value2 FROM test1 LEFT OUTER JOIN test2 ON (test1.key = test2.key) WHERE test1.value = 'ABC1234' UNION SELECT test1.key AS key1,test2.value AS value2 FROM test1 LEFT OUTER JOIN test2 ON (test1.key = test2.key) WHERE test2.value = 'ABC1234'; key1 | value1 | key2 | value2 ------+---------+------+--------- 1234 | ABC1234 | 1234 | ABC2234 234 | ABC234 | 234 | ABC1234 (2 rows) QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------ Unique (cost=33.64..33.66 rows=2 width=32) (actual time=0.114..0.119 rows=2 loops=1) -> Sort (cost=33.64..33.64 rows=2 width=32) (actual time=0.111..0.113 rows=2 loops=1) Sort Key: public.test1.key,public.test1.value,public.test2.key,public.test2.value Sort Method: quicksort Memory: 17kB -> Append (cost=0.00..33.63 rows=2 width=32) (actual time=0.046..0.097 rows=2 loops=1) -> Nested Loop Left Join (cost=0.00..16.81 rows=1 width=32) (actual time=0.044..0.050 rows=1 loops=1) -> Index Scan using test1_value_key on test1 (cost=0.00..8.44 rows=1 width=15) (actual time=0.023..0.024 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test2_key on test2 (cost=0.00..8.36 rows=1 width=17) (actual time=0.014..0.016 rows=1 loops=1) Index Cond: (public.test1.key = public.test2.key) -> Nested Loop (cost=0.00..16.80 rows=1 width=32) (actual time=0.036..0.041 rows=1 loops=1) -> Index Scan using test2_value_key on test2 (cost=0.00..8.41 rows=1 width=17) (actual time=0.019..0.020 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.013..0.015 rows=1 loops=1) Index Cond: (public.test1.key = public.test2.key) Total runtime: 0.173 ms (16 rows)
解决方法
首先,感谢非常详细的问题.在询问之前,很难找到研究过他们问题的人.
我一直在考虑这个问题,问题似乎是Postgresql希望加入所有行,因为test1中的每个非匹配行可能在test2中匹配 – 反之亦然.
解决方案是强制规划器分两步执行查询.一种方法是您已经尝试过的大型UNION查询 – 强制它在单独的查询中考虑每个表达式.
另一种方法是强制规划器首先找到匹配的密钥,然后执行连接,这样就不会有歧义:
EXPLAIN ANALYZE SELECT test1.key AS key1,test2.value AS value2 FROM ( SELECT key FROM test1 WHERE value='ABC1234' UNION SELECT key FROM test2 WHERE value='ABC1234' ) AS matching_keys INNER JOIN test1 USING (key) LEFT OUTER JOIN test2 USING (key); Nested Loop Left Join (cost=16.84..34.44 rows=2 width=32) (actual time=0.211..0.280 rows=2 loops=1) -> Nested Loop (cost=16.84..33.65 rows=2 width=15) (actual time=0.175..0.212 rows=2 loops=1) -> Unique (cost=16.84..16.85 rows=2 width=6) (actual time=0.132..0.136 rows=2 loops=1) -> Sort (cost=16.84..16.85 rows=2 width=6) (actual time=0.131..0.132 rows=2 loops=1) Sort Key: public.test1.key Sort Method: quicksort Memory: 25kB -> Append (cost=0.00..16.83 rows=2 width=6) (actual time=0.058..0.110 rows=2 loops=1) -> Index Scan using test1_value on test1 (cost=0.00..8.42 rows=1 width=6) (actual time=0.056..0.058 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test2_value on test2 (cost=0.00..8.39 rows=1 width=7) (actual time=0.046..0.047 rows=1 loops=1) Index Cond: (value = 'ABC1234'::text) -> Index Scan using test1_key on test1 (cost=0.00..8.38 rows=1 width=15) (actual time=0.032..0.033 rows=1 loops=2) Index Cond: (key = public.test1.key) -> Index Scan using test2_key on test2 (cost=0.00..0.38 rows=1 width=17) (actual time=0.028..0.029 rows=1 loops=2) Index Cond: (public.test1.key = key) Total runtime: 0.390 ms (16 rows)
同样,UNION扮演OR的角色.不幸的是,对于像>’ABC1234’这样的查询,这种方法仍然表现不佳.你可以通过提高work_mem来改善它.我在这里不知所措.
至于你的上一个问题:
Why does the normal OR query not choose such a good plan?
因为Postgresql规划器目前缺乏将OR’ed表达式拆分为单独的UNION查询的能力.有一些警告使它比看起来更难.