I'm asking this question specifically about Postgres, since it has good support for R-tree/spatial indexes.
We have the following table with a tree structure (nested set model) of words and their frequencies:
lexikon
-------
_id        integer  PRIMARY KEY
word       text
frequency  integer
lset       integer  UNIQUE KEY
rset       integer  UNIQUE KEY
And the query:
SELECT word FROM lexikon WHERE lset BETWEEN @Low AND @High ORDER BY frequency DESC LIMIT @N
I suppose a covering index on (lset, frequency, word) would be useful, but I feel it may perform poorly if there are too many lset values in the (@Low, @High) range.
A simple index on (frequency DESC) may also be sufficient at times, namely when a search using that index produces the @N rows matching the range condition early on.
But performance seems to depend heavily on the parameter values.
Is there a way to make it perform fast regardless of whether the range (@Low, @High) is wide or narrow, and regardless of whether the top-frequency words luckily happen to be in the (narrow) selected range?
Would an R-tree/spatial index help?
Solution
You can get better performance by searching rows with higher frequencies first. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:
Testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
                      word text,
                      frequency integer,
                      lset integer,
                      width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word,
       (1000000/row_number() over(order by random()))::integer as frequency,
       lset
from (select 'word'||generate_series(1,1000000) word,
             generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function below,
-- but is added to make the timings 'fair'
Granule analysis (mostly for information and tuning):
create table granule as
select width_granule,
       count(*) as freq,
       min(frequency) as granule_start,
       max(frequency) as granule_end
from lexikon
group by width_granule;
--
select * from granule order by 1;
/*
 width_granule |  freq  | granule_start | granule_end
---------------+--------+---------------+-------------
             0 | 500000 |             1 |           1
             1 | 300000 |             2 |           4
             2 | 123077 |             5 |          12
             3 |  47512 |            13 |          33
             4 |  18422 |            34 |          90
             5 |   6908 |            91 |         244
             6 |   2580 |           245 |         665
             7 |    949 |           666 |        1808
             8 |    349 |          1811 |        4901
             9 |    129 |          4926 |       13333
            10 |     47 |         13513 |       35714
            11 |     17 |         37037 |       90909
            12 |      7 |        100000 |      250000
            13 |      2 |        333333 |      500000
            14 |      1 |       1000000 |     1000000
*/
alter table granule drop column freq;
A function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
  returns setof lexikon
  language plpgsql
  set search_path to 'stack'
as $$
declare
  m integer;
  n integer := 0;
  r record;
begin
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select *
                  from lexikon
                  where width_granule=r.width_granule
                        and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;
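The coarse-to-fine loop inside that function can be sketched in plain Python, as a toy in-memory model rather than the Postgres implementation; `build_granules` and the row layout here are illustrative assumptions, not part of the schema above:

```python
import math

def build_granules(rows):
    """Bucket (word, frequency, lset) rows by width_granule = round(ln(frequency)),
    mirroring the `update lexikon set width_granule=...` step."""
    granules = {}
    for word, frequency, lset in rows:
        granules.setdefault(round(math.log(frequency)), []).append(
            (word, frequency, lset))
    return granules

def f(granules, lset_low, lset_high, limit):
    """Scan granules from the highest-frequency band downwards and stop as
    soon as at least `limit` rows match the lset range condition."""
    out = []
    for g in sorted(granules, reverse=True):
        out.extend(r for r in granules[g] if lset_low <= r[2] <= lset_high)
        if len(out) >= limit:
            break
    # The caller still applies ORDER BY frequency DESC LIMIT n, as in the SQL.
    return sorted(out, key=lambda r: -r[1])[:limit]

# Same skew as the testbed: frequency = 1000000 / rank, but only 1000 rows.
rows = [("word%d" % i, 1_000_000 // i, i) for i in range(1, 1001)]
top = f(build_granules(rows), 100, 500, 3)
# top holds the three most frequent words with lset between 100 and 500
```

The early exit is the whole point: the densely populated low-frequency granules are never touched when the high-frequency ones already satisfy the limit.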
Results (timings should be taken with a pinch of salt, but each query is run twice to counter any caching effects)
First using the function we have written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 0.510 ms
*/
And then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you may want to vary the number of granules and the function used for placing rows into them. The actual distribution of frequencies is key here, as are the expected values of the limit clause and the size of the lset ranges sought.
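As a small illustration of that tuning knob, the following sketch (in Python, and purely hypothetical: the testbed itself uses ln() in SQL) compares how two granulation functions partition the same skewed frequency distribution:

```python
import math
from collections import Counter

# Skewed frequencies like the testbed: frequency = 1000000 / rank.
freqs = [1_000_000 // rank for rank in range(1, 100_001)]

# Candidate granulation functions: ln() is what the SQL above uses;
# log10() would yield fewer, coarser granules (fewer index probes per
# call, but each granule scan touches more non-matching rows).
by_ln    = Counter(round(math.log(f))   for f in freqs)
by_log10 = Counter(round(math.log10(f)) for f in freqs)

print(len(by_ln), "granules with ln,", len(by_log10), "granules with log10")
```

More granules mean the procedural loop can stop sooner, at the cost of more iterations when few rows match; fewer granules trade that the other way.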