pg_trgm是用来做相似度匹配的,在一些情况下也可以拿来代替全文检索做字符匹配。
从大量数据中通过字符串的匹配查找数据的关键是索引,对字符串的精确相等匹配,前缀匹配(like 'x%')和后缀匹配(like '%x')可以使用btree索引,对中缀匹配(like '%x%')和正则表达式匹配就可以用pg_trgm的索引了。
1.环境
Ubuntu 14.04
Postgresql 9.4.0
1000万随机产生的MD5数据的前8个字符。
postgres=#createtabletb(infotext); CREATETABLE postgres=#insertintotbselectsubstring(md5(random()::text),1,8)fromgenerate_series(1,10000000); INSERT010000000 shop_5>SELECT*fromtbwhereinfo~'4c589'limit1; +----------+ |info| |----------| |a64c5891| +----------+ SELECT1 Time:0.003s shop_5>SELECT*fromtbwhereinfo~'4c589'limit20; +----------+ |info| |----------| |a64c5891| |5d4c5899| |b6b4c589| |984c589a| |省略数据| |8d4c5894| |c0e4c589| |974c589b| +----------+ SELECT20 Time:4.035s#直接模糊查询需要4s shop_5>CREATEINDEXtb_info_gin_idxontbusinggin(infogin_trgm_ops); CREATEINDEX Time:54.878s shop_5>SELECT*fromtbwhereinfo~'4c589'limit20; +----------+ |info| |----------| |4c589ea6| |a734c589| |54c58961| |4c589d5d| |a64c5891| |省略数据| |a4c589b7| |4c5894ea| |b4c589f6| |6d14c589| |aa4c5897| +----------+ SELECT20 Time:0.005s#加索引后,模糊查询只需要0.005s shop_5>explainanalyzeSELECT*fromtbwhereinfo~'4c589'limit20; +-------------------------------------------------------------------------------------- |QUERYPLAN |-------------------------------------------------------------------------------------- |Limit(cost=51.75..123.23rows=20width=9)(actualtime=2.431..2.478rows=20loops=1 |->BitmapHeapScanontb(cost=51.75..3625.71rows=1000width=9)(actualtime=2. |RecheckCond:((info)::text~'4c589'::text) |HeapBlocks:exact=20 |->BitmapIndexScanontb_info_gin_idx(cost=0.00..51.50rows=1000width=0 |IndexCond:((info)::text~'4c589'::text) |Planningtime:0.210ms |Executiontime:2.503ms +-------------------------------------------------------------------------------------- EXPLAIN Time:0.005s shop_5>\dtb; +----------+-----------------------+-------------+ |Column|Type|Modifiers| |----------+-----------------------+-------------| |info|charactertext|| +----------+-----------------------+-------------+ Indexes: "tb_info_gin_idx"gin(infogin_trgm_ops) Time:0.004s shop_5>SELECTpg_relation_size('tb_info_gin_idx'); +--------------------+ |pg_relation_size| |--------------------| |234635264| +--------------------+ SELECT1 Time:0.001s shop_5> shop_5>SELECTpg_size_pretty(pg_table_size('tb')); +------------------+ |pg_size_pretty| |------------------| |422MB| +------------------+ SELECT1Time:0.001s shop_5>
###############
这类场景中,gist 效率没有 gin 效率高。
除了like,gin_trgm索引还可以用在正则表达式匹配上。
参考:
http://blog.chinaunix.net/uid-20726500-id-4824895.html
https://yq.aliyun.com/articles/7444