我正在调整一个Web分析工具,使用Vertica作为数据库.我有优化连接的真正问题.我尝试为我的一些查询创建预先连接的预测,尽管它确实使查询变得更快,但它减慢了数据加载到事实表中以进行爬网.
一个简单的INSERT INTO … SELECT * FROM,我们用来从分段表将数据加载到事实表中,从花费约5秒到花费20分钟.
因此,我删除了所有预先连接的预测,并尝试使用数据库设计器来设计查询特定的预测,但这还不够.即使有这些预测,一个简单的连接也要花费大约14秒钟,预先连接的投影需要1秒钟的时间.
我的问题是:对于预先连接的预测来说,如果这样很慢地插入数据插入是正常的,如果没有,这可能是什么罪魁祸首?如果是正常的,那么这对我们来说是一个阻挡,还有其他的技术可以用来加速连接?
我们在5节点集群上运行Vertica,每个节点具有2 x四核cpu和32 GB内存.我的示例查询中的表分别有188,843,085和25,712,878行.
EXPLAIN输出如下所示:
EXPLAIN SELECT referer_via_.url as referralPageUrl,COUNT(DISTINCT sessio n.id) as visits FROM owa_session as session JOIN owa_referer AS referer_vi a_ ON session.referer_id = referer_via_.id WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123' AND session.site_id = '49' GROUP BY referer_via_ .url ORDER BY visits DESC LIMIT 250; Access Path: +-SELECT LIMIT 250 [Cost: 1M,Rows: 250 (STALE STATISTICS)] (PATH ID: 0) | Output Only: 250 tuples | Execute on: Query Initiator | +---> SORT [Cost: 1M,Rows: 1 (STALE STATISTICS)] (PATH ID: 1) | | Order: count(DISTINCT "session".id) DESC | | Output Only: 250 tuples | | Execute on: All Nodes | | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M,Rows: 1 (STALE STATISTICS)] (PATH ID: 2) | | | Aggregates: count(DISTINCT "session".id) | | | Group By: referer_via_.url | | | Execute on: All Nodes | | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M,Rows : 1 (STALE STATISTICS)] (PATH ID: 3) | | | | Group By: referer_via_.url,"session".id | | | | Execute on: All Nodes | | | | +---> JOIN HASH [Cost: 1M,Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT) | | | | | Join Cond: ("session".referer_id = referer_via_.id) | | | | | Execute on: All Nodes | | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463,Rows: 1 (ST ALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5) | | | | | | Projection: public.owa_session_projection | | | | | | Materialize: "session".id,"session".referer_id | | | | | | Filter: ("session".site_id = '49') | | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session" .yyyymmdd <= 20121123)) | | | | | | Execute on: All Nodes | | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K,Rows: 26M] (PATH ID: 6) | | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_2012112 2_Potency_20121122 | | | | | | Materialize: referer_via_.id,referer_via_.url | | | | | | Execute on: All Nodes
解决方法
加速加入:
– 将会话表设计为“yyyymmdd”列.这将启用分区修剪
– 将列“yyyymmdd”的条件添加到_referer_via_并对其进行分区,如果可能(很可能)
– 将site_id尽可能接近于会话的二次(超)投影中的排序顺序开始
– 将会话表设计为“yyyymmdd”列.这将启用分区修剪
– 将列“yyyymmdd”的条件添加到_referer_via_并对其进行分区,如果可能(很可能)
– 将site_id尽可能接近于会话的二次(超)投影中的排序顺序开始
接下来,相应地将referer_id和id两个表分割.
并且在集群节点中有更多的帮助.