然而,在最近的几个场景中,我测量了使用计数时的性能提升.
模式如下:
LEFT JOIN ( SELECT someID,COUNT(*) FROM soMetable GROUP BY someID ) AS Alias ON ( Alias.someID = mainTable.ID )
我不熟悉告诉sql Server“内部”发生了什么的方法,所以我想知道是否存在一个带有EXISTS的无法解决的缺陷,这对我已经完成的测量非常有意义(可以说是RBAR吗?!).
你对这种现象有一些解释吗?
编辑:
这是您可以运行的完整脚本:
SET NOCOUNT ON SET STATISTICS IO OFF DECLARE @tmp1 TABLE ( ID INT UNIQUE ) DECLARE @tmp2 TABLE ( ID INT,X INT IDENTITY,UNIQUE (ID,X) ) ; WITH T(n) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM master.dbo.spt_values AS S ),tally(n) AS ( SELECT T2.n * 100 + T1.n FROM T AS T1 CROSS JOIN T AS T2 WHERE T1.n <= 100 AND T2.n <= 100 ) INSERT @tmp1 SELECT n FROM tally AS T1 WHERE n < 10000 ; WITH T(n) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM master.dbo.spt_values AS S ),tally(n) AS ( SELECT T2.n * 100 + T1.n FROM T AS T1 CROSS JOIN T AS T2 WHERE T1.n <= 100 AND T2.n <= 100 ) INSERT @tmp2 SELECT T1.n FROM tally AS T1 CROSS JOIN T AS T2 WHERE T1.n < 10000 AND T1.n % 3 <> 0 AND T2.n < 1 + T1.n % 15 PRINT ' COUNT Version: ' WAITFOR DELAY '00:00:01' SET STATISTICS IO ON SET STATISTICS TIME ON SELECT T1.ID,CASE WHEN n > 0 THEN 1 ELSE 0 END AS DoesExist FROM @tmp1 AS T1 LEFT JOIN ( SELECT T2.ID,COUNT(*) AS n FROM @tmp2 AS T2 GROUP BY T2.ID ) AS T2 ON ( T2.ID = T1.ID ) WHERE T1.ID BETWEEN 5000 AND 7000 OPTION (RECOMPILE) -- required since table are filled within the same scope SET STATISTICS TIME OFF PRINT ' EXISTS Version:' WAITFOR DELAY '00:00:01' SET STATISTICS TIME ON SELECT T1.ID,CASE WHEN EXISTS ( SELECT 1 FROM @tmp2 AS T2 WHERE T2.ID = T1.ID ) THEN 1 ELSE 0 END AS DoesExist FROM @tmp1 AS T1 WHERE T1.ID BETWEEN 5000 AND 7000 OPTION (RECOMPILE) -- required since table are filled within the same scope SET STATISTICS TIME OFF
在sql Server 2008R2(七个64位)上我得到了这个结果
COUNT版本:
Table ‘#455F344D’. Scan count 1,logical reads 8,physical reads 0,read-ahead reads 0,lob logical reads 0,lob physical reads 0,lob read-ahead reads 0.
Table ‘#492FC531’. Scan count 1,logical reads 30,lob read-ahead reads 0.sql Server Execution Times:
cpu time = 0 ms,elapsed time = 81 ms.
EXISTS版本:
Table ‘#492FC531’. Scan count 1,logical reads 96,lob read-ahead reads 0.
Table ‘#455F344D’. Scan count 1,elapsed time = 76 ms.
解决方法
I’ve often read when one had to check existence of a row should always be done with EXISTS instead of with a COUNT.
任何事情都是非常罕见的,特别是涉及到数据库时.在sql中有许多表达相同语义的方法.如果有一个有用的经验法则,可能是使用最自然的语法编写查询(并且,是的,这是主观的),并且只有在您获得的查询计划或性能不可接受时才考虑重写.
对于它的价值,我自己对这个问题的看法是存在查询最自然地使用EXISTS来表达.这也是我的经验,EXISTS tends to optimize better比OUTER JOIN拒绝NULL替代.使用COUNT(*)和过滤= 0是另一种选择,恰好在sql Server查询优化器中有一些支持,但我个人发现这在更复杂的查询中是不可靠的.无论如何,对于我来说,EXISTS似乎比任何一种替代品更自然.
I was wondering if there was a unheralded flaw with EXISTS that gave perfectly sense to the measurements I’ve done
您的特定示例很有趣,因为它突出了优化程序处理CASE表达式(特别是EXISTS测试)中的子查询的方式.
CASE表达式中的子查询
考虑以下(完全合法的)查询:
DECLARE @Base AS TABLE (a integer NULL); DECLARE @When AS TABLE (b integer NULL); DECLARE @Then AS TABLE (c integer NULL); DECLARE @Else AS TABLE (d integer NULL); SELECT CASE WHEN (SELECT W.b FROM @When AS W) = 1 THEN (SELECT T.c FROM @Then AS T) ELSE (SELECT E.d FROM @Else AS E) END FROM @Base AS B;
semantics of CASE
是WHEN / ELSE条款以文本顺序进行评估generally.在上面的查询中,如果满足WHEN子句,如果ELSE子查询返回多行,则sql Server返回错误是不正确的.为了尊重这些语义,优化器会生成一个使用传递谓词的计划:
只有当传递谓词返回false时,才会计算嵌套循环连接的内侧.总体效果是CASE表达式按顺序进行测试,并且仅在没有满足先前表达式的情况下才评估子查询.
带有EXISTS子查询的CASE表达式
在CASE子查询使用EXISTS的情况下,逻辑存在测试实现为半连接,但是在后面的子句需要时,必须保留通常被半连接拒绝的行.流经这种特殊类型的半连接的行获取一个标志,以指示半连接是否找到匹配.此标志称为探测列.
实现的细节是逻辑子查询被相关联接(‘apply’)替换为探测列.该工作由查询优化器中的简化规则执行,该规则称为RemoveSubqInPrj(在投影中删除子查询).我们可以使用跟踪标志8606查看详细信息:
SELECT T1.ID,CASE WHEN EXISTS ( SELECT 1 FROM #T2 AS T2 WHERE T2.ID = T1.ID ) THEN 1 ELSE 0 END AS DoesExist FROM #T1 AS T1 WHERE T1.ID BETWEEN 5000 AND 7000 OPTION (QUERYTRACEON 3604,QUERYTRACEON 8606);
显示EXISTS测试的输入树的一部分如下所示:
ScaOp_Exists logop_Project logop_Select logop_Get TBL: #T2 ScaOp_Comp x_cmpEq ScaOp_Identifier [T2].ID ScaOp_Identifier [T1].ID
这由RemoveSubqInPrj转换为以下结构:
logop_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)
这是左半连接应用前面描述的探针.此初始转换是迄今为止sql Server查询优化器中唯一可用的转换,如果禁用此转换,编译将失败.
此查询的可能执行计划形状之一是该逻辑结构的直接实现:
最终的Compute Scalar使用探测列值评估CASE表达式的结果:
当优化考虑半连接的其他物理连接类型时,将保留计划树的基本形状.只有合并连接支持探测列,因此不考虑逻辑上可能的散列半连接:
请注意,合并输出一个标记为Expr1008的表达式(名称与之前的名称相同是巧合),但计划中的任何运算符都没有定义.这只是探测列.和以前一样,最终的Compute Scalar使用此探测值来评估CASE.
问题是优化器没有完全探索只有合并(或散列)半连接才有用的替代方案.在嵌套循环计划中,检查T2中的行是否与每次迭代的范围匹配没有任何好处.使用合并或哈希计划,这可能是一个有用的优化.
如果我们在查询中向T2添加匹配的BETWEEN谓词,那么所有发生的事情都是对每一行执行此检查作为合并半连接上的残差(很难在执行计划中找到,但它就在那里):
SELECT T1.ID,CASE WHEN EXISTS ( SELECT 1 FROM #T2 AS T2 WHERE T2.ID = T1.ID AND T2.ID BETWEEN 5000 AND 7000 -- New ) THEN 1 ELSE 0 END AS DoesExist FROM #T1 AS T1 WHERE T1.ID BETWEEN 5000 AND 7000;
我们希望BETWEEN谓词会被推迟到T2导致搜索.通常,优化器会考虑这样做(即使查询中没有额外的谓词).它识别隐含的谓词(T1上的BETWEEN和T1和T2之间的连接谓词一起暗示T2上的BETWEEN),而原始查询文本中不存在它们.不幸的是,apply-probe模式意味着没有探索.
有一些方法可以编写查询以在合并半连接的两个输入上产生搜索.一种方法是以非常不自然的方式编写查询(打败我通常更喜欢EXISTS的原因):
WITH T2 AS ( SELECT TOP (9223372036854775807) * FROM #T2 AS T2 WHERE ID BETWEEN 5000 AND 7000 ) SELECT T1.ID,DoesExist = CASE WHEN EXISTS ( SELECT * FROM T2 WHERE T2.ID = T1.ID ) THEN 1 ELSE 0 END FROM #T1 AS T1 WHERE T1.ID BETWEEN 5000 AND 7000;
我不满意在生产环境中编写该查询,只是为了证明所需的计划形状是可能的.如果您需要编写的真实查询以这种特定方式使用CASE,并且性能受到合并半连接的探测端没有搜索的影响,您可以考虑使用产生正确结果的不同语法编写查询更有效的执行计划.