sql-server – 为什么LEN()函数严重低估了SQL Server 2014中的基数？

我有一个带有字符串列的表和一个检查具有一定长度的行的谓词.在sql Server 2014中,无论我检查的长度如何,我都会看到1行的估计值.这产生了非常糟糕的计划,因为实际上有数千甚至数百万行,sql Server正在选择将此表放在嵌套循环的外侧.

是否有sql Server 2014的基数估计值为1.0003,而sql Server 2012估计为31,622行？有一个很好的解决方法吗？

以下是该问题的简短复制：

-- Create a table with 1MM rows of dummy data
CREATE TABLE #customers (cust_nbr VARCHAR(10) NOT NULL)
GO

INSERT INTO #customers WITH (TABLOCK) (cust_nbr)
    SELECT TOP 1000000 
        CONVERT(VARCHAR(10),ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) AS cust_nbr
    FROM master..spt_values v1
    CROSS JOIN master..spt_values v2
GO

-- Looking for string of a certain length.
-- While both CEs yield fairly poor estimates,the 2012 CE is much
-- more conservative (higher estimate) and therefore much more likely
-- to yield an okay plan rather than a drastically understimated loop join.
-- 2012: 31,622 rows estimated,900K rows actual
-- 2014: 1 row estimated,900K rows actual
SELECT COUNT(*)
FROM #customers
WHERE LEN(cust_nbr) = 6
OPTION (QUERYTRACEON 9481) -- Optionally,use 2012 CE
GO

Here is a more complete script showing additional tests

我也读过whitepaper on the SQL Server 2014 Cardinality Estimator,但没有找到任何澄清情况的东西.

解决方法

对于遗留CE,我看到估计是行的3.16228％ – 这是用于column = literal谓词的“幻数”启发式(还有其他基于谓词构造的启发式方法 – 但LEN包裹在列的周围遗留的CE结果与此猜测框架相匹配).你可以在Joe Sack的 Selectivity Guesses in absence of Statistics和Ian Jose的 Constant-Constant Comparison Estimation的帖子上看到这个例子.

-- Legacy CE: 31622.8 rows
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  ( QUERYTRACEON 9481); -- Legacy CE
GO

现在,对于新的CE行为,它看起来现在对优化器可见(这意味着我们可以使用统计信息).我查看了下面的计算器输出,你可以查看相关的自动生成统计数据作为指针：

-- New CE: 1.00007 rows
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  ( QUERYTRACEON 2312 ); -- New CE
GO

-- View New CE behavior with 2363 (for supported option use XEvents)
SELECT  COUNT(*)
FROM    #customers
WHERE   LEN(cust_nbr) = 6
OPTION  (QUERYTRACEON 2312,QUERYTRACEON 2363,QUERYTRACEON 3604,RECOMPILE); -- New CE
GO

/*
Loaded histogram for column QCOL:
[tempdb].[dbo].[#customers].cust_nbr from stats with id 2
Using ambient cardinality 1e+006 to combine distinct counts:
  999927

Combined distinct count: 999927
Selectivity: 1.00007e-006
Stats collection generated:
  CStCollFilter(ID=2,CARD=1.00007)
      CStCollBaseTable(ID=1,CARD=1e+006 TBL: #customers)

End selectivity computation
*/

EXEC tempdb..sp_helpstats '#customers';


--Check out AVG_RANGE_ROWS values (for example - plenty of ~ 1)
DBCC SHOW_STATISTICS('tempdb..#customers','_WA_Sys_00000001_B0368087');
--That's my Stats name yours is subject to change

遗憾的是,逻辑依赖于对不同值的数量的估计,而不是针对LEN函数的影响进行调整.

可能的解决方法

通过将LEN重写为LIKE,您可以在两种CE模型下获得基于trie的估计：

SELECT COUNT_BIG(*)
FROM #customers AS C
WHERE C.cust_nbr LIKE REPLICATE('_',6);

有关使用跟踪标志的信息：

> 2363：显示大量信息,包括正在加载的统计信息.> 3604：将DBCC命令的输出打印到消息选项卡.

sql-server – 为什么LEN()函数严重低估了SQL Server 2014中的基数？

解决方法

猜你在找的MsSQL相关文章