sum(count (distinct (customers)))
如果我单独运行计数,结果将是:
Day | CountCustomers ---------------------- 5/1 | 1 5/2 | 0 5/3 | 5
我需要输出总和为:
Day | RunningTotalCustomers ---------------------- 5/1 | 1 5/2 | 1 5/3 | 6
我在使用coalesce方法之前已经完成了运行总计,但从未计算过.我现在不知道如何做到这一点.
解决方法
CREATE TABLE dbo.Hits(Day SMALLDATETIME,CustomerID INT); CREATE CLUSTERED INDEX x ON dbo.Hits([Day]); INSERT dbo.Hits SELECT TOP (5000) DATEADD(DAY,r,'20120501'),COALESCE(ASCII(SUBSTRING(name,s,1)),86) FROM (SELECT name,r = ROW_NUMBER() OVER (ORDER BY name)/10,s = CONVERT(INT,RIGHT(CONVERT(VARCHAR(20),[object_id]),1)) FROM sys.all_columns) AS x; SELECT Earliest_Day = MIN([Day]),Latest_Day = MAX([Day]),Unique_Days = DATEDIFF(DAY,MIN([Day]),MAX([Day])) + 1,Total_Rows = COUNT(*) FROM dbo.Hits;
结果:
Earliest_Day Latest_Day Unique_Days Total_Days ------------------- ------------------- ----------- ---------- 2012-05-01 00:00:00 2013-09-13 00:00:00 501 5000
数据看起来像这样(5000行) – 但在您的系统上看起来会略有不同,具体取决于版本和构建#:
Day CustomerID ------------------- --- 2012-05-01 00:00:00 95 2012-05-01 00:00:00 97 2012-05-01 00:00:00 97 2012-05-01 00:00:00 117 2012-05-01 00:00:00 100 ... 2012-05-02 00:00:00 110 2012-05-02 00:00:00 110 2012-05-02 00:00:00 95 ...
并且运行总计结果应该如下所示(501行):
Day c rt ------------------- -- -- 2012-05-01 00:00:00 6 6 2012-05-02 00:00:00 5 11 2012-05-03 00:00:00 4 15 2012-05-04 00:00:00 7 22 2012-05-05 00:00:00 6 28 ...
所以我要比较的方法是:
>“自我加入” – 基于集合的纯粹主义方法
>“带日期的递归CTE” – 这取决于连续的日期(无间隙)
>“带有row_number的递归CTE” – 类似于上面但更慢,依赖于ROW_NUMBER
>“带#temp表的递归CTE” – 根据建议从Mikael的回答中窃取
>“quirky update”虽然不受支持且没有前途定义的行为,但似乎很受欢迎
>“光标”
> sql Server 2012使用新的窗口功能
自联接
这是人们在警告你远离游标时告诉你这样做的方式,因为“基于集合总是更快”.在最近的一些实验中,我发现光标超越了这个解决方案.
;WITH g AS ( SELECT [Day],c = COUNT(DISTINCT CustomerID) FROM dbo.Hits GROUP BY [Day] ) SELECT g.[Day],g.c,rt = SUM(g2.c) FROM g INNER JOIN g AS g2 ON g.[Day] >= g2.[Day] GROUP BY g.[Day],g.c ORDER BY g.[Day];
带日期的递归cte
提醒 – 这依赖于连续日期(无间隙),最多10000级递归,并且您知道您感兴趣的范围的开始日期(设置锚点).当然,您可以使用子查询动态设置锚点,但我想保持简单.
;WITH g AS ( SELECT [Day],c = COUNT(DISTINCT CustomerID) FROM dbo.Hits GROUP BY [Day] ),x AS ( SELECT [Day],c,rt = c FROM g WHERE [Day] = '20120501' UNION ALL SELECT g.[Day],x.rt + g.c FROM x INNER JOIN g ON g.[Day] = DATEADD(DAY,1,x.[Day]) ) SELECT [Day],rt FROM x ORDER BY [Day] OPTION (MAXRECURSION 10000);
与row_number的递归cte
Row_number计算在这里略贵.同样,这支持最大递归级别10000,但您不需要分配锚点.
;WITH g AS ( SELECT [Day],rn = ROW_NUMBER() OVER (ORDER BY DAY),rn,rt = c FROM g WHERE rn = 1 UNION ALL SELECT g.[Day],g.rn,x.rt + g.c FROM x INNER JOIN g ON g.rn = x.rn + 1 ) SELECT [Day],rt FROM x ORDER BY [Day] OPTION (MAXRECURSION 10000);
带临时表的递归cte
正如建议的那样,从Mikael的答案中窃取,将其包含在测试中.
CREATE TABLE #Hits ( rn INT PRIMARY KEY,c INT,[Day] SMALLDATETIME ); INSERT INTO #Hits (rn,Day) SELECT ROW_NUMBER() OVER (ORDER BY DAY),COUNT(DISTINCT CustomerID),[Day] FROM dbo.Hits GROUP BY [Day]; WITH x AS ( SELECT [Day],rt = c FROM #Hits as c WHERE rn = 1 UNION ALL SELECT g.[Day],x.rt + g.c FROM x INNER JOIN #Hits as g ON g.rn = x.rn + 1 ) SELECT [Day],rt FROM x ORDER BY [Day] OPTION (MAXRECURSION 10000); DROP TABLE #Hits;
古怪的更新
我再一次只是为了完整而包括这个;我个人不会依赖这个解决方案,因为正如我在另一个答案中提到的,这种方法根本不能保证工作,并且可能在sql Server的未来版本中完全破坏. (我正在尽力强迫sql Server遵守我想要的顺序,使用提示进行索引选择.)
CREATE TABLE #x([Day] SMALLDATETIME,rt INT); CREATE UNIQUE CLUSTERED INDEX x ON #x([Day]); INSERT #x([Day],c) SELECT [Day],c = COUNT(DISTINCT CustomerID) FROM dbo.Hits GROUP BY [Day] ORDER BY [Day]; DECLARE @rt1 INT; SET @rt1 = 0; UPDATE #x SET @rt1 = rt = @rt1 + c FROM #x WITH (INDEX = x); SELECT [Day],rt FROM #x ORDER BY [Day]; DROP TABLE #x;
光标
“小心,这里有游标!游标是邪恶的!你应该不惜一切代价避免使用游标!”不,那不是我说话,这只是我听到的很多东西.与流行的观点相反,在某些情况下游标是合适的.
CREATE TABLE #x2([Day] SMALLDATETIME,rt INT); CREATE UNIQUE CLUSTERED INDEX x ON #x2([Day]); INSERT #x2([Day],COUNT(DISTINCT CustomerID) FROM dbo.Hits GROUP BY [Day] ORDER BY [Day]; DECLARE @rt2 INT,@d SMALLDATETIME,@c INT; SET @rt2 = 0; DECLARE c CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR SELECT [Day],c FROM #x2 ORDER BY [Day]; OPEN c; FETCH NEXT FROM c INTO @d,@c; WHILE @@FETCH_STATUS = 0 BEGIN SET @rt2 = @rt2 + @c; UPDATE #x2 SET rt = @rt2 WHERE [Day] = @d; FETCH NEXT FROM c INTO @d,@c; END SELECT [Day],rt FROM #x2 ORDER BY [Day]; DROP TABLE #x2;
sql Server 2012
如果您使用的是最新版本的sql Server,对窗口功能的增强使我们可以轻松计算运行总计而无需自加入的指数成本(SUM在一次通过中计算),CTE的复杂性(包括要求)对于表现更好的CTE的连续行,不支持的古怪更新和禁止的游标.只是要警惕使用RANGE和ROWS之间的区别,或者根本不指定 – 只有ROWS避免了磁盘上的磁盘,否则会严重影响性能.
;WITH g AS ( SELECT [Day],rt = SUM(c) OVER (ORDER BY [Day] ROWS UNBOUNDED PRECEDING) FROM g ORDER BY g.[Day];
性能比较
SELECT SYSUTCDATETIME(); GO DBCC DROPCLEANBUFFERS;DBCC FREEPROCCACHE; -- query here GO 10 SELECT SYSUTCDATETIME();
以下是总持续时间的结果,以毫秒为单位(请记住,这也包括每次DBCC命令):
method run 1 run 2 ----------------------------- -------- -------- self-join 1296 ms 1357 ms -- "supported" non-sql 2012 winner recursive cte with dates 1655 ms 1516 ms recursive cte with row_number 19747 ms 19630 ms recursive cte with #temp table 1624 ms 1329 ms quirky update 880 ms 1030 ms -- non-sql 2012 winner cursor 1962 ms 1850 ms sql Server 2012 847 ms 917 ms -- winner if sql 2012 available
我没有DBCC命令就重新做了:
method run 1 run 2 ----------------------------- -------- -------- self-join 1272 ms 1309 ms -- "supported" non-sql 2012 winner recursive cte with dates 1247 ms 1593 ms recursive cte with row_number 18646 ms 18803 ms recursive cte with #temp table 1340 ms 1564 ms quirky update 1024 ms 1116 ms -- non-sql 2012 winner cursor 1969 ms 1835 ms sql Server 2012 600 ms 569 ms -- winner if sql 2012 available
删除DBCC和循环,只测量一个原始迭代:
method run 1 run 2 ----------------------------- -------- -------- self-join 313 ms 242 ms recursive cte with dates 217 ms 217 ms recursive cte with row_number 2114 ms 1976 ms recursive cte with #temp table 83 ms 116 ms -- "supported" non-sql 2012 winner quirky update 86 ms 85 ms -- non-sql 2012 winner cursor 1060 ms 983 ms sql Server 2012 68 ms 40 ms -- winner if sql 2012 available
最后,我将源表中的行数乘以10(将top更改为50000并将另一个表添加为交叉连接).结果,一次迭代没有DBCC命令(只是为了时间):
method run 1 run 2 ----------------------------- -------- -------- self-join 2401 ms 2520 ms recursive cte with dates 442 ms 473 ms recursive cte with row_number 144548 ms 147716 ms recursive cte with #temp table 245 ms 236 ms -- "supported" non-sql 2012 winner quirky update 150 ms 148 ms -- non-sql 2012 winner cursor 1453 ms 1395 ms sql Server 2012 131 ms 133 ms -- winner
我只测量了持续时间 – 我将把它作为练习留给读者来比较这些方法的数据,比较其他可能很重要的指标(或者可能随其架构/数据而变化).在从这个答案中得出任何结论之前,你应该根据你的数据和模式对它进行测试……这些结果几乎肯定会随着行数的增加而改变.
演示
结论
在我的测试中,选择是:
> sql Server 2012方法,如果我有sql Server 2012可用.
>如果sql Server 2012不可用,并且我的日期是连续的,我会使用带有日期方法的递归cte.
>如果1.和2.都不适用,即使表现接近,我也会选择自我加入而不是表现,因为行为是记录在案并保证的.我不太担心未来的兼容性,因为希望如果古怪的更新中断它将在我已经将我的所有代码转换为1. :-)之后
但是,您应该再次针对您的架构和数据进行测试.由于这是一项行人数相对较低的人为测试,因此它可能是一种放屁.我已经用不同的模式和行数进行了其他测试,并且性能启发式方法完全不同……这就是为什么我对原始问题提出了如此多的后续问题.
UPDATE
我在这里写了更多关于此的博文:
Best approaches for running totals – updated for SQL Server 2012