我有两个使用相同分区方案分区的表.当在用于分区的列上将它们连接在一起时,我注意到sql Server无法像人们期望的那样优化并行合并连接,因此选择使用HASH JOIN.在这种特殊情况下,我可以通过基于分区函数将查询拆分为10个不相交的范围并在SSMS中同时运行每个查询来手动模拟更加优化的并行MERGE JOIN.使用WAITFOR在完全相同的时间运行所有查询,结果是所有查询在原始并行HASH JOIN使用的总时间的约40%内完成.
在等效分区表的情况下,有没有办法让sql Server自己进行这种优化?我知道sql Server通常会产生大量开销以使MERGE JOIN并行,但似乎有一种非常自然的分片方法,在这种情况下开销最小.也许只是一个专门的案例,优化器还不够聪明才能识别?
以下是设置简化数据集以重现此问题的sql:
/* Create the first test data table */ CREATE TABLE test_transaction_properties ( transactionID INT NOT NULL IDENTITY(1,1),prop1 INT NULL,prop2 FLOAT NULL ) /* Populate table with pseudo-random data (the specific data doesn't matter too much for this example) */ ;WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ),E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),E8(N) AS (SELECT 1 FROM E4 a CROSS JOIN E4 b) INSERT INTO test_transaction_properties WITH (TABLOCK) (prop1,prop2) SELECT TOP 10000000 (ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) % 5) + 1 AS prop1,ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) * rand() AS prop2 FROM E8 /* Create the second test data table */ CREATE TABLE test_transaction_item_detail ( transactionID INT NOT NULL,productID INT NOT NULL,sales FLOAT NULL,units INT NULL ) /* Populate the second table such that each transaction has one or more items (again,the specific data doesn't matter too much for this example) */ INSERT INTO test_transaction_item_detail WITH (TABLOCK) (transactionID,productID,sales,units) SELECT t.transactionID,p.productID,100 AS sales,1 AS units FROM test_transaction_properties t JOIN ( SELECT 1 as productRank,1 as productId UNION ALL SELECT 2 as productRank,12 as productId UNION ALL SELECT 3 as productRank,123 as productId UNION ALL SELECT 4 as productRank,1234 as productId UNION ALL SELECT 5 as productRank,12345 as productId ) p ON p.productRank <= t.prop1 /* Divides the transactions evenly into 10 partitions */ CREATE PARTITION FUNCTION [pf_test_transactionId] (INT) AS RANGE RIGHT FOR VALUES (1,1000001,2000001,3000001,4000001,5000001,6000001,7000001,8000001,9000001) CREATE PARTITION SCHEME [ps_test_transactionId] AS PARTITION [pf_test_transactionId] ALL TO ( [PRIMARY] ) /* Apply the same partition scheme to both test data tables */ ALTER TABLE test_transaction_properties ADD CONSTRAINT PK_test_transaction_properties PRIMARY KEY (transactionID) ON ps_test_transactionId (transactionID) ALTER TABLE test_transaction_item_detail ADD CONSTRAINT PK_test_transaction_item_detail PRIMARY KEY (transactionID,productID) ON ps_test_transactionId (transactionID)
现在我们终于准备重现次优查询了!
/* This query produces a HASH JOIN using 20 threads without the MAXDOP hint,and the same behavior holds in that case. For simplicity here,I have limited it to 10 threads. */ SELECT COUNT(*) FROM test_transaction_item_detail i JOIN test_transaction_properties t ON t.transactionID = i.transactionID OPTION (MAXDOP 10)
但是,使用单个线程来处理每个分区(下面的第一个分区的示例)将导致更有效的计划.我通过在恰好相同的时刻为10个分区中的每个分区运行如下所示的查询来测试这个,并且所有10个在1秒钟内完成:
SELECT COUNT(*) FROM test_transaction_item_detail i INNER MERGE JOIN test_transaction_properties t ON t.transactionID = i.transactionID WHERE t.transactionID BETWEEN 1 AND 1000000 OPTION (MAXDOP 1)
解决方法
通过在每个线程上运行整个查询的多个实例,每个线程处理一个独占的数据范围,可以避免这些问题.但是,这不是优化程序本身考虑的策略.实际上,并行的原始sql Server模型在交换时中断了查询,并运行由多个线程上的这些拆分形成的计划段.
有一些方法可以在独占数据集范围内实现在多个线程上运行整个查询计划,但是它们需要不是每个人都满意的技巧(并且不会得到Microsoft支持或保证将来可以工作).一种这样的方法是迭代分区表的分区,并为每个线程提供生成小计的任务.结果是每个独立线程返回的行计数的SUM:
从元数据中获取分区号非常简单:
DECLARE @P AS TABLE ( partition_number integer PRIMARY KEY ); INSERT @P (partition_number) SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'test_transaction_properties',N'U') AND p.index_id = 1;
然后我们使用这些数字来驱动相关联接(APPLY),并使用$PARTITION函数将每个线程限制为当前分区号:
SELECT row_count = SUM(Subtotals.cnt) FROM @P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.test_transaction_item_detail AS i JOIN dbo.test_transaction_properties AS t ON t.transactionID = i.transactionID WHERE $PARTITION.pf_test_transactionId(t.transactionID) = p.partition_number AND $PARTITION.pf_test_transactionId(i.transactionID) = p.partition_number ) AS SubTotals;
查询计划显示正在为表@P中的每一行执行MERGE连接.聚簇索引扫描属性确认每次迭代仅处理单个分区:
不幸的是,这只会导致分区的顺序串行处理.在您提供的数据集上,我的4核(超线程到8)笔记本电脑在7秒内返回正确的结果,所有数据都在内存中.
为了使MERGE子计划同时运行,我们需要一个并行计划,其中分区ID分布在可用线程(MAXDOP)上,并且每个MERGE子计划使用一个分区中的数据在单个线程上运行.不幸的是,优化器经常在成本基础上决定不使用并行MERGE,并且没有记录的强制并行计划的方法.使用trace flag 8649有一种无证(并且不受支持)的方式:
SELECT row_count = SUM(Subtotals.cnt) FROM @P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.test_transaction_item_detail AS i JOIN dbo.test_transaction_properties AS t ON t.transactionID = i.transactionID WHERE $PARTITION.pf_test_transactionId(t.transactionID) = p.partition_number AND $PARTITION.pf_test_transactionId(i.transactionID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649);
现在查询计划显示来自@P的分区号在循环的基础上分布在线程中.每个线程为单个分区运行嵌套循环连接的内侧,实现了我们同时处理不相交数据的目标.现在,我的8个超级核心在3秒内返回相同的结果,所有8个都在100%利用率.
我不建议你必须使用这种技术 – 请参阅我之前的警告 – 但它确实解决了你的问题:)
看到您正在使用sql Server 2012(并假设它是Enterprise),您还可以选择使用列存储索引.这显示了批处理模式散列连接的可能性,其中有足够的内存可用:
CREATE NONCLUSTERED COLUMNSTORE INDEX cs ON dbo.test_transaction_properties (transactionID); CREATE NONCLUSTERED COLUMNSTORE INDEX cs ON dbo.test_transaction_item_detail (transactionID);
使用这些索引查询…
SELECT COUNT_BIG(*) FROM dbo.test_transaction_properties AS ttp JOIN dbo.test_transaction_item_detail AS ttid ON ttid.transactionID = ttp.transactionID;
…从优化器得到以下执行计划,没有任何欺骗:
在2秒内更正结果,但消除标量聚合的行模式处理有助于实现更多:
SELECT COUNT_BIG(*) FROM dbo.test_transaction_properties AS ttp JOIN dbo.test_transaction_item_detail AS ttid ON ttid.transactionID = ttp.transactionID GROUP BY ttp.transactionID % 1;
优化的列存储查询在851ms内运行.
保罗怀特