我想从T-sql中给定数据表的每行的均值分布生成一个随机值,其中mean = 0,标准devation = 1.另外,我想设置一个种子,以确保分析的重现性.以下是不起作用的想法:
>使用具有声明数字的函数RAND()不符合此目标:为数据集的每一行生成相同的随机值.
>这样的解决方案:
SELECT ABS(CAST(CAST(NEWID()AS VARBINARY)AS INT))AS [RandomNumber]
由于不可重复,因此也不能解决问题.
编辑:
表现很重要,因为我的桌子有数亿条记录.
解决方法
这里的主要问题IMHO是怎么看到“可重复性”?或者不同的问题:什么驱动着随机性?只要数据不变,我可以设想一个解决方案,每个运行的每个记录都附加相同的随机数.但是,如果数据发生变化,预计会发生什么?
为了乐趣,我对一个(不是非常有代表性的)100万行的测试表进行了以下测试:
-- seed SELECT Rand(0) -- will show the same random number for EVERY record SELECT Number,blah = Convert(varchar(100),NewID()),random = Rand() INTO #test FROM master.dbo.fn_int_list(1,1000000) CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number) SET NOCOUNT ON GO DECLARE @start_time datetime = CURRENT_TIMESTAMP,@c_number int -- update each record (one by one) and set the random number based on 'the next Rand()' value -- => the order of the records drives the distribution of the Rand() value ! -- seed SELECT @c_number = Rand(0) -- update 1 by 1 DECLARE cursor_no_transaction CURSOR LOCAL STATIC FOR SELECT Number FROM #test ORDER BY Number OPEN cursor_no_transaction FETCH NEXT FROM cursor_no_transaction INTO @c_number WHILE @@FETCH_STATUS = 0 BEGIN UPDATE #test SET random = Rand() WHERE Number = @c_number FETCH NEXT FROM cursor_no_transaction INTO @c_number END CLOSE cursor_no_transaction DEALLOCATE cursor_no_transaction PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100),DateDiff(ms,@start_time,CURRENT_TIMESTAMP)) + ' ms.' SELECT _avg = AVG(random),_stdev = STDEV(random) FROM #test GO DECLARE @start_time datetime = CURRENT_TIMESTAMP,@c_number int BEGIN TRANSACTION -- update each record (one by one) and set the random number based on 'the next Rand()' value -- => the order of the records drives the distribution of the Rand() value ! -- seed SELECT @c_number = Rand(0) -- update 1 by 1 but all of it inside 1 single transaction DECLARE cursor_single_transaction CURSOR LOCAL STATIC FOR SELECT Number FROM #test ORDER BY Number OPEN cursor_single_transaction FETCH NEXT FROM cursor_single_transaction INTO @c_number WHILE @@FETCH_STATUS = 0 BEGIN UPDATE #test SET random = Rand() WHERE Number = @c_number FETCH NEXT FROM cursor_single_transaction INTO @c_number END CLOSE cursor_single_transaction DEALLOCATE cursor_single_transaction COMMIT TRANSACTION PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100),_stdev = STDEV(random) FROM #test GO DECLARE @start_time datetime = CURRENT_TIMESTAMP -- update each record (single operation),use the Number column to reseed the Rand() function for every record UPDATE #test SET random = Rand(Number) PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100),use 'a bunch of fields' to reseed the Rand() function for every record UPDATE #test SET random = Rand(BINARY_CHECKSUM(Number,blah)) PRINT 'Time needed Rand(BINARY_CHECKSUM(Number,blah)) : ' + Convert(nvarchar(100),_stdev = STDEV(random) FROM #test
结果或多或少是如预期的:
Time needed (no transaction) : 24570 ms. _avg _stdev ---------------------- ---------------------- 0.499630943538644 0.288686960086461 Time needed (single transaction) : 14813 ms. _avg _stdev ---------------------- ---------------------- 0.499630943538646 0.288686960086461 Time needed Rand(Number) : 1203 ms. _avg _stdev ---------------------- ---------------------- 0.499407423620328 0.291093824839539 Time needed Rand(BINARY_CHECKSUM(Number,blah)) : 1250 ms. _avg _stdev ---------------------- ---------------------- 0.499715398881586 0.288579510523627
所有这些都是“可重复的”,问题是“可重复”是指你想要的意思.我已经坚持使用AVG()和STDEV()来获取一个粗略的分布概念,我会留给你看看他们是否真的适合账单(如果不是,如何改进它)
1百万行的1.2秒对于100万行IMHO来说听起来不太好.也就是说,如果您的表格包含额外的列,它将占用更多的空间,因此需要更多的时间!
希望这能让你开始…