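I want to do some rolling-window calculations in pandas that need to process two columns at once. I'll use a simple example to state the problem clearly: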
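import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 1, 5, 4, 6, 7, 9],
    'y': [4, 9, 2],  # y is cut short in the source; it needs one value per entry of x
})
windowSize = 4
result = []
for i in range(1, len(df) + 1):
    if i < windowSize:
        # not enough rows yet to fill a window
        result.append(None)
    else:
        x = df.x.iloc[i - windowSize:i]
        y = df.y.iloc[i - windowSize:i]
        m = y.mean()
        # ratio of the x values whose y is above the window mean to the rest
        r = sum(x[y > m]) / sum(x[y <= m])
        result.append(r)
print(result)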
Is there a way to solve this in pandas without a for loop? Any help is appreciated.
Best answer
Here is a vectorized approach using NumPy tools:
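import numpy as np

windowSize = 4
a = df.values
# 2D sliding windows (one row per window) over each column
X = strided_app(a[:, 0], windowSize, 1)
Y = strided_app(a[:, 1], windowSize, 1)
M = Y.mean(1)                            # mean of y in each window
mask = Y > M[:, None]                    # True where y is above its window mean
sums = np.einsum('ij,ij->i', X, mask)    # per-window sum of x where mask is True
rest_sums = X.sum(1) - sums              # per-window sum of the remaining x
out = sums / rest_sums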
strided_app is taken from here.
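The linked definition of strided_app is not reproduced in this post. A minimal sketch of such a helper, assuming the signature strided_app(a, L, S) (window length L, step S) that the calls above imply, could look like the following. It builds the windows as a view with NumPy's as_strided, so no data is copied; the view is marked read-only because its rows overlap in memory.

```python
import numpy as np

def strided_app(a, L, S):
    """Return a 2D view of 1D array `a`: each row is a window of length L, step S.

    Sketch only: the signature is inferred from the calls in this post.
    """
    a = np.asarray(a)
    nrows = (a.size - L) // S + 1      # number of complete windows
    n = a.strides[0]                   # bytes between consecutive elements of a
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n), writeable=False
    )

x = np.array([1, 2, 3, 1, 5, 4, 6, 7, 9])
windows = strided_app(x, 4, 1)         # shape (6, 4); windows[0] is x[0:4]
```

as_strided performs no bounds checking, so the sketch assumes a.size >= L. On NumPy 1.20 or later, np.lib.stride_tricks.sliding_window_view(a, L)[::S] should give the same windows with that checking built in.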
Runtime test
Approaches:
# @kazemakase's solution
def rolling_window_sum(df, windowSize=4):
    rw = rolling_window(df.values.T, windowSize)
    m = np.mean(rw[1], axis=-1, keepdims=True)
    a = np.sum(rw[0] * (rw[1] > m), axis=-1)
    b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
    result = a / b
    return result

# Proposed in this post
def strided_einsum(df, windowSize=4):
    a = df.values
    X = strided_app(a[:, 0], windowSize, 1)
    Y = strided_app(a[:, 1], windowSize, 1)
    M = Y.mean(1)
    mask = Y > M[:, None]
    sums = np.einsum('ij,ij->i', X, mask)
    rest_sums = X.sum(1) - sums
    out = sums / rest_sums
    return out
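rolling_window comes from @kazemakase's answer and is not defined in this post. As a stand-in (an assumption, not that answer's code), a helper with the same call shape can be sketched on top of NumPy's sliding_window_view (NumPy 1.20+), which windows the last axis:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_window(a, window):
    """Stand-in helper: windows of length `window` along the last axis.

    Input of shape (..., n) gives a view of shape (..., n - window + 1, window).
    """
    return sliding_window_view(a, window, axis=-1)

# Made-up values, for illustration only
vals = np.array([[1, 2, 3, 1, 5],      # plays the role of column x
                 [4, 9, 2, 7, 3]])     # plays the role of column y
rw = rolling_window(vals, 4)           # rw[0]: windows over x, rw[1]: windows over y
```

With df.values.T as input, rw[0] and rw[1] are then the per-column window matrices that rolling_window_sum indexes.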
Timings:
In [46]: df = pd.DataFrame(np.random.randint(0,(1000000,2)))
In [47]: %timeit rolling_window_sum(df)
10 loops, best of 3: 90.4 ms per loop
In [48]: %timeit strided_einsum(df)
10 loops, best of 3: 62.2 ms per loop
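For some more performance, we can compute the Y.mean(1) part, which is essentially a sliding-window average, with SciPy's 1D uniform filter. For windowSize = 4, M can therefore be computed instead as: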
from scipy.ndimage import uniform_filter1d as unif1d

# [2:-1] keeps only the outputs whose size-4 window lies fully inside the data
M = unif1d(a[:, 1].astype(float), windowSize)[2:-1]
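The function strided_einsum_unif_filter that appears in the timings is not listed in the post. A plausible sketch, assuming it is strided_einsum with only the mean computation swapped out, is given here. The helper name windows is hypothetical and uses sliding_window_view in place of strided_app so that the block runs on its own. The slice is generalized from [2:-1] under the assumption that SciPy's default centering places the filter window at offsets -(size//2) through size - size//2 - 1 around each position.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
from scipy.ndimage import uniform_filter1d as unif1d

def windows(a, size):
    # Hypothetical stand-in for strided_app(a, size, 1)
    return sliding_window_view(a, size)

def strided_einsum_unif_filter(df, windowSize=4):
    a = df.values
    n = len(a)
    X = windows(a[:, 0], windowSize)
    Y = windows(a[:, 1], windowSize)
    # The first output whose window lies fully inside the data sits at index
    # windowSize // 2; there are n - windowSize + 1 such outputs in total.
    start = windowSize // 2
    M = unif1d(a[:, 1].astype(float), windowSize)[start:start + n - windowSize + 1]
    mask = Y > M[:, None]
    sums = np.einsum('ij,ij->i', X, mask)
    rest_sums = X.sum(1) - sums
    return sums / rest_sums

# Made-up data, for illustration only
demo = pd.DataFrame({'x': [1, 2, 3, 1, 5, 4], 'y': [4, 9, 2, 7, 3, 8]})
out = strided_einsum_unif_filter(demo)
```

Means from the filter can differ from Y.mean(1) in the last floating-point digits, so entries of y that lie exactly on a window mean may end up on the other side of the comparison.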
The performance gain is noticeable:
In [65]: %timeit strided_einsum(df)
10 loops, best of 3: 61.5 ms per loop
In [66]: %timeit strided_einsum_unif_filter(df)
10 loops, best of 3: 49.4 ms per loop