这是我的pandas DataFrame:
import pandas as pd
import numpy as np
data = {"column1": [338,519,871,1731,2693,2963,3379,3789,3910,4109,4307,4800,4912,5111,5341,5820,6003,...],"column2": [NaN,1,NaN,2,3,NaN],...}
df = pd.DataFrame(data)
df
>>> column1 column2
0 338 NaN
1 519 1.0
2 871 1.0
3 1731 1.0
4 2693 1.0
5 2963 NaN
6 3379 NaN
7 3789 2.0
8 3910 2.0
9 4109 NaN
10 4307 NaN
11 4800 3.0
12 4912 3.0
13 5111 3.0
14 5341 3.0
15 5820 3.0
16 6003 NaN
17 .... ....
column2中的整数表示column1中的“groups”,例如:行1-4是组“1”,行7-8是组“2”,行11-15是组“3”等.
我想计算每组中第一行和最后一行之间的差异.结果数据框如下所示:
df
>>> column1 column2 column3
0 338 NaN NaN
1 519 1.0 2174
2 871 1.0 2174
3 1731 1.0 2174
4 2693 1.0 2174
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121
8 3910 2.0 121
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020
12 4912 3.0 1020
13 5111 3.0 1020
14 5341 3.0 1020
15 5820 3.0 1020
16 6003 NaN NaN
17 .... .... ...
因为:
2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020
计算column3的“熊猫方式”是什么?不知何故,必须遍历column3,寻找连续的值组,例如df.column2!=“NaN”.
编辑:我意识到我的例子可能会引导读者假设第1列中的值只是增加.实际上,有间隔,列间隔
df = pd.DataFrame(data)
df
>>> interval column1 column2
0 interval1 338 NaN
1 interval1 519 1.0
2 interval1 871 1.0
3 interval1 1731 1.0
4 interval1 2693 1.0
5 interval1 2963 NaN
6 interval1 3379 NaN
7 interval1 3789 2.0
8 interval1 3910 2.0
9 interval1 4109 NaN
10 interval1 4307 NaN
11 interval1 4800 3.0
12 interval1 4912 3.0
13 interval1 5111 3.0
14 interval1 5341 3.0
15 interval1 5820 3.0
16 interval1 6003 NaN
17 .... ....
18 interval2 12 13
19 interval2 115 13
20 interval2 275 NaN
....
最佳答案
您可以先过滤,然后在
transform
中获得第一个和最后一个值的差异:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
column1 column2 col3
0 338 NaN NaN
1 519 1.0 2174.0
2 871 1.0 2174.0
3 1731 1.0 2174.0
4 2693 1.0 2174.0
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121.0
8 3910 2.0 121.0
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020.0
12 4912 3.0 1020.0
13 5111 3.0 1020.0
14 5341 3.0 1020.0
15 5820 3.0 1020.0
16 6003 NaN NaN
EDIT1由您的新数据:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
interval column1 column2 col3
0 interval1 338 NaN NaN
1 interval1 519 1.0 2174.0
2 interval1 871 1.0 2174.0
3 interval1 1731 1.0 2174.0
4 interval1 2693 1.0 2174.0
5 interval1 2963 NaN NaN
6 interval1 3379 NaN NaN
7 interval1 3789 2.0 121.0
8 interval1 3910 2.0 121.0
9 interval1 4109 NaN NaN
10 interval1 4307 NaN NaN
11 interval1 4800 3.0 1020.0
12 interval1 4912 3.0 1020.0
13 interval1 5111 3.0 1020.0
14 interval1 5341 3.0 1020.0
15 interval1 5820 3.0 1020.0
16 interval1 6003 NaN NaN
18 interval2 12 13.0 103.0
19 interval2 115 13.0 103.0
20 interval2 275 NaN NaN