I have the following pandas DataFrame:
- import pandas as pd
- import numpy as np
- df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})
- >>> df
- first_column
- 0 0
- 1 0
- 2 0
- 3 1
- 4 1
- 5 1
- 6 0
- 7 0
- 8 1
- 9 1
- 10 0
- 11 0
- 12 0
- 13 0
- 14 1
- 15 1
- 16 1
- 17 1
- 18 1
- 19 0
- 20 0
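first_column is a binary column of 0s and 1s. The 1s appear in consecutive "clusters", and each cluster always contains at least two rows.
My goal is to create a column that "counts" the number of rows in each cluster: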
- >>> df
- first_column counts
- 0 0 0
- 1 0 0
- 2 0 0
- 3 1 3
- 4 1 3
- 5 1 3
- 6 0 0
- 7 0 0
- 8 1 2
- 9 1 2
- 10 0 0
- 11 0 0
- 12 0 0
- 13 0 0
- 14 1 5
- 15 1 5
- 16 1 5
- 17 1 5
- 18 1 5
- 19 0 0
- 20 0 0
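This sounds like a job for df.loc(), e.g. df.loc[df.first_column == 1] ... something.
I'm just not sure how to treat each "cluster" separately, or how to label every row of a cluster with that cluster's row count.
How would one do this?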
Solution
Here's one approach using NumPy's cumsum and bincount:
- def cumsum_bincount(a):
-     # Prepend 0 & look for a [0,1] pattern. Form a binned array based off 1s groups
-     ids = a*(np.diff(np.r_[0,a])==1).cumsum()
-     # Get the bincount, index into the count with ids and finally mask out 0s
-     return a*np.bincount(ids)[ids]
Sample run:
- In [88]: df['counts'] = cumsum_bincount(df.first_column.values)
- In [89]: df
- Out[89]:
- first_column counts
- 0 0 0
- 1 0 0
- 2 0 0
- 3 1 3
- 4 1 3
- 5 1 3
- 6 0 0
- 7 0 0
- 8 1 2
- 9 1 2
- 10 0 0
- 11 0 0
- 12 0 0
- 13 0 0
- 14 1 5
- 15 1 5
- 16 1 5
- 17 1 5
- 18 1 5
- 19 0 0
- 20 0 0
Set the first six elements to 1 (the assignment covers indices 0 to 4; index 5 is already 1), then test again:
- In [101]: df.first_column.values[:5] = 1
- In [102]: df['counts'] = cumsum_bincount(df.first_column.values)
- In [103]: df
- Out[103]:
- first_column counts
- 0 1 6
- 1 1 6
- 2 1 6
- 3 1 6
- 4 1 6
- 5 1 6
- 6 0 0
- 7 0 0
- 8 1 2
- 9 1 2
- 10 0 0
- 11 0 0
- 12 0 0
- 13 0 0
- 14 1 5
- 15 1 5
- 16 1 5
- 17 1 5
- 18 1 5
- 19 0 0
- 20 0 0
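For comparison, the same counts can probably be obtained with pandas alone. The sketch below is not part of the answer above; it swaps in a different technique (`shift`, `cumsum` and `groupby(...).transform`) and uses the original data from the question. The name `run_id` is introduced here for illustration.

```python
import pandas as pd

# Original data from the question
df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
                                    0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})

s = df["first_column"]

# A new run starts wherever the value differs from the previous row,
# so the running total of those change points gives one label per run
run_id = (s != s.shift()).cumsum()

# Broadcast each run's size back to its rows, then multiply by s
# so that runs of 0s end up with a count of 0
df["counts"] = s.groupby(run_id).transform("size") * s
```

This version labels runs of 0s as well as runs of 1s before masking, so it does a little more work than the NumPy function, which is likely the faster choice on large arrays.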