Sample DF:
ID Name Price Condition Fit_Test
1 Apple 10 Good Super_Fit
2 Apple 10 OK Super_Fit
3 Apple 10 Bad Super_Fit
4 Orange 12 Good Not_Fit
5 Orange 12 OK Not_Fit
6 Banana 15 OK Medium_Fit
7 Banana 15 Bad Medium_Fit
8 Pineapple 25 OK Medium_Fit
9 Pineapple 25 OK Medium_Fit
10 Cherry 30 Bad Medium_Fit
Expected DF:
ID Name Price Condition Fit_Test
1 Apple 10 Good Super_Fit
2 Apple 10 OK Super_Fit
3 Apple 10 Bad Super_Fit
4 Orange 12 Good Not_Fit
6 Banana 15 OK Medium_Fit
8 Pineapple 25 OK Medium_Fit
9 Pineapple 25 OK Medium_Fit
10 Cherry 30 Bad Medium_Fit
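Problem statement:
I want to group by Name and Price, then filter the rows based on Condition.
>If all three conditions Good, Bad and OK exist within a Name and Price group, keep only Good, provided Fit_Test is not Super_Fit
>If the conditions Good and OK exist within a Name and Price group, keep only Good (IDs 4, 5: only ID 4 is expected), provided Fit_Test is not Super_Fit
>If the conditions Bad and OK exist within a Name and Price group, keep only OK (IDs 6, 7: only ID 6 is expected), provided Fit_Test is not Super_Fit
>If a Name and Price group contains only one distinct condition (OK and OK, Good and Good, or Bad and Bad), do nothing and keep the rows as they are (IDs 8, 9, 10: IDs 8, 9, 10 are expected), provided Fit_Test is not Super_Fit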
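One way to read the rules above is as a per-group decision: look at the set of Condition values in each (Name, Price) group and pick the single value whose rows survive. The sketch below is only an illustration of that reading, not code from the question. The helper name condition_to_keep is hypothetical, and dropping every row of a group that fits none of the stated rules (for example a group with only Good and Bad) is an assumption, since the rules do not cover that case.

```python
import pandas as pd

# Sample DF from the question.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 2 + ["Medium_Fit"] * 5,
})


def condition_to_keep(conditions):
    """Hypothetical helper: the Condition value to keep for one group."""
    present = set(conditions)
    if len(present) == 1:
        return next(iter(present))   # one distinct value: keep every row
    if {"Good", "OK"} <= present:
        return "Good"                # Good + OK, with or without Bad
    if present == {"Bad", "OK"}:
        return "OK"
    return None                      # assumption: uncovered case keeps nothing


keys = ["Name", "Price"]
# One decision per group, then joined back onto every row of that group.
decision = df.groupby(keys)["Condition"].agg(condition_to_keep).rename("keep")
keep = df.join(decision, on=keys)["keep"]

# Rows with Fit_Test == Super_Fit are exempt from the filter.
result = df[(df["Condition"] == keep) | (df["Fit_Test"] == "Super_Fit")]
print(result["ID"].tolist())  # [1, 2, 3, 4, 6, 8, 9, 10]
```

The printed IDs match the Expected DF above.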
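Update on the answers
>The first answer and its "Edit for testing" work for any df that does not involve the Fit_Test column condition. With that answer, rows 2 and 3 are also absent from the result, as the answer's output shows.
>The "Edit for update" answer applies when the additional Fit_Test column is involved, so that the filter affects only rows whose value is not Super_Fit.
In both solutions, the row filtering based on the Condition column and the two-column grouping is the same.
I found material on groupby with a filter on numeric columns, but nothing for string columns.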
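Best answer
The idea is to build, for each row, the set of Condition values in its group, and then compare against those sets: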
a = df.join(df.groupby(['Price','Name'])['Condition'].apply(set).rename('m'),on=['Price','Name'])['m']
print (a)
0 {Bad, Good, OK}
1 {Bad, Good, OK}
2 {Bad, Good, OK}
3 {Good, OK}
4 {Good, OK}
5 {Bad, OK}
6 {Bad, OK}
7 {OK}
8 {OK}
9 {Bad}
Name: m, dtype: object
#groups where Good wins: {Bad, Good, OK} or {Good, OK}
m1 = (a == {'Bad','Good','OK'}) | (a == {'Good','OK'})
#groups where OK wins: {Bad, OK}
m2 = a == {'Bad','OK'}
#check if unique value - length of set is 1
m3 = a.str.len() == 1
m4 = df['Condition'] == 'Good'
m5 = df['Condition'] == 'OK'
df = df[(m1 & m4) | (m2 & m5) | m3]
print (df)
ID Name Price Condition
0 1 Apple 10 Good
3 4 Orange 12 Good
5 6 Banana 15 OK
7 8 Pineapple 25 OK
8 9 Pineapple 25 OK
9 10 Cherry 30 Bad
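The same filter can also be sketched without building sets, by broadcasting per-group boolean flags with groupby(...).transform. This is an alternative illustration rather than the answer's method; the names is_good, has_good, single and result are hypothetical, and the sketch assumes Condition only ever takes the values Good, OK and Bad.

```python
import pandas as pd

# Sample DF from the question (Fit_Test is not needed at this stage).
df = pd.DataFrame({
    "ID": range(1, 11),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
})

keys = [df["Name"], df["Price"]]

# Row-level flags.
is_good = df["Condition"].eq("Good")
is_ok = df["Condition"].eq("OK")
is_bad = df["Condition"].eq("Bad")

# Group-level flags, broadcast back to every row of the group.
has_good = is_good.groupby(keys).transform("any")
has_ok = is_ok.groupby(keys).transform("any")
has_bad = is_bad.groupby(keys).transform("any")
single = df.groupby(["Name", "Price"])["Condition"].transform("nunique").eq(1)

m1 = has_good & has_ok               # {Good, OK} or {Bad, Good, OK}
m2 = has_bad & has_ok & ~has_good    # exactly {Bad, OK}

result = df[(m1 & is_good) | (m2 & is_ok) | single]
print(result["ID"].tolist())  # [1, 4, 6, 8, 9, 10]
```

Because every mask shares the index of df, the masks combine with & and | in the same way as m1 to m5 above.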
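Edit for testing:
To inspect the intermediate masks, use assign on the original (unfiltered) df:
m = (m1 & m4) | (m2 & m5) | m3
print (df.assign(sets=a,m1 = m1,m2=m2,m3=m3,m4=m4,m5=m5,m=m))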
ID Name Price Condition sets m1 m2 m3 \
0 1 Apple 10 Good {Bad, Good, OK} True False False
1 2 Apple 10 OK {Bad, Good, OK} True False False
2 3 Apple 10 Bad {Bad, Good, OK} True False False
3 4 Orange 12 Good {Good, OK} True False False
4 5 Orange 12 OK {Good, OK} True False False
5 6 Banana 15 OK {Bad, OK} False True False
6 7 Banana 15 Bad {Bad, OK} False True False
7 8 Pineapple 25 OK {OK} False False True
8 9 Pineapple 25 OK {OK} False False True
9 10 Cherry 30 Bad {Bad} False False True
m4 m5 m
0 True False True
1 False True False
2 False False False
3 True False True
4 False True False
5 False True True
6 False False False
7 False True True
8 False True True
9 False False True
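Edit for update:
For the new condition, apply the masks to the original (unfiltered) df and additionally keep every row where Fit_Test is Super_Fit: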
m6 = df['Fit_Test'] == 'Super_Fit'
df = df[((m1 & m4) | (m2 & m5) | m3) | m6]
print (df)
ID Name Price Condition Fit_Test
0 1 Apple 10 Good Super_Fit
1 2 Apple 10 OK Super_Fit
2 3 Apple 10 Bad Super_Fit
3 4 Orange 12 Good Not_Fit
5 6 Banana 15 OK Medium_Fit
7 8 Pineapple 25 OK Medium_Fit
8 9 Pineapple 25 OK Medium_Fit
9 10 Cherry 30 Bad Medium_Fit
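For convenience, the pieces of this answer can be collected into one self-contained script and checked against the Expected DF. This is a sketch under two assumptions: the filtered frame is stored under the hypothetical name filtered so that df stays intact, and the set comparisons are written with Series.map instead of ==, purely to make the element-wise comparison explicit.

```python
import pandas as pd

# Sample DF from the question.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 2 + ["Medium_Fit"] * 5,
})

keys = ["Price", "Name"]
# One set of Condition values per group, joined back onto each row.
sets = df.groupby(keys)["Condition"].apply(set).rename("m")
a = df.join(sets, on=keys)["m"]

m1 = a.map(lambda s: s in ({"Bad", "Good", "OK"}, {"Good", "OK"}))
m2 = a.map(lambda s: s == {"Bad", "OK"})
m3 = a.map(len) == 1                  # group has a single distinct Condition
m4 = df["Condition"] == "Good"
m5 = df["Condition"] == "OK"
m6 = df["Fit_Test"] == "Super_Fit"    # these rows are always kept

filtered = df[(m1 & m4) | (m2 & m5) | m3 | m6]
print(filtered["ID"].tolist())  # [1, 2, 3, 4, 6, 8, 9, 10]
```

Keeping df unchanged means the masks, which are indexed like the original df, can be reused for further experiments.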