Suppose I have a DataFrame df as follows:
index  company  url                  address
0      A        www.abc.contact.com  16D Bayberry Rd,New Bedford,MA,02740,USA
1      A        www.abc.contact.com  MA,USA
2      A        www.abc.about.com    USA
3      B        www.pqr.com          New Bedford,USA
4      B        www.pqr.com/about    MA,USA
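I want to drop every row from the DataFrame whose address is a subset of another address belonging to the same company. For example, of these five rows I want to keep only these two: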
index  company  url                  address
0      A        www.abc.contact.com  16D Bayberry Rd,New Bedford,MA,02740,USA
3      B        www.pqr.com          New Bedford,USA
Best answer
This may not be the optimal solution, but it works on this small DataFrame:
import pandas as pd

df = pd.DataFrame({
    "company": ['A', 'A', 'B', 'B'],
    "address": ['16D Bayberry Rd,USA', 'USA', 'USA', 'New Bedford,USA'],
})

# Split each address on commas and turn it into a set so "issubset" can be used later
addresses = [set(a.split(',')) for a in df['address']]
companies = list(df['company'])
rows_to_drop = []  # Row indexes to drop

# Iterate over every address
for i, (address, company) in enumerate(zip(addresses, companies)):
    # Compare it against all the remaining addresses
    rem_addr = addresses[:i] + addresses[i + 1:]
    rem_comp = companies[:i] + companies[i + 1:]
    for other_addr, other_comp in zip(rem_addr, rem_comp):
        # If the address is a subset of another address of the same company, mark it for dropping
        if address.issubset(other_addr) and company == other_comp:
            rows_to_drop.append(i)
            break

df = df.drop(rows_to_drop)
print(df)
Output:
   company              address
0        A  16D Bayberry Rd,USA
3        B      New Bedford,USA
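The loop above compares every address with every other address in the whole frame, and because a set counts as a subset of an equal set, two identical addresses for the same company would both be dropped. A possible variation, offered only as a sketch, groups by company first and uses the strict-subset operator `<` so that exact duplicates survive. The helper name `drop_subset_addresses` and its parameters are hypothetical, and the sketch assumes the index labels are unique.

```python
import pandas as pd


def drop_subset_addresses(df, company_col="company", address_col="address"):
    """Hypothetical helper: drop rows whose address parts are a strict
    subset of another row's address parts within the same company."""
    # Turn each address into a frozenset of comma-separated, trimmed parts
    parts = df[address_col].apply(
        lambda s: frozenset(p.strip() for p in s.split(","))
    )
    labels_to_drop = []
    # Only compare addresses that belong to the same company
    for _, group in parts.groupby(df[company_col]):
        for label, current in group.items():
            # "<" is the strict-subset test, so equal sets do not match
            if any(current < other for other in group):
                labels_to_drop.append(label)
    # Assumes unique index labels; drop by label rather than by position
    return df.drop(index=labels_to_drop)


df = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "address": ["16D Bayberry Rd,USA", "USA", "USA", "New Bedford,USA"],
})

result = drop_subset_addresses(df)
print(result)
```

On the small frame above this keeps rows 0 and 3, the same rows the loop keeps. It is still quadratic within each company, so it is a readability change more than a performance fix.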