我有两个熊猫数据框,每个都有不同的大小,每个记录超过100万条.
我希望比较这两个数据框并找出差异.
数据框
ID Name Age Sex
1A1 Cling 21 M
1B2 Roger 22 M
1C3 Stew 23 M
数据框
ID FullName Gender Age
1B2 Roger M 21
1C3 Rick M 23
1D4 Ash F 21
DataFrameB将始终具有比DataFrameA多的记录,但是在DataFrameA中找到的记录可能仍然不在DataFrameB中.
DataFrameA和DataFrameB中的列名称不同.我将映射存储在另一个数据框中.
MappingDataFrame
DataFrameACol DataFrameBCol
ID ID
Name FullName
Age Age
Sex Gender
我正在寻找比较这两者,并在结果旁边添加一列.
DataFrame A的列名加法器=“ _A_Txt”
数据框B的列名加法器=“ _B_Txt”
预期产量
ID Name_A_Txt FullName_B_Text Result_Name Age_A_Txt Age_B_Txt Result_Age
1B2 Roger Roger Match ... ...
1C3 Stew Rick No Match ... ...
我目前正在使用For循环来构建此逻辑.但是,要完成一百万条记录需要花费很多时间.我让程序运行了超过50分钟,但并没有像实时那样完成,我正在为100多个列构建程序.
我将为这个问题打开赏金并授予赏金,即使在打开它作为奖励之前已经回答了该问题.因为,我一直在为使用For循环迭代而为性能而苦苦挣扎.
要以DataFrameA和DataFrameB开头,请使用以下代码,
import pandas as pd
d = {
'ID':['1A1','1B2','1C3'],'Name':['Cling','Roger','Stew'],'Age':[21,22,23],'Sex':['M','M','M']
}
DataFrameA = pd.DataFrame(d)
d = {
'ID':['1B2','1C3','1D4'],'FullName':['Roger','Rick','Ash'],'Gender':['M','F'],23,21]
}
DataFrameB = pd.DataFrame(d)
我相信,这个问题与Coldspeed提供的建议(对联接的定义)有些不同,因为它还涉及查找不同的列名并添加新的结果列.同样,列名需要在结果端进行转换.
OutputDataFrame如下所示,
为了更好地理解读者,我在专栏中
按行中的名称
Col 1 - ID (Coming from DataFrameA)
Col 2 - Name_X (Coming from DataFrameA)
Col 3 - FullName_Y (Coming from DataFrameB)
Col 4 - Result_Name (Name is what is there in DataFrameA and this is a comparison between Name_X and FullName_Y)
Col 5 - Age_X (Coming from DataFrameA)
Col 6 - Age_Y (Coming From DataFrameB)
Col 7 - Result_Age (Age is what is there in DataFrameA and this is a result between Age_X and Age_Y)
Col 8 - Sex_X (Coming from DataFrameA)
Col 9 - Gender_Y (Coming from DataFrameB)
Col 10 - Result_Sex (Sex is what is there in DataFrameA and this is a result between Sex_X and Gender_Y)
最佳答案
m = list(mapping_df.set_index('DataFrameACol')['DataFrameBCol']
.drop('ID')
.iteritems())
m[m.index(('Age','Age'))] = ('Age_x','Age_y')
m
# [('Name','FullName'),('Age_x','Age_y'),('Sex','Gender')]
从内部合并开始:
df3 = (df1.merge(df2,how='inner',on=['ID'])
.reindex(columns=['ID',*(v for V in m for v in V)]))
df3
ID Name FullName Age_x Age_y Sex Gender
0 1B2 Roger Roger 22 21 M M
1 1C3 Stew Rick 23 23 M M
现在,比较列并使用np.where设置值:
l,r = map(list,zip(*m))
matches = (df3[l].eq(df3[r].rename(dict(zip(r,l)),axis=1))
.add_prefix('Result_')
.replace({True: 'Match',False: 'No Match'}))
for k,v in m:
name = f'Result_{k}'
df3.insert(df3.columns.get_loc(v)+1,name,matches[name])
df3.columns
# Index(['ID','Name','FullName','Result_Name','Age_x','Age_y',# 'Result_Age_x','Sex','Gender','Result_Sex'],# dtype='object')
df3.filter(like='Result_')
Result_Name Result_Age_x Result_Sex
0 Match No Match Match
1 No Match Match Match