我正在尝试将一个非常混乱的多嵌套JSON加载到熊猫中.我已经在使用json_normalize,但是试图弄清楚我如何加入2个similair嵌套字典,以及解压缩它们的子字典和列表,这一直困扰着我.我对大熊猫的了解有限,但我假设如果可以解决的话,我可以利用它的性能优势.
我有2个包含战争数据的字典,一个是从JSON API响应中加载的,另一个是在数据库中的.我正在尝试比较2个新的攻击和防御.
战争实例
{
"state": "active","team_size": 20,"teams": {
"id": "12345679","name": "Good Guys","level": 10,"attacks": 4,"destruction_percentage": 22.6,"members": [
{
"id": "1","name": "John","level": 12
},{
"id": "2","name": "Tom","level": 11,"attacks": [
{
"attackerTag": "2","defenderTag": "4","damage": 64,"order": 7
}
]
}
]
},"opponent": {
"id": "987654321","name": "Bad Guys","level": 17,"attacks": 5,"damage": 20.95,"members": [
{
"id": "3","name": "Betty","attacks": [
{
"attacker_id": "3","defender_id": "1","damage": 70,"order": 1
},{
"attacker_id": "3","defender_id": "7","damage": 100,"order": 11
}
],"opponentAttacks": 0,"some_useless_data": "Want to ignore,this doesn't show in every record"
},{
"id": "4","name": "Fred","level": 9,"attacks": [
{
"attacker_id": "4","defender_id": "9","order": 4
}
],"opponentAttacks": 0
}
]
}
}
现在,我假设就性能而言,熊猫将是我最好的选择,而不是将它们拉在一起并遍历每个成员并进行比较.
因此,我试图获得一个很好地展平且易于遍历的数据框的尝试至少很难说.最佳情况下,我假设采用以下布局.我只是想让两个团队都只有所有成员进入一个df.我们可以省略state和team_size键,而只关注获得每个成员及其各自的攻击和team_id
df示例(预期结果):
id name level attacks member.team_id ...
1 John 12 NaN "123456789"
2 Tom 11 [{...}] "123456789"
3 Betty 17 [{...},{...}] "987654321"
4 Fred 9 [{...}] "987654321"
这就是我想要df的基本要点.因此,我可以同时使用两个数据帧并比较新的攻击.
注意,在我尝试之前,我只是从字典中弹出()状态和team_size,因为我想要的是所有成员,团队几乎嵌入其中
我没有运气就尝试了以下方法,我知道这不是正确的方法,因为它在字典树上向后工作.
old_df = json_normalize(war,'members',['id','name','level','attacks'],record_prefix='member')
#Traceback (most recent call last):
# File "test.py",line 83,in <module>
# new_parse(old_war,new_war)
# File "test.py",line 79,in new_parse
# record_prefix='member')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py",line 262,in json_normalize
# _recursive_extract(data,record_path,{},level=0)
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py",line 238,in _recursive_extract
# recs = _pull_field(obj,path[0])
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py",line 185,in _pull_field
# result = result[spec]
#KeyError: 'members'
df = pd.DataFrame.from_dict(old,orient='index')
df.droplevel('members')
#Traceback (most recent call last):
# File "test.py",line 106,line 87,in new_parse
# df.droplevel('members')
# File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py",line 4376,in __getattr__
# return object.__getattribute__(self,name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'
我感谢任何指导!希望我投入了足够的精力来帮助理解我的预期结果,如果没有,请告诉我!
编辑
公平地说,我确实知道如何做到这一点,只是循环字典并创建一个具有适当日期的新成员列表,但是我觉得这比使用熊猫要低得多,因为我要在数百万次战争中这样做线程化的应用程序以及我可以从中获得的每一点性能,这对我和应用程序都是一个好处. – 再次感谢!
最佳答案
我相信您可以使用:
need = ['member.id','member.name','member.level','member.attacks','id']
df1 = json_normalize(war['teams'],record_prefix='member.')[need]
#print (df1)
df2 = json_normalize(war['opponent'],record_prefix='member.')[need]
#print (df2)
df1.columns = np.where(df1.columns.str.startswith('member.'),df1.columns.str.split('.',n=1).str[1],'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'),df2.columns.str.split('.','member.' + df2.columns)
df = pd.concat([df1,df2],sort=False,ignore_index=True)
print (df)
id name level attacks \
0 1 John 12 NaN
1 2 Tom 11 [{'attackerTag': '2','defenderTag': '4','dam...
2 3 Betty 17 [{'attacker_id': '3','defender_id': '1','dam...
3 4 Fred 9 [{'attacker_id': '4','defender_id': '9','dam...
member.id
0 12345679
1 12345679
2 987654321
3 987654321