I have the following wide dataset:
import pandas as pd
from io import StringIO
testcsv = """P,N,N_relerr,F,F_relerr
10,6073.98,0.0022,61.973,0.0036
12,6412.97,0.0021,65.405,0.0036
4,4141.24,0.0019,42.8202,0.0032
6,5009.83,0.0019,51.9615,0.0031
8,5601.87,0.0025,57.8129,0.0042"""
csvfile = StringIO(testcsv)
df = pd.read_csv(csvfile)
P N N_relerr F F_relerr
0 10 6073.98 0.0022 61.9730 0.0036
1 12 6412.97 0.0021 65.4050 0.0036
2 4 4141.24 0.0019 42.8202 0.0032
3 6 5009.83 0.0019 51.9615 0.0031
4 8 5601.87 0.0025 57.8129 0.0042
I want to turn it into a long dataset that holds the "counts" (the N and F columns) together with their associated errors (N_relerr and F_relerr):
P which count err
0 10 N 6073.9800 0.0022
1 12 N 6412.9700 0.0021
2 4 N 4141.2400 0.0019
3 6 N 5009.8300 0.0019
4 8 N 5601.8700 0.0025
5 10 F 61.9730 0.0036
6 12 F 65.4050 0.0036
7 4 F 42.8202 0.0032
8 6 F 51.9615 0.0031
9 8 F 57.8129 0.0042
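This is the format I need in order to draw error bars with plotnine, with the 'N' and 'F' counts distinguished from each other. My current, rather ugly, solution is: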
dflong = (df[['P','N','F']]
.melt(id_vars=['P'],var_name='which',value_name='count'))
dferr = (df[['P','N_relerr','F_relerr']]
.melt(id_vars=['P'],value_name='count_relerr'))
dflong['err'] = dferr['count_relerr'].copy()
My guess is that there is an elegant way to do this with MultiIndex columns and stack, starting from a dataset that looks like this:
N F
P counts relerr counts relerr
0 10 6073.98 0.0022 61.9730 0.0036
1 12 6412.97 0.0021 65.4050 0.0036
2 4 4141.24 0.0019 42.8202 0.0032
3 6 5009.83 0.0019 51.9615 0.0031
4 8 5601.87 0.0025 57.8129 0.0042
I can create that DataFrame as follows:
cols = {'P': 'P','N': ('N','counts'),'N_relerr': ('N','relerr'),'F': ('F','counts'),'F_relerr': ('F','relerr')}
nested_df = df.rename(columns=cols)
nested_df.columns = [c if isinstance(c,tuple)
else ('',c) for c in nested_df.columns]
nested_df.columns = pd.MultiIndex.from_tuples(nested_df.columns)
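(I suspect there must be a better way to do that as well), but I have not yet figured out how to use stack effectively to get what I want.
Does anyone know the canonical solution? Thanks!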
Best answer
You can use pd.wide_to_long, which is well suited to this kind of "simultaneous melt"; you only need to rename the columns slightly first.
import pandas as pd
from io import StringIO
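testcsv = """P,N,N_relerr,F,F_relerr
10,6073.98,0.0022,61.973,0.0036
12,6412.97,0.0021,65.405,0.0036
4,4141.24,0.0019,42.8202,0.0032
6,5009.83,0.0019,51.9615,0.0031
8,5601.87,0.0025,57.8129,0.0042"""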
csvfile = StringIO(testcsv)
df = pd.read_csv(csvfile)
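# Rename the columns to the stub_suffix pattern that wide_to_long expects.
# set_axis returns a new DataFrame; the inplace argument was removed in pandas 2.0.
d1 = df.set_axis(['P','Count_N','Err_N','Count_F','Err_F'],axis=1)
# Reshape: the stubs 'Count' and 'Err' become columns, the suffix (N or F) goes into 'which'
pd.wide_to_long(d1,stubnames=['Count','Err'],i='P',j='which',sep='_',suffix='.+')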
Output:
Count Err
P which
10 N 6073.9800 0.0022
12 N 6412.9700 0.0021
4 N 4141.2400 0.0019
6 N 5009.8300 0.0019
8 N 5601.8700 0.0025
10 F 61.9730 0.0036
12 F 65.4050 0.0036
4 F 42.8202 0.0032
6 F 51.9615 0.0031
8 F 57.8129 0.0042
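The question also asks whether MultiIndex columns and stack can do the same job. The following is a minimal sketch of that route, not the accepted method above. It assumes the column names follow the pattern in the question (`N`, `N_relerr`, ...), and the inner level labels `count` and `err` are chosen here only to match the target layout. Row order of the result can differ between pandas versions, and pandas 2.1+ may emit a FutureWarning about the upcoming stack implementation.

```python
import pandas as pd
from io import StringIO

testcsv = """P,N,N_relerr,F,F_relerr
10,6073.98,0.0022,61.973,0.0036
12,6412.97,0.0021,65.405,0.0036
4,4141.24,0.0019,42.8202,0.0032
6,5009.83,0.0019,51.9615,0.0031
8,5601.87,0.0025,57.8129,0.0042"""
df = pd.read_csv(StringIO(testcsv))

# Move P into the index so only the measurement columns remain.
wide = df.set_index('P')

# Build two-level column labels (which, kind):
# 'N' -> ('N', 'count'), 'N_relerr' -> ('N', 'err'), and likewise for F.
# The labels 'count' and 'err' are assumptions made to match the target layout.
wide.columns = pd.MultiIndex.from_tuples(
    [(c.split('_')[0], 'err' if c.endswith('_relerr') else 'count')
     for c in wide.columns],
    names=['which', None],
)

# Stack the outer level ('which') into the rows, then flatten the index
# so that P and which become ordinary columns.
long_df = wide.stack(level='which').reset_index()
print(long_df)
```

The result has the columns `P`, `which`, `count` and `err`, one row per (P, which) pair. Calling `.reset_index()` on the `pd.wide_to_long` result above flattens it in the same way.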