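I'm doing machine learning computations with two DataFrames: one for the factors and the other for the target values. I have to split both into training and test parts. I think I've found a way, but I'm looking for a more elegant solution. Here is my code: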
    import pandas as pd
    import numpy as np
    import random

    # Five rows each; df_source uses the index 0, 2, 4, 6, 8
    df_source = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
    df_target = pd.DataFrame(np.random.randn(5, 2), columns=list('CD'))

    # Pick two row positions at random for the training part
    rows = np.asarray(random.sample(range(0, len(df_source)), 2))

    # Training rows are selected by position; test rows are everything else
    df_source_train = df_source.iloc[rows]
    df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
    df_target_train = df_target.iloc[rows]
    df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

    print('rows')
    print(rows)
    print('source')
    print(df_source)
    print('source train')
    print(df_source_train)
    print('source_test')
    print(df_source_test)
--- Edit: unutbu's solution (modified) ---
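    np.random.seed(2013)
    percentile = .6  # expected fraction of rows that go to the training part

    # Boolean mask: each row is True (train) with probability `percentile`
    rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

    df_source_train = df_source[rows]
    df_source_test = df_source[~rows]
    df_target_train = df_target[rows]
    df_target_test = df_target[~rows]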
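Note that the binomial draw above only gives the training fraction in expectation, so the actual number of training rows varies from run to run. If an exact count matters, one option is to build the mask from a random choice of positions instead. This is only a rough sketch: the helper name `exact_fraction_mask` and its parameters are made up for illustration, and it uses `np.random.default_rng` rather than the global seed used above.

```python
import numpy as np
import pandas as pd

def exact_fraction_mask(n_rows, train_fraction, seed=None):
    """Hypothetical helper: boolean mask with exactly round(n_rows * train_fraction) True values."""
    rng = np.random.default_rng(seed)
    n_train = int(round(n_rows * train_fraction))
    mask = np.zeros(n_rows, dtype=bool)
    # Choose n_train distinct positions and mark them as training rows
    mask[rng.choice(n_rows, size=n_train, replace=False)] = True
    return mask

df_source = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5, 2), columns=list('CD'))

rows = exact_fraction_mask(len(df_source), 0.6, seed=2013)

# The same mask splits both frames, so source and target rows stay paired
df_source_train, df_source_test = df_source[rows], df_source[~rows]
df_target_train, df_target_test = df_target[rows], df_target[~rows]
```

With 5 rows and a fraction of 0.6, this always puts 3 rows in the training part and 2 in the test part; which rows they are depends on the seed.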
Solution
If you make `rows` a boolean array of length `len(df)`, then you can get the `True` rows with `df[rows]` and the `False` rows with `df[~rows]`:
    import pandas as pd
    import numpy as np

    np.random.seed(2013)
    df_source = pd.DataFrame(
        np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

    # Random boolean mask, one entry per row
    rows = np.random.randint(2, size=len(df_source)).astype('bool')

    df_source_train = df_source[rows]
    df_source_test = df_source[~rows]

    print(rows)
    # [ True  True False  True False]

    # if for some reason you need the index values of where `rows` is True
    print(np.where(rows))
    # (array([0, 1, 3]),)

    print(df_source)
    #           A         B
    # 0  0.279545  0.107474
    # 2  0.651458 -1.516999
    # 4 -1.320541  0.679631
    # 6  0.833612  0.492572
    # 8  1.555721  1.741279

    print(df_source_train)
    #           A         B
    # 0  0.279545  0.107474
    # 2  0.651458 -1.516999
    # 6  0.833612  0.492572

    print(df_source_test)
    #           A         B
    # 4 -1.320541  0.679631
    # 8  1.555721  1.741279
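Because `rows` is a plain NumPy boolean array, pandas applies it by position rather than by index label, so the same mask can be reused on a second frame with a different index (as with `df_source` and `df_target` in the question). A minimal sketch of wrapping that up, assuming all frames have the same number of rows in the same order; `split_by_mask` is a hypothetical helper name, not a pandas function:

```python
import numpy as np
import pandas as pd

def split_by_mask(mask, *frames):
    """Hypothetical helper: split each frame into (True rows, False rows) with one shared mask."""
    mask = np.asarray(mask, dtype=bool)
    for frame in frames:
        if len(frame) != len(mask):
            raise ValueError("mask length must match the number of rows in every frame")
    # A NumPy boolean array selects rows by position, so differing indexes are fine
    return [(frame[mask], frame[~mask]) for frame in frames]

np.random.seed(2013)
df_source = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5, 2), columns=list('CD'))

rows = np.random.randint(2, size=len(df_source)).astype(bool)
(source_train, source_test), (target_train, target_test) = split_by_mask(
    rows, df_source, df_target)
```

One caveat: this relies on `rows` being an array. A boolean `Series` would be aligned on its index labels instead, which is not what you want when the two frames are indexed differently.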