I want to hash the "Genre" feature into 6 columns and, separately, hash "Publisher" into another 6 columns. I want something like the following:
Genre Publisher 0 1 2 3 4 5 0 1 2 3 4 5
0 Platform Nintendo 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
1 Racing Noir -1.0 0.0 0.0 0.0 0.0 -1.0 -1.0 0.0 0.0 0.0 0.0 -1.0
2 Sports Laura -2.0 2.0 0.0 -2.0 0.0 0.0 -2.0 2.0 0.0 -2.0 0.0 0.0
3 Roleplaying John -2.0 2.0 2.0 0.0 1.0 0.0 -2.0 2.0 2.0 0.0 1.0 0.0
4 Puzzle John 0.0 1.0 1.0 -2.0 1.0 -1.0 0.0 1.0 1.0 -2.0 1.0 -1.0
5 Platform Noir 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
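The following code does what I want:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Both lists must have the same length; the Publisher values follow the table above.
d = {'Genre': ['Platform','Racing','Sports','Roleplaying','Puzzle','Platform'],'Publisher': ['Nintendo','Noir','Laura','John','John','Noir']}
df = pd.DataFrame(data=d)

# One hasher per feature, each producing 6 columns.
fh1 = FeatureHasher(n_features=6,input_type='string')
fh2 = FeatureHasher(n_features=6,input_type='string')
# input_type='string' expects every sample to be an iterable of strings, so each
# category is wrapped in a one-element list (a bare string is hashed character by
# character by older scikit-learn releases and rejected by newer ones).
hashed_features1 = fh1.fit_transform([[g] for g in df['Genre']]).toarray()
hashed_features2 = fh2.fit_transform([[p] for p in df['Publisher']]).toarray()
pd.concat([df[['Genre','Publisher']],pd.DataFrame(hashed_features1),pd.DataFrame(hashed_features2)],axis=1)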
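Best answer
Hashing (update)
If new categories may show up in some of the features, hashing is the way to go. Just two notes:
> Be aware of the possibility of collisions and adjust the number of features accordingly.
> In your case, you want to hash each feature separately.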
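To make the collision note concrete, here is a minimal sketch (not part of the original answer) that hashes ten made-up category names into 6 columns and into 1024 columns, then counts how many distinct columns are actually used. Each category is wrapped in a one-element list so that it is hashed as a single token. The category names and the helper `used_columns` are hypothetical.

```python
from sklearn.feature_extraction import FeatureHasher

def used_columns(categories, n_features):
    """Return the set of column indices that receive a non-zero value."""
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    # Each sample is a one-element list, so every category is one token.
    hashed = hasher.transform([[c] for c in categories])
    return set(hashed.nonzero()[1].tolist())

categories = [f'genre_{i}' for i in range(10)]  # hypothetical category names

small = used_columns(categories, n_features=6)
large = used_columns(categories, n_features=1024)

# 10 categories cannot fit into 6 columns without sharing, so collisions are guaranteed.
print(len(small), 'columns used out of 6')
# A larger hash space makes collisions less likely, though never impossible.
print(len(large), 'columns used out of 1024')
```

With only 6 columns at least four of the ten categories must share a column with another one, which is why the number of features should grow with the number of categories you expect.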
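One-hot vector
If the number of categories for each feature is fixed and not too large, use one-hot encoding.
I would suggest using one of the following two: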
> sklearn.preprocessing.OneHotEncoder
> pandas.get_dummies
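The practical difference between the two families shows up when a category appears that was not present when the encoder was fitted. The following is a minimal sketch, assuming scikit-learn's default `handle_unknown='error'` for `OneHotEncoder`; the small data frames are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'feature_2': ['cat', 'dog', 'elephant']})
new = pd.DataFrame({'feature_2': ['zebra']})  # category not seen during fit

# One-hot encoding learns a fixed vocabulary, so an unseen category is rejected
# by default (handle_unknown='ignore' would encode it as all zeros instead).
encoder = OneHotEncoder().fit(train)
try:
    encoder.transform(new)
    one_hot_failed = False
except ValueError:
    one_hot_failed = True

# Hashing is stateless: any string maps to one of the n_features columns.
hasher = FeatureHasher(n_features=6, input_type='string')
hashed_new = hasher.transform([[v] for v in new['feature_2']]).toarray()

print(one_hot_failed)    # True
print(hashed_new.shape)  # (1, 6)
```

The hasher never needs to be refitted when new categories arrive, at the cost of possible collisions and of columns that cannot be mapped back to category names.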
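Example
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'feature_1': ['A','G','T','A'],'feature_2': ['cat','dog','elephant','zebra']})

# Approach 0 (hashing per feature)
# Note: each column reaches FeatureHasher as bare strings, which older scikit-learn
# releases hash character by character and newer ones reject with a ValueError.
n_orig_features = df.shape[1]
hash_vector_size = 6
ct = ColumnTransformer([(f't_{i}',FeatureHasher(n_features=hash_vector_size,input_type='string'),i) for i in range(n_orig_features)])
res_0 = ct.fit_transform(df) # res_0.shape[1] == n_orig_features * hash_vector_size

# Approach 1 (one-hot vectors via pandas); dtype=int gives 0/1 rather than booleans
res_1 = pd.get_dummies(df,dtype=int)

# Approach 2 (one-hot vectors via scikit-learn)
# sparse_output replaces the older `sparse` argument in scikit-learn >= 1.2
res_2 = OneHotEncoder(sparse_output=False).fit_transform(df)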
res_0:
array of shape (4, 12), i.e. 6 hashed columns for each of the 2 features (the printed values are truncated in the source)
res_1:
feature_1_A feature_1_G feature_1_T feature_2_cat feature_2_dog feature_2_elephant feature_2_zebra
0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0
2 0 0 1 0 0 1 0
3 1 0 0 0 0 0 1
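res_2:
array([[1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1.]])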
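As a complement to Approach 0, here is a hedged sketch of the same per-feature hashing done by hand with `scipy.sparse.hstack`. It assumes each category should be hashed as a single token, so every value is wrapped in a one-element list; recent scikit-learn releases require this, since they reject bare strings as samples for `input_type='string'`. The helper name `hash_columns` and the generated column names are hypothetical, and the data is the one from the question.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction import FeatureHasher

def hash_columns(frame, columns, n_features=6):
    """Hash each listed column separately into n_features columns."""
    blocks = []
    for col in columns:
        hasher = FeatureHasher(n_features=n_features, input_type='string')
        # One-element lists make each category a single hashed token.
        blocks.append(hasher.transform([[str(v)] for v in frame[col]]))
    hashed = hstack(blocks).toarray()
    # Label the new columns per source column to avoid duplicate names.
    names = [f'{col}_{i}' for col in columns for i in range(n_features)]
    return pd.DataFrame(hashed, columns=names, index=frame.index)

df = pd.DataFrame({
    'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
    'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir'],
})
result = pd.concat([df, hash_columns(df, ['Genre', 'Publisher'])], axis=1)
print(result.shape)  # (6, 14)
```

Because every row contributes exactly one token per feature, each 6-column block contains a single +1 or -1 per row, and identical categories (such as the two 'Platform' rows) always produce identical hashed rows.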