Jea*_*Pat 11 python numpy pandas
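I'm trying to do something very similar to a previous question, but I'm getting an error. I have a pandas DataFrame containing features and labels, and I need to do some conversion to send the feature and label variables to a machine-learning object: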
import pandas
import milk
from scikits.statsmodels.tools import categorical
Then I have:
trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'
labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets
The console output first shows:
features
[[ 0.38846334 0.97681855]
[ 3.8318634 0.5724734 ]
[ 0.67710876 1.01816444]
[ 1.12024943 0.91508699]
[ 7.51749674 1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]
Then I get the following error:
Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'
Is it possible to convert the categorical variable 'type' in the DataFrame to int? 'type' can take the values 'single', 'touching', 'nuclei', 'dusts', and I need to convert them to the int values 0, 1, 2, 3.
tom*_*omp 18
The previous answers are outdated, so here is a solution for mapping strings to numbers that works with pandas 0.18.1.
For a Series:
In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
For a DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]),
Index([u'single', u'touching', u'nuclei', u'dusts'],
dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[6]:
labels labels_enc
0 single 0
1 touching 1
2 nuclei 2
3 dusts 3
4 touching 1
5 single 0
6 nuclei 2
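Note that pd.factorize assigns codes in order of first appearance, so the numbers depend on how the rows happen to be ordered. If the codes have to be fixed (single=0, touching=1, nuclei=2, dusts=3, as asked in the question), an explicit dictionary may be safer. A minimal sketch, assuming the column is called 'type' as in the question; type_to_code and type_enc are names made up for illustration, and to_numpy() assumes a reasonably recent pandas:

```python
import pandas as pd

df = pd.DataFrame({'type': ['single', 'touching', 'nuclei', 'dusts',
                            'touching', 'single', 'nuclei']})

# Hypothetical fixed mapping, taken from the codes requested in the question.
type_to_code = {'single': 0, 'touching': 1, 'nuclei': 2, 'dusts': 3}

# map() looks each label up in the dict; labels missing from the dict become NaN.
df['type_enc'] = df['type'].map(type_to_code)

# Plain NumPy array of integer targets, ready to hand to a classifier.
targets = df['type_enc'].to_numpy()
print(targets.tolist())
```

Because unmapped labels turn into NaN, it is probably worth checking df['type_enc'].isna().any() before training.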
Wes*_*ney 11
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has the attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
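This is meant for 1D vectors, so I'm not sure whether it can be applied to your problem right away, but have a look.
By the way, I recommend asking these questions on the statsmodels and/or scikit-learn mailing lists, since most of us are not frequent SO users.
I'm answering this question for pandas 0.10.1. Factor.from_array seems to do the trick.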
>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0 a
1 b
2 a
3 c
4 a
5 b
6 a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical:
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)
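In later pandas versions, Factor was renamed to Categorical (the repr above already prints "Categorical:"), and as far as I know the labels and levels attributes became codes and categories. A rough sketch of the same idea, assuming a reasonably recent pandas:

```python
import pandas as pd

s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
               'touching', 'single', 'nuclei'])

# Categorical sorts the unique values to build its categories by default.
cat = pd.Categorical(s)

codes = cat.codes        # one integer code per element
levels = cat.categories  # the sorted unique labels

print(list(codes))
print(list(levels))
```

The codes follow the sorted category order rather than the order of appearance, which is the same behaviour as the Factor labels shown earlier.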