Converting string objects to int/float using pandas

cru*_*xer 21 python csv pandas

import pandas as pd

path1 = "/home/supertramp/Desktop/100&life_180_data.csv"

mydf =  pd.read_csv(path1)

numcigar = {"Never":0 ,"1-5 Cigarettes/day" :1,"10-20 Cigarettes/day":4}

print(mydf['Cigarettes'])

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)

print(mydf['CigarNum'])

mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')

The CSV file "100&life_180_data.csv" contains columns such as Age, BMI, Cigarettes, and Alcohol.

No                int64
Age               int64
BMI             float64
Alcohol          object
Cigarettes       object
dtype: object

The Cigarettes column contains "Never", "1-5 Cigarettes/day", and "10-20 Cigarettes/day". I want to assign weights to these objects (Never, 1-5 Cigarettes/day, ...).

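The expected output is a new appended column, CigarNum, that contains only the numbers 0, 1, 2. CigarNum is as expected for roughly the first 8 rows and then shows NaN until the last row of the column.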

0                     Never
1                     Never
2        1-5 Cigarettes/day
3                     Never
4                     Never
5                     Never
6                     Never
7                     Never
8                     Never
9                     Never
10                    Never
11                    Never
12     10-20 Cigarettes/day
13       1-5 Cigarettes/day
14                    Never
...
167                    Never
168                    Never
169     10-20 Cigarettes/day
170                    Never
171                    Never
172                    Never
173                    Never
174                    Never
175                    Never
176                    Never
177                    Never
178                    Never
179                    Never
180                    Never
181                    Never
Name: Cigarettes, Length: 182, dtype: object

The output I get gives NaN after the first few rows, which it should not:

0      0
1      0
2      1
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10   NaN
11   NaN
12   NaN
13   NaN
14     0
...
167   NaN
168   NaN
169   NaN
170   NaN
171   NaN
172   NaN
173   NaN
174   NaN
175   NaN
176   NaN
177   NaN
178   NaN
179   NaN
180   NaN
181   NaN
Name: CigarNum, Length: 182, dtype: float64

EdC*_*ica 33

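OK, the first problem is that the values contain stray whitespace, so the dictionary lookup fails for those rows.

Fix this with the vectorised str methods:

# collapse runs of whitespace and trim the ends, keeping the single spaces that the dictionary keys rely on
mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(r'\s+', ' ', regex=True).str.strip()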

Now creating the new column should just work:

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
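To confirm that whitespace really is the cause, it can help to reproduce the failure on a tiny Series. The sketch below is only an illustration: the sample values in `raw` are made up, and the kind of whitespace in the real CSV is an assumption. It uses `Series.map` with the dictionary, which yields NaN for any value that is not a key. Because the keys themselves contain single spaces, the sketch normalises whitespace rather than removing every space.

```python
import pandas as pd

numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}

# Hypothetical sample: three of the four values carry stray whitespace.
raw = pd.Series(["Never", "Never ", " 1-5 Cigarettes/day", "10-20  Cigarettes/day"])

# repr() makes otherwise invisible whitespace visible.
print([repr(v) for v in raw.unique()])

# Values that are not dictionary keys are the ones that turn into NaN.
unmatched = raw[~raw.isin(list(numcigar))]
print(unmatched.tolist())

# Collapse whitespace runs to one space, then trim both ends.
clean = raw.str.replace(r"\s+", " ", regex=True).str.strip()

before = raw.map(numcigar)    # NaN wherever the key is not found
after = clean.map(numcigar)   # every value now matches a key
print(int(before.isna().sum()), int(after.isna().sum()))
```

If `unmatched` is empty on the real data, whitespace is not the problem and the mismatch lies elsewhere, for example a category missing from the dictionary.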

UPDATE

Thanks to @Jeff, as always, for pointing out superior ways to do things:

So you can call replace instead of calling apply:

mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types (convert_objects is the pre-0.17.0 API)
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)

You could also use the factorize method.
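As a rough sketch of the factorize route, using made-up sample values (the `cigs` Series is hypothetical and assumed to be free of stray whitespace already): `pd.factorize` returns an array of integer codes plus the unique values they stand for.

```python
import pandas as pd

# Hypothetical sample values, assumed to be whitespace-free already.
cigs = pd.Series(["Never", "1-5 Cigarettes/day", "Never", "10-20 Cigarettes/day"])

# codes[i] is the position of cigs[i] in uniques.
codes, uniques = pd.factorize(cigs)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['Never', '1-5 Cigarettes/day', '10-20 Cigarettes/day']
```

Note that the codes follow the order of first appearance, so they are arbitrary labels rather than chosen weights; a dictionary is still needed if "10-20 Cigarettes/day" has to map to 4.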

Thinking about it, why not just set the dict values to floats, so that you avoid the type conversion?

So:

numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
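A minimal sketch of that idea, assuming a small made-up Series in place of the real column (`cigs` is hypothetical): with float values in the dictionary, the mapped column comes out as float64 directly and no separate conversion step is needed.

```python
import pandas as pd

numcigar = {"Never": 0.0, "1-5 Cigarettes/day": 1.0, "10-20 Cigarettes/day": 4.0}

# Hypothetical stand-in for mydf['Cigarettes'].
cigs = pd.Series(["Never", "1-5 Cigarettes/day", "10-20 Cigarettes/day"])

# Series.map looks each value up in the dictionary.
cigar_num = cigs.map(numcigar)
print(cigar_num.dtype)     # float64
print(cigar_num.tolist())  # [0.0, 1.0, 4.0]
```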

Version 0.17.0 or newer

convert_objects is deprecated since 0.17.0 and has been replaced by to_numeric:

mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')

Here errors='coerce' returns NaN wherever a value cannot be converted to a number; without it, an exception is raised.
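The difference between the two behaviours can be seen on a small made-up Series (the values in `s` are hypothetical, chosen so that one entry is not numeric):

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a non-numeric label.
s = pd.Series(["0", "1", "4", "Never"])

# errors="coerce": the unparseable entry becomes NaN.
converted = pd.to_numeric(s, errors="coerce")
print(converted.tolist())  # [0.0, 1.0, 4.0, nan]

# Default (errors="raise"): the same entry raises a ValueError.
raised = False
try:
    pd.to_numeric(s)
except ValueError as exc:
    raised = True
    print("raised:", exc)
```

Coercing is convenient, but it also hides unexpected values, so checking `converted.isna().sum()` afterwards is a reasonable sanity check.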