Tuy*_*yen 3 data-import python-3.x
I am trying to read a dataset into a DataFrame named df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback; this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
It turned out that the CSV had been created on Mac OS and was being parsed on a Windows machine, which is where I got the UnicodeDecodeError. To get rid of the error, try passing encoding='mac_roman' to the read_csv method of the pandas library.
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()
Output:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Unnamed: 15 2014 2015
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 American Samoa .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
4 Andorra .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
The data is indeed not encoded as UTF-8. Everything is ASCII except for that single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
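One way to confirm that kind of claim is to scan the raw bytes for anything outside the ASCII range. The following is only a sketch: `find_non_ascii` is a hypothetical helper name, and the sample is the single byte string quoted above rather than the whole file.

```python
def find_non_ascii(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


# The one problematic field quoted in the answer, as raw bytes.
sample = b"Korea, Dem. People\x92s Rep."

hits = find_non_ascii(sample)
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")
```

On real data you would pass in the full downloaded file contents (as bytes) instead of `sample`.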
Decode it as Windows codepage 1252 instead, where 0x92 is the curly quote ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
sep=";", encoding='cp1252')
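To see why the choice of codec matters, the sketch below decodes the same bytes with the three encodings mentioned on this page. `try_decode` is a hypothetical helper introduced only for this illustration.

```python
raw = b"Korea, Dem. People\x92s Rep."


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning a short marker instead of raising on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"<error: {exc.reason}>"


# Compare how each candidate encoding interprets the 0x92 byte.
results = {enc: try_decode(raw, enc) for enc in ("utf-8", "cp1252", "mac_roman")}
for enc, text in results.items():
    print(f"{enc:>9}: {text}")
```

Both cp1252 and mac_roman decode without raising, but they disagree on 0x92: cp1252 maps it to the right single quotation mark, while mac_roman maps it to í. Avoiding the exception alone does not show that the encoding is the right one.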
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 \
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5
3 American Samoa .. .. .. .. .. .. .. .. .. ..
4 Andorra .. .. .. .. .. .. .. .. .. ..
2010 2011 2012 2013 Unnamed: 15 2014 2015
0 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 .. .. .. .. NaN .. ..
4 .. .. .. .. NaN .. ..
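However, I note that pandas seems to take the HTTP headers at face value too, and produces mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-encoded data: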
>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
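The round trip above can be reproduced without pandas or a network connection. This is a sketch of one plausible way such garbling arises (text encoded as UTF-8 somewhere along the way, then decoded as cp1252); it is an assumption for illustration, not a confirmed description of what pandas does internally.

```python
# Correctly decoded text: byte 0x92 in cp1252 is the right single quotation mark.
original = b"People\x92s".decode("cp1252")

# Simulate the garbling: the text is written out as UTF-8 (three bytes for the
# quote) and those bytes are then misread as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Reverse the mistake, as in the interactive session above.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

The repair only works because every byte of the UTF-8 sequence happens to be defined in cp1252; it is a diagnostic trick rather than a general fix.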
This is a known bug in pandas. You can work around it by using urllib.request to load the URL and passing the response object to pd.read_csv():
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
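If you prefer to take the decoding step entirely into your own hands, you can wrap the binary stream in a text wrapper before parsing. The sketch below uses the standard-library csv module and an in-memory `io.BytesIO` as a stand-in for the HTTP response; the header row and the numeric values are made-up placeholders, not rows from the real file.

```python
import csv
import io

# Stand-in for the binary HTTP response body (placeholder values, cp1252 bytes).
fake_response = io.BytesIO(b";2000;2001\nKorea, Dem. People\x92s Rep.;1.0;2.0\n")

# Decode explicitly as cp1252 so nothing downstream has to guess the encoding.
text_stream = io.TextIOWrapper(fake_response, encoding="cp1252", newline="")

# Parse the semicolon-separated text.
rows = list(csv.reader(text_stream, delimiter=";"))
print(rows[1][0])
```

The same wrapping should also work around the object returned by urllib.request.urlopen(), and pd.read_csv() accepts an already-decoded text stream in place of a path or URL.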