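I want to load a .csv file from the web and convert it into a pandas.DataFrame.
This is the target page where I want to find the .csv files:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
How can I load the .csv file for a given item on that page and convert it into a pandas.DataFrame?
It would also be great if I could get the address of the .csv file from the page.
That would let me write a function that takes an item name from the target page and returns the .csv file address, like:
def data(item):
    file = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'
    return file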
However, the .csv file addresses on the page do not all follow the same pattern.
For example:
https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv
Quite a few files are in different directories, so I need to search for the item and get the address of the corresponding .csv file.
pandas can read a CSV file directly from an HTTP link:
Example:
import pandas as pd

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)
Result:
    Unnamed: 0  decrease  rowpos  colpos treatment
0            1        57       1       1         D
1            2        95       2       1         E
..         ...       ...     ...     ...       ...
62          63         3       7       8         A
63          64        19       8       8         C
[64 rows x 5 columns]
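The Unnamed: 0 column in that output is the row-number column stored in the file. As a minimal sketch, assuming the file's first column has an empty header (as the output suggests), passing index_col=0 to read_csv uses that column as the index instead. The inline sample stands in for the downloaded file so the snippet runs without network access; its two rows are copied from the output above.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: an unnamed first column holding row numbers.
csv_text = """,decrease,rowpos,colpos,treatment
1,57,1,1,D
2,95,2,1,E
"""

# index_col=0 makes the first column the index rather than a data column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df)
```

The same index_col=0 argument can be passed when reading from the URL directly.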
Getting the links by scraping:
To get the links themselves from the index page, we can also use pandas to scrape the data. Something like:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
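This returns the data from the table on the page. Unfortunately, that is not enough for our purposes, because pandas extracts the text of each cell, not the links.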
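As an aside, pandas 1.5 and later accept an extract_links argument in read_html, which keeps each cell's href next to its text. The following is a minimal sketch on a made-up table fragment, not the real page: the column names and markup are assumptions, the doc/ paths are hypothetical, and a parser backend (lxml, or BeautifulSoup with html5lib) must be installed.

```python
import io

import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'

# Hypothetical fragment shaped like the index table (not copied from the page).
html = """
<table class="dataframe">
  <thead>
    <tr><th>Item</th><th>CSV</th><th>DOC</th></tr>
  </thead>
  <tbody>
    <tr><td>cars</td>
        <td><a href="csv/datasets/cars.csv">CSV</a></td>
        <td><a href="doc/datasets/cars.html">DOC</a></td></tr>
    <tr><td>Cuckoo</td>
        <td><a href="csv/Stat2Data/Cuckoo.csv">CSV</a></td>
        <td><a href="doc/Stat2Data/Cuckoo.html">DOC</a></td></tr>
  </tbody>
</table>
"""

# extract_links='body' turns every body cell into a (text, href) tuple;
# href is None for cells that contain no link.
df = pd.read_html(io.StringIO(html), attrs={'class': 'dataframe'},
                  header=0, extract_links='body')[0]

# Map each item name to the absolute address of its CSV file.
csv_urls = {item[0]: base_url + csv[1]
            for item, csv in zip(df['Item'], df['CSV'])}
print(csv_urls['Cuckoo'])
```

On older pandas versions this argument is not available.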
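Monkey-patching the scraper to get the links:
To get the URLs, we can monkey-patch the library, like so:
from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp

def _text_getter(self, obj):
    # Default behaviour: use the text of the cell.
    text = obj.text
    # For the CSV and DOC cells, return the link target instead.
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            # No link in this cell: keep the plain text.
            pass
    return text

# Replace the parser's text getter (a private pandas class, so this may break between versions).
bsp._text_getter = _text_getter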
Test code:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
# Print the item name and CSV address for the first few rows.
for _, row in df.head().iterrows():
    print('%-14s: %s' % (row.Item, row.csv))
Result:
AirPassengers : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales       : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD           : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2           : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde  : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
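If relying on a private pandas class is a concern, the same item-to-URL lookup can be sketched with only the standard library. This is an illustration under stated assumptions, not the method used above: LinkTableParser and csv_links are hypothetical names, and the table is assumed to have header cells labelled Item and CSV. The sample HTML is made up (the doc/ paths in particular) so the snippet runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTableParser(HTMLParser):
    """Collect table rows as lists of (text, href) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # [text parts, href] of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = [[], None]
        elif tag == 'a' and self._cell is not None and self._cell[1] is None:
            # Keep the first link found inside the cell.
            self._cell[1] = dict(attrs).get('href')

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append((''.join(self._cell[0]).strip(), self._cell[1]))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def csv_links(html, base_url):
    """Return {item name: absolute CSV address} for one HTML table."""
    parser = LinkTableParser()
    parser.feed(html)
    header = [text for text, _ in parser.rows[0]]
    item_col, csv_col = header.index('Item'), header.index('CSV')
    links = {}
    for row in parser.rows[1:]:
        if len(row) <= max(item_col, csv_col):
            continue  # skip short or malformed rows
        href = row[csv_col][1]
        if href:
            links[row[item_col][0]] = urljoin(base_url, href)
    return links


# Hypothetical fragment shaped like the index table (not copied from the page).
sample = """
<table class="dataframe">
<tr><th>Package</th><th>Item</th><th>CSV</th><th>DOC</th></tr>
<tr><td>datasets</td><td>cars</td>
    <td><a href="csv/datasets/cars.csv">CSV</a></td>
    <td><a href="doc/datasets/cars.html">DOC</a></td></tr>
<tr><td>Stat2Data</td><td>Cuckoo</td>
    <td><a href="csv/Stat2Data/Cuckoo.csv">CSV</a></td>
    <td><a href="doc/Stat2Data/Cuckoo.html">DOC</a></td></tr>
</table>
"""

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
links = csv_links(sample, base_url)
print(links['Cuckoo'])
```

For the real page, the HTML could be fetched with urllib.request.urlopen and the resulting address passed to pd.read_csv. If item names repeat across packages, later rows overwrite earlier ones in this dictionary; keying on (Package, Item) would avoid that.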