pol*_*ist 7 html beautifulsoup web-scraping pandas python-requests
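As part of my job, I need to check this page regularly for specific documents. I found that I can use pandas' read_html method to read the table into a DataFrame (which is handy, because I can easily query for specific documents by keyword). The problem I have now is that this method does not parse the links I need and stores plain text instead (specifically, I mean the second column, which contains numbers such as "1682/0/15-19").
The code I came up with is very simple: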
import pandas as pd
df = pd.read_html('http://www.vru.gov.ua/act_list')[0]
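This gives me a DataFrame with all the information I need except the links.
Is it possible to somehow get the links instead of plain text, and if so, how would I do that?
I know that I could get the href links if I used the Requests and BeautifulSoup libraries, but I do not know BeautifulSoup well enough to do that. Any hints, or should I just learn BeautifulSoup?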
chi*_*n88 10
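You can find tutorials with a quick Google search. You have to iterate through the tags to compile lists, and then convert those lists of data into a DataFrame:
You could also pull the table with read_html() as you did, but you would still need to go back and get the HTML links (see Option 2 below):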
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://www.vru.gov.ua/act_list'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

records = []
columns = []
for tr in table.find_all('tr'):
    ths = tr.find_all('th')
    if ths:
        # Header row: collect the column names.
        for each in ths:
            columns.append(each.text)
    else:
        # Data row: keep each cell's text, plus its href when the cell holds a link.
        record = []
        for each in tr.find_all('td'):
            try:
                link = each.find('a')['href']
                record.append(link)
                record.append(each.text)
            except (TypeError, KeyError):
                # No <a> tag (TypeError) or no href attribute (KeyError).
                record.append(each.text)
        records.append(record)

# The link is appended just before the text of the second column.
columns.insert(1, 'Link')
df = pd.DataFrame(data=records, columns=columns)
Option 2:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://www.vru.gov.ua/act_list'
df = pd.read_html(url)[0]

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

links = []
for tr in table.find_all('tr'):
    for each in tr.find_all('td'):
        try:
            links.append(each.find('a')['href'])
        except (TypeError, KeyError):
            # Cell has no <a> tag, or the tag has no href: skip it.
            pass

# Assumes exactly one link per table row, so that len(links) == len(df).
df['Link'] = links
Output:
print(df.to_string())
? Link ????? ??? ????????? ???? ????????? ????? ????????? ????i???
0 1 http://www.vru.gov.ua/act/18641 1682/0/15-19 ??????? 20-06-2019 ??? ?????????? ?????? ?.?. ? ?????? ????? ????...
1 2 http://www.vru.gov.ua/act/18643 1684/0/15-19 ?????? 20-06-2019 ??? ??????????? ????? ????? ????? ???? ???????...
2 3 http://www.vru.gov.ua/act/18644 1685/0/15-19 ?????? 20-06-2019 ??? ??????? ? ??????????? ????? ???????? ?????...
3 4 http://www.vru.gov.ua/act/18649 1690/0/15-19 ?????? 20-06-2019 ??? ??????????? ?????? ???????? ?????? ????? ?...
4 5 http://www.vru.gov.ua/act/18650 1691/0/15-19 ??????? 20-06-2019 ??? ???????????? ?????????????? ????????? ????...
5 6 http://www.vru.gov.ua/act/18651 1692/0/15-19 ??????? 20-06-2019 ??? ?????????? ??????? ????? ????? ???? ??????...
6 7 http://www.vru.gov.ua/act/18619 1660/3??/15-19 ?????? 19-06-2019 ??? ????????? ?????????????? ?????? ???????? ?...
7 8 http://www.vru.gov.ua/act/18620 1661/3??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ????? ?...
8 9 http://www.vru.gov.ua/act/18624 1665/3??/15-19 ?????? 19-06-2019 ??o ??????????? ????? ????? ??????? ??????????...
9 10 http://www.vru.gov.ua/act/18626 1667/3??/15-19 ?????? 19-06-2019 ??o ??????????? ????? ????? ??????? ??????????...
10 11 http://www.vru.gov.ua/act/18627 1668/3??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ????? ?...
11 12 http://www.vru.gov.ua/act/18628 1669/3??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ????? ?...
12 13 http://www.vru.gov.ua/act/18635 1676/2??/15-19 ?????? 19-06-2019 ??? ????????? ?????????????? ?????? ???????? ?...
13 14 http://www.vru.gov.ua/act/18638 1679/2??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ?????? ...
14 15 http://www.vru.gov.ua/act/18639 1680/2??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ????? ?...
15 16 http://www.vru.gov.ua/act/18640 1681/2??/15-19 ?????? 19-06-2019 ??? ??????? ? ????????? ?????????????? ????? ?...
16 17 http://www.vru.gov.ua/act/18607 1648/0/15-19 ??????? 18-06-2019 ??? ?????????? ????? ?.?. ? ?????? ????? ?????...
17 18 http://www.vru.gov.ua/act/18608 1649/0/15-19 ?????? 18-06-2019 ??? ????????? ??? ???????? ????? ????????? ?.?...
18 19 http://www.vru.gov.ua/act/18609 1650/0/15-19 ?????? 18-06-2019 ??? ????????? ??? ???????? ??????? ??????? ???...
19 20 http://www.vru.gov.ua/act/18610 1651/0/15-19 ?????? 18-06-2019 ??? ????????? ??? ???????? ??????? ?????? ????...
20 21 http://www.vru.gov.ua/act/18615 1656/0/15-19 ??????? 18-06-2019 ??? ???????????? ????????? ?????? ????? ???? ?...
21 22 http://www.vru.gov.ua/act/18586 1627/0/15-19 ??????? 13-06-2019 ??? ?????????? ??????????? ?.?. ...
22 23 http://www.vru.gov.ua/act/18589 1630/0/15-19 ??????? 13-06-2019 ??? ???????????? ???????? ????? ????? ???? ???...
23 24 http://www.vru.gov.ua/act/18590 1631/0/15-19 ??????? 13-06-2019 ??? ??????????? ??????????? ?.?.
24 25 http://www.vru.gov.ua/act/18591 1632/0/15-19 ??????? 13-06-2019 ??? ??????????? ????????? ?.?.
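If some rows might have no link, or the hrefs might be relative, assigning a flat list of links to the DataFrame can fail or misalign. The following is only a sketch of one way to keep a single link per row. The sample markup, the relative hrefs, the row without a link, and the helper name table_to_frame are all assumptions made for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (assumed structure).
SAMPLE_HTML = """
<table>
  <tr><th>No</th><th>Number</th><th>Date</th></tr>
  <tr><td>1</td><td><a href="/act/18641">1682/0/15-19</a></td><td>20-06-2019</td></tr>
  <tr><td>2</td><td>1683/0/15-19</td><td>20-06-2019</td></tr>
  <tr><td>3</td><td><a href="/act/18643">1684/0/15-19</a></td><td>20-06-2019</td></tr>
</table>
"""
BASE_URL = "http://www.vru.gov.ua/act_list"


def table_to_frame(html, base_url):
    """Build a DataFrame with the cell texts plus one 'Link' value per row."""
    table = BeautifulSoup(html, "html.parser").find("table")
    columns = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row, no data cells
        # First anchor in the row that actually has an href, if any.
        anchor = tr.find("a", href=True)
        link = urljoin(base_url, anchor["href"]) if anchor else None
        records.append([td.get_text(strip=True) for td in cells] + [link])
    return pd.DataFrame(records, columns=columns + ["Link"])


df = table_to_frame(SAMPLE_HTML, BASE_URL)
print(df.to_string())
```

Because the link is looked up per row rather than per cell, every row gets exactly one Link value (missing when the row has no anchor), so the column always lines up with the rest of the table.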
This is now available in pandas 1.5.0+ through the extract_links parameter of read_html.
extract_links - possible options: {None, "all", "header", "body", "footer"}
Table elements in the specified section(s) with <a> tags will have their href extracted.
from io import StringIO

import pandas as pd

html_table = """
<table>
<tr>
  <th>GitHub</th>
</tr>
<tr>
  <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
</tr>
</table>
"""

# With extract_links="all", each cell becomes a (text, href) tuple,
# so the body cell carries https://github.com/pandas-dev/pandas
df = pd.read_html(
    StringIO(html_table),  # recent pandas versions expect a file-like object
    extract_links="all"
)[0]
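For the table in the question, the extracted tuples would probably need to be split back into a text column and a link column. A minimal sketch, assuming pandas 1.5.0+ with lxml installed; the two-column sample table and its column names "No" and "Number" are made up for illustration and do not reproduce the real page.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample table (assumed column names, absolute hrefs).
html_table = """
<table>
  <tr><th>No</th><th>Number</th></tr>
  <tr><td>1</td><td><a href="http://www.vru.gov.ua/act/18641">1682/0/15-19</a></td></tr>
  <tr><td>2</td><td><a href="http://www.vru.gov.ua/act/18643">1684/0/15-19</a></td></tr>
</table>
"""

# extract_links="body" leaves the header alone and turns every body cell
# into a (text, href) tuple; href is None for cells without a link.
df = pd.read_html(StringIO(html_table), extract_links="body")[0]

# Split the tuples: keep the href in a new column, then reduce cells to text.
df["Link"] = df["Number"].map(lambda cell: cell[1])
df["Number"] = df["Number"].map(lambda cell: cell[0])
df["No"] = df["No"].map(lambda cell: cell[0])

print(df.to_string())
```

Using "body" rather than "all" keeps the column names as plain strings, which makes the follow-up column selection simpler.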