Convert an HTML table to CSV in Python

Ale*_*ont 5 python selenium beautifulsoup web-scraping pandas

I'm trying to scrape a table from a dynamic page. With the code below (which requires Selenium), I manage to retrieve the contents of the <table> element.

I would like to convert this table to CSV. I tried two approaches, and both failed:

  • pandas.read_html returns an error saying that html5lib is not installed, but it is: I can import it without any problem.
  • soup.find_all('tr') raises 'NoneType' object is not callable after running soup = BeautifulSoup(tablehtml).
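For reference, once a parser (lxml, or bs4 with html5lib) is importable in the same environment, pandas.read_html can do the conversion directly. One subtlety: get_attribute('innerHTML') returns only the table's inner content, without the wrapping <table> tag that read_html needs. A minimal sketch with hypothetical sample rows:

```python
import io
import pandas as pd

# innerHTML omits the <table> element itself, so re-wrap it before
# parsing (the rows below are hypothetical sample content).
inner = "<tr><td>Header1</td><td>Header2</td></tr><tr><td>Row 11</td><td>Row 12</td></tr>"
dfs = pd.read_html(io.StringIO("<table>" + inner + "</table>"))

# read_html returns one DataFrame per <table> found in the markup
dfs[0].to_csv("table.csv", index=False, header=False)
```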

Here is my code:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
import pandas as pd

main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table=driver.find_element_by_id("table_main")
tablehtml= table.get_attribute('innerHTML')

AXO*_*AXO 9

It might be more convenient here to use the csv module together with Selenium selectors:

import csv
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
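Note that this loop collects only td cells, so a header row built from th elements would come out empty; in Selenium the selector can simply be widened to 'td, th'. A quick way to sanity-check the row/cell logic without a browser is the same loop expressed with BeautifulSoup (hypothetical sample markup):

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical table with a <th> header row
html = """<table>
<tr><th>Header1</th><th>Header2</th></tr>
<tr><td>Row 11</td><td>Row 12</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
buf = io.StringIO()
wr = csv.writer(buf)
for tr in soup.find_all("tr"):
    # Select both header (th) and data (td) cells so the header row is kept
    wr.writerow([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])
print(buf.getvalue())
```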


Kru*_*ger 7

Without access to the table you're actually trying to scrape, I used this example:

<table>
<thead>
<tr>
    <td>Header1</td>
    <td>Header2</td>
    <td>Header3</td>
</tr>
</thead>  
<tr>
    <td>Row 11</td>
    <td>Row 12</td>
    <td>Row 13</td>
</tr>
<tr>
    <td>Row 21</td>
    <td>Row 22</td>
    <td>Row 23</td>
</tr>
<tr>
    <td>Row 31</td>
    <td>Row 32</td>
    <td>Row 33</td>
</tr>
</table>

and scraped it with:

from bs4 import BeautifulSoup as BS
content = ...  # contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]

The resulting object is a list of lists:

[
    [<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
    [<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
    [<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
    [<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]

...and you can write it to a file:

with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>', '').replace('</td>', '') for e in it) + '\n')
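One caveat: the str.replace step only strips literal <td>/</td> markers, so it breaks as soon as a cell tag carries attributes (e.g. <td class="x">), and a comma inside a cell would corrupt the CSV. Using get_text() plus the csv module sidesteps both issues. A sketch reusing the same sample table:

```python
import csv
from bs4 import BeautifulSoup

# Same sample table as above, flattened to one string
html = ("<table><tr><td>Header1</td><td>Header2</td></tr>"
        "<tr><td>Row 11</td><td>Row 12</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")
rows = [tr.find_all("td") for tr in soup.find_all("tr")]

with open("result.csv", "w", newline="") as f:
    wr = csv.writer(f)
    for row in rows:
        # get_text() works regardless of tag attributes or nesting,
        # and csv.writer quotes cells containing commas
        wr.writerow([td.get_text(strip=True) for td in row])
```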

which looks like this:

Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33