How can I scrape an HTML table to CSV?

Question

Nat*_*ong 40 screen-scraping

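The Problem

I use a tool at work that lets me run queries and get back HTML tables of information. I do not have any kind of back-end access to it.

A lot of this information would be much more useful if I could put it into a spreadsheet for sorting, averaging, and so on. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting on screen, insert commas and line breaks, and then copy the whole mess into Notepad and save it as a CSV. Any better ideas?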

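The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to choose "Paste Special" and select the "text" format. Otherwise it tried to paste everything into a single cell, even when I highlighted the whole spreadsheet.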

Answer 1

mko*_*ler 33

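- Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible)
- Paste it into Excel.
- Save as a CSV file

However, this is a manual solution, not an automated one.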

Answer 2

Tho*_*dur 12

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site like this one: fxquotes

Then...

# Python 3, using beautifulsoup4 (bs4)
import os
import urllib.request
from bs4 import BeautifulSoup

date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end
data = urllib.request.urlopen(fx_url).read()
soup = BeautifulSoup(data, 'html.parser')
# The CSV text sits inside the first <pre> element of the page.
data = soup.find('pre').get_text()
file_location = '/Users/location_edit_this'
# os.path.join inserts the path separator between directory and file name.
file_name = os.path.join(file_location, 'usd_aus.csv')
with open(file_name, 'w') as f:
    f.write(data)

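The query string above is glued together by hand, which makes it easy to misplace a parameter. As a rough sketch, the same URL can be assembled with the standard library's `urllib.parse.urlencode`. The helper name `build_fx_url` and its argument names are made up for illustration; the parameter names and values are the ones from the snippet above, and whether the site still accepts them is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.oanda.com/convert/fxhistory"


def build_fx_url(cur1, cur2, date_start, date_end):
    """Assemble the query URL (hypothetical helper, parameters from the snippet above)."""
    params = [
        ("date_fmt", "us"),
        ("date", date_end),
        ("date1", date_start),
        ("exch", cur1),
        ("exch2", cur1),
        ("expr", cur2),
        ("expr2", cur2),
        ("lang", "en"),
        ("margin_fixed", "0"),
        ("format", "CSV"),
        ("redirected", "1"),
    ]
    # safe="/" leaves the slashes in the dates unescaped, as in the hand-built URL.
    return BASE_URL + "?" + urlencode(params, safe="/")


url = build_fx_url("USD", "AUD", "01/01/08", "11/10/08")
print(url)
```

Keeping the parameters in a list of pairs preserves their order and makes each one visible on its own line.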

Edit: to get values from a table (example from: palewire)

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html, "html.parser")

# Find the first table whose border attribute is "1".
table = soup.find("table", border="1")

# Skip the header row, then read the cells of each remaining row.
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print("|".join(record))

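The loop above prints pipe-joined records, which breaks as soon as a field contains a pipe. Since the goal is a CSV file, one option is to hand the same tuples to the standard library's `csv.writer`, which takes care of quoting. This is only a sketch: the function name `records_to_csv`, the header names (borrowed from the loop's variable names), and the sample rows are made up, not data from the page.

```python
import csv
import io


def records_to_csv(records, out):
    """Write an iterable of row tuples to the file-like object `out` as CSV."""
    writer = csv.writer(out)
    # Header names are an assumption, taken from the variable names in the loop.
    writer.writerow(["rank", "artist", "album", "cover_link"])
    writer.writerows(records)


# Made-up sample rows standing in for the scraped records.
sample = [
    ("1", "Some Artist", "An Album, Deluxe", "covers/1.jpg"),
    ("2", "Another | Artist", "Second Album", "covers/2.jpg"),
]

buf = io.StringIO()
records_to_csv(sample, buf)
print(buf.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with `open(path, "w", newline="")`, as the `csv` module documentation recommends.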

Answer 3

Jua*_*rro 10

Here is my Python version, which uses the (currently) latest version of BeautifulSoup. That can be installed with, for example,

$ sudo easy_install beautifulsoup4


The script reads HTML from standard input and outputs the text found in all tables in proper CSV format.

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import sys
import csv

def cell_text(cell):
    # Join the cell's text fragments, each stripped of surrounding whitespace.
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # Match both data cells (<td>) and header cells (<th>).
        col = [cell_text(cell) for cell in row.find_all(['td', 'th'])]
        output.writerow(col)
    # Empty row between tables.
    output.writerow([])

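If installing beautifulsoup4 is not an option, roughly the same idea can be sketched with the standard library alone. This is not the script above: it swaps BeautifulSoup for `html.parser.HTMLParser`, the class name `TableRows` and the sample HTML are made up, and it assumes every row and cell has an explicit closing tag and that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from <td>/<th> elements, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently open, or None
        self._cell = None  # text fragments of the cell currently open, or None

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace and trim both ends.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


# Made-up sample input.
html_doc = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Widget, <b>large</b></td><td> 3 </td></tr></table>"
)
parser = TableRows()
parser.feed(html_doc)
parser.close()

buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
print(buf.getvalue())
```

Real-world pages often omit closing tags, which is where a forgiving parser such as BeautifulSoup tends to earn its keep.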

Answer 4

dkr*_*etz 5

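Even easier (because it saves the query for you for next time)...

In Excel

Data / Import External Data / New Web Query

will take you to a URL prompt. Enter your URL, and it will mark out the tables available on the page for import. Voilà.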

Answer 5

n8h*_*rie 5

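Two ways come to mind (especially for those of us who don't have Excel):

Google Spreadsheets has an excellent importHTML function:
- =importHTML("http://example.com/page/with/table", "table", index)
- The index starts at 1
- I recommend doing a copy and paste values shortly after the import
- File -> Download as -> CSV
Python's superb Pandas library has handy read_html and to_csv functions
- Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.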
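
For readers without the linked script at hand, here is a minimal sketch of what such a prompt-driven script could look like. It is not the script the answer refers to. The names `pick_table` and `main` are made up, counting tables from 1 (to match importHTML) is a choice made here, and `pandas.read_html` additionally needs an HTML parser such as lxml installed.

```python
def pick_table(tables, number):
    """Return table `number`, counting from 1 as importHTML does."""
    if not 1 <= number <= len(tables):
        raise IndexError(f"table {number} not found; page has {len(tables)}")
    return tables[number - 1]


def main():
    # pandas is third-party; it is imported here so pick_table works without it.
    import pandas as pd

    url = input("URL of the page: ")
    tables = pd.read_html(url)  # one DataFrame per <table> found
    number = int(input(f"Found {len(tables)} table(s). Which one (1-{len(tables)})? "))
    csv_path = input("CSV file name: ")
    # index=False leaves the DataFrame's row numbers out of the file.
    pick_table(tables, number).to_csv(csv_path, index=False)
    print(f"Wrote {csv_path}")
```

Calling `main()` starts the prompts. `read_html` raises `ValueError` when the page contains no tables, which this sketch leaves unhandled.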

Asked: 17 years, 3 months ago
Viewed: 75,690 times
Last active: 8 years, 10 months ago