mer*_*980 13 python csv beautifulsoup python-2.7
Good evening. I am using BeautifulSoup to extract some data from a website, as follows:
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"})
print table
This gives the following output:
[<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257"> </a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td> </td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td> </td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal & Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment & Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td> </td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td> </td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td> </td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life & Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td> </td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td> </td>
</tr>
</table>]
I would like to export this to CSV while keeping the structure of the table as it is displayed on the website. Is this possible, and if so, how?
Thanks in advance for your help.
Roc*_*key 27
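Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. That works for the single case you provided, but adjustments will probably be needed for other cases :) The general idea is that once you find your table (here find is used to pull the first match), we get the headers by iterating over all of the th elements and storing their text in a list. Then we create a rows list that contains one list per table row. It is populated by finding all td elements under each tr tag, taking their text, and encoding it as UTF-8 (from Unicode). Finally, open a CSV file, write the headers first, and then write all of the rows, using (row for row in rows if row) to drop any empty rows, such as the one produced by the header tr, which contains no td elements: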
In [117]: import csv
In [118]: from bs4 import BeautifulSoup
In [119]: from urllib2 import urlopen
In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})
In [122]: headers = [header.text for header in table.find_all('th')]
In [123]: rows = []
In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....:
In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....:
In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities,
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules,
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings,
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")",
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings,
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches,
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings,
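For anyone on Python 3, the same idea can be sketched roughly as below. This is not part of the original answer: it assumes bs4 is installed, uses a small inline HTML sample (SAMPLE_HTML, made up from the first rows of the table above) instead of fetching the page, and the helper names table_to_rows and write_csv are hypothetical. It also strips surrounding whitespace from each cell, which removes the leading space visible in the first row of the output above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the downloaded page.
SAMPLE_HTML = """
<table class="table-horizontal-line">
<tr><th>Amount</th><th>Company or person fined</th><th>Date</th></tr>
<tr><td><a name="x"> </a>£4,000,000</td><td><a href="#">Credit Suisse First Boston International </a></td><td>19/12/02</td></tr>
<tr><td>£750,000</td><td><a href="#">Royal Bank of Scotland plc</a></td><td>17/12/02</td></tr>
</table>
"""


def table_to_rows(html):
    """Return (headers, rows) for the first table with the expected class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "table-horizontal-line"})
    headers = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped here
            rows.append(cells)
    return headers, rows


def write_csv(headers, rows, fileobj):
    """Write the header row followed by the data rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(headers)
    writer.writerows(rows)


headers, rows = table_to_rows(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory stand-in for a real file
write_csv(headers, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open('output_file.csv', 'w', newline='', encoding='utf-8'). In Python 3 the csv module works with text, so the explicit .encode('utf8') step and the 'wb' mode from the Python 2 session are not needed. Fetching the live page would use urllib.request.urlopen in place of urllib2.urlopen, assuming the URL is still reachable.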