Posts by Jos*_*Lee

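How do I find the table that follows a text string using BeautifulSoup in Python?

I am trying to extract data from several web pages that are inconsistent in how they display their tables. I need to write code that searches for a text string and then goes to the table immediately following that string, so that I can extract the table's contents. This is what I have so far: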

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table …

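One possible approach, sketched here with the newer `bs4` package and Python 3 rather than the BeautifulSoup 3 import used above (that switch is an assumption, as are the helper names `table_after_text` and `table_rows`, which are made up for illustration). The idea is to locate the matching text node first and then ask for the next `<table>` in document order; `\s*` in the pattern is one way to tolerate missing or extra spaces in the heading.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<html><body><p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>'
    '<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr>'
    '<tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'
)


def table_after_text(soup, pattern):
    """Hypothetical helper: first <table> after a text node matching pattern."""
    text_node = soup.find(string=pattern)  # first matching text node, or None
    if text_node is None:
        return None
    return text_node.find_next("table")  # next <table> in document order


def table_rows(table):
    """Hypothetical helper: the table's cell text as a list of rows."""
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Table\s*1", re.IGNORECASE)  # \s* tolerates spacing differences
table = table_after_text(soup, pattern)
rows = table_rows(table)
print(rows)
```

Searching for the text node directly, instead of for a `<p>` tag with a `text=` filter, avoids depending on how deeply the heading text is nested inside `<b>` and `<font>` tags.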

Tags: python, beautifulsoup

Jos*_*Lee (score: 4, answers: 1, views: 6765)

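How do I remove whitespace between HTML tags using BeautifulSoup in Python?

I have the following problem: when there is whitespace between HTML tags, my code does not output the text.

Instead of this output: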

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000


I get this:

 |salary|bonus
2005|100,000|50,000
2006|120,000|80,000


The text "year" is not output.

Here is my code:

from BeautifulSoup import BeautifulSoup
import re


html = '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td><td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

store=[]

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        try:
            row.append(''.join(td.find(text=True)))
        except Exception:
            row.append('')
    store.append('|'.join(filter(None, row)))
print '\n'.join(store)

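For comparison, a minimal sketch using the newer `bs4` package and Python 3 (an assumption; the code above targets BeautifulSoup 3 and Python 2). In the code above, `td.find(text=True)` returns only the first text node in a cell, and in the first cell that node is the single space before `<p>`. `get_text(strip=True)` instead gathers every text node in the cell, strips each one, and drops the empty results.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
    '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr>'
    '<tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
)

soup = BeautifulSoup(html, "html.parser")

lines = []
for tr in soup.find("table").find_all("tr"):
    # get_text(strip=True) joins all text nodes in the cell after stripping
    # whitespace, so the lone space before <p> no longer hides "year".
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    lines.append("|".join(cells))

output = "\n".join(lines)
print(output)
```

Unlike the `filter(None, row)` call in the original, this keeps empty cells as empty fields, so the columns stay aligned when a cell has no text.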

The problem comes from: