Kan*_*ngh 2 python beautifulsoup python-2.7
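I want to use Beautiful Soup to find how many times a specific word occurs in the HTML text of a web page. I tried the findAll function, but it only finds the word inside a specific tag: soup.body.findAll finds the word within the body tag, whereas I want to search for the word in all tags of the HTML text. Also, once I have found the word, I need to build a list of the words that come just before and after it. Can someone help me with how to do this? Thanks.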
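With the current BeautifulSoup 4 API, you can search for text across the whole tree using the recursive keyword (recursive=True is the default for find_all, so it is spelled out below only for clarity). The results are strings, which you can then manipulate and split into words.
Here is a complete example: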
import bs4
import re
data = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''
searched_word = 'sunny'
soup = bs4.BeautifulSoup(data, 'html.parser')
# With string=, find_all returns every text node under <body> that matches the pattern
results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)

# Note: len(results) is the number of matching text nodes, not the number of occurrences
print('Found the word "{0}" {1} times\n'.format(searched_word, len(results)))

for content in results:
    words = content.split()
    for index, word in enumerate(words):
        # If the content contains the search word twice or more, this fires for each occurrence
        if word == searched_word:
            print('Whole content: "{0}"'.format(content))
            before = None
            after = None
            # Check that it is not the first word
            if index != 0:
                before = words[index-1]
            # Check that it is not the last word
            if index != len(words)-1:
                after = words[index+1]
            print('\tWord before: "{0}", word after: "{1}"'.format(before, after))
It outputs:
Found the word "sunny" 4 times

Whole content: "today is a sunny day"
	Word before: "a", word after: "day"
Whole content: "I love when it's sunny outside"
	Word before: "it's", word after: "outside"
Whole content: "
Call me sunny
"
	Word before: "me", word after: "None"
Whole content: "sunny is a cool word sunny"
	Word before: "None", word after: "is"
Whole content: "sunny is a cool word sunny"
	Word before: "word", word after: "None"
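
The count reported above is the number of matching text nodes (4), while the word itself occurs 5 times, because the last div contains it twice. If what you need is the per-occurrence count plus flat lists of the neighbouring words, one possible approach is to flatten the page to plain text first and then tokenize it. The sketch below is only an illustration under some assumptions: word_context and Context are hypothetical names, not part of BeautifulSoup; the text is assumed to come from something like soup.get_text(separator=' '); and matching is case-insensitive on whole words.

```python
import re
from collections import namedtuple

# Hypothetical helper names for illustration; not part of BeautifulSoup.
Context = namedtuple('Context', ['count', 'before', 'after'])


def word_context(text, word):
    """Count whole-word occurrences of `word` and collect its neighbours."""
    # Keep apostrophes inside tokens so that "it's" stays a single word.
    tokens = re.findall(r"[\w']+", text)
    target = word.lower()
    count = 0
    before, after = [], []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        count += 1
        if i > 0:                    # not the first token
            before.append(tokens[i - 1])
        if i < len(tokens) - 1:      # not the last token
            after.append(tokens[i + 1])
    return Context(count, before, after)


# Assumption: in practice the text would come from the parsed page, e.g.
#   text = soup.get_text(separator=' ')
text = ("today is a sunny day "
        "I love when it's sunny outside "
        "Call me sunny "
        "sunny is a cool word sunny")

result = word_context(text, 'sunny')
print(result.count)
print(result.before)
print(result.after)
```

Because the text is flattened before tokenizing, neighbours can cross element boundaries: here the "sunny" that ends "Call me sunny" gets the first word of the next div as its "after" word. If that is not wanted, call the helper once per text node instead of once for the whole page.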