Counting the words in a web page

Question

Ret*_*ent 0 urllib urllib2 urllib3 python-3.x

I need to count the words in a web page using Python 3. Which module should I use? urllib?

Here is my code:

import urllib.request

def web():
    # Download the page and print its raw bytes
    f = urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)


Answer 1

Ced*_*olo 5

The self-explanatory code below gives you a good starting point for counting the words in a web page:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
# Name the parser explicitly; bs4 warns when it has to pick one itself
soup = BeautifulSoup(r.content, "html.parser")

# Count the words within paragraphs (trailing punctuation stripped, lowercased)
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Count the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the words from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()

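The normalisation and counter-merging steps can be tried offline on literal strings. This is only a rough sketch: `count_words`, `both_sides`, `sample_p` and `sample_div` are made-up names for illustration, not part of the answer. It shows that `rstrip(punctuation)` only trims the right-hand side of a token, so a word with a leading quote is counted separately, and that adding two `Counter` objects sums the counts per word.

```python
from collections import Counter
from string import punctuation


def count_words(text, both_sides=False):
    """Count lowercased tokens, trimming punctuation on the right only
    (as in the answer) or on both sides when both_sides is True."""
    trim = str.strip if both_sides else str.rstrip
    tokens = (trim(tok, punctuation).lower() for tok in text.split())
    return Counter(tok for tok in tokens if tok)  # drop tokens that were pure punctuation


# Hypothetical stand-ins for text extracted from a <p> and a <div>
sample_p = 'The prophet said: "the sea, the shore."'
sample_div = "The sea"

c_p = count_words(sample_p)
c_div = count_words(sample_div)
total = c_p + c_div  # Counter addition sums the counts key by key

print(total.most_common(2))
print(total['"the'])  # the token with a leading quote survives rstrip
print(count_words(sample_p, both_sides=True)["the"])
```

Whether to trim both sides is a judgment call. The answer's numbers were produced with `rstrip`, so switching to `strip` would change them slightly.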

For example, if you want the 10 most common words, you can do the following:

total.most_common(10)


In this case the output is:

In [100]: total.most_common(10)
Out[100]: 
[('the', 2097),
 ('and', 1651),
 ('of', 998),
 ('in', 625),
 ('i', 592),
 ('a', 529),
 ('to', 529),
 ('that', 426),
 ('is', 369),
 ('my', 365)]


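I found that the approach above can report inaccurate numbers, because paragraphs can sit inside divs and vice versa, so the same text is counted more than once. I am not sure how it works, but I found an interesting online tool for checking the word count of a website: https://wordcounter.net/website-word-count (2 upvotes)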
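One possible way around the double counting raised in this comment is to walk the document once and record each text node a single time, however deeply it is nested. The sketch below uses only the standard library (`html.parser` and `urllib.request`), which also speaks to the original question about urllib. The names `TextCollector`, `count_words_in_html` and `count_words_at` are hypothetical, the UTF-8 decoding is an assumption about the page, and this is not the method used in the answer.

```python
import urllib.request
from collections import Counter
from html.parser import HTMLParser
from string import punctuation


class TextCollector(HTMLParser):
    """Collect every text node exactly once, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words_in_html(html):
    """Return a Counter of lowercased words found in an HTML string."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting a word that is broken up by inline tags.
    words = (w.strip(punctuation).lower() for w in " ".join(collector.chunks).split())
    return Counter(w for w in words if w)


def count_words_at(url):
    """Fetch a page with urllib and count its words (assumes UTF-8 content)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return count_words_in_html(html)


# Offline example: the paragraph is nested in a div but counted only once
sample = "<div><p>The sea, the shore.</p><script>var the = 1;</script></div>"
print(count_words_in_html(sample).most_common(1))
```

Calling `count_words_at("https://americancivilwar.com/north/lincoln.html").most_common(10)` should then give a top-ten list for the page from the question, assuming the page is reachable and decodes as UTF-8.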
