How to detect the language of web page content using Python

Question

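I have to test a batch of URLs to check whether each page has its translated content. Is there a way to return the language of a web page's content using Python? For example, if the page is in Chinese, it should return "Chinese".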

I tried the langdetect module, but could not get the result I wanted. The URLs are in web XML format. The content is shown below <releasehigh>

Answer 1

Here is a simple example that uses BeautifulSoup to extract the text of the HTML body and langdetect to detect its language:

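from bs4 import BeautifulSoup
from langdetect import detect

with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")

# Remove <script> elements so their code is not mistaken for page text
for script in soup("script"):
    script.decompose()

# Fall back to the whole document if there is no <body>
body_text = (soup.body or soup).get_text()
print(detect(body_text))  # prints a language code such as "zh-cn"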
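
Note that detect returns a language code such as "zh-cn" or "en", not a name like "Chinese". If you need the name, one option is a small lookup table. The following is only a rough sketch: LANGUAGE_NAMES and language_name are hypothetical names made up for illustration, and the table covers just a handful of the codes langdetect can return, so extend it for the languages you expect.

```python
# Hypothetical lookup table (not part of langdetect): a few of the codes
# that langdetect can return, mapped to English language names.
LANGUAGE_NAMES = {
    "zh-cn": "Chinese",  # Simplified Chinese
    "zh-tw": "Chinese",  # Traditional Chinese
    "en": "English",
    "fr": "French",
    "de": "German",
    "ja": "Japanese",
    "ko": "Korean",
    "es": "Spanish",
}


def language_name(code):
    """Return an English name for a language code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code.lower(), code)
```

With this in place, language_name(detect(body_text)) would give "Chinese" for a Chinese page, and returns the raw code unchanged for any language missing from the table.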
