Related troubleshooting solutions (0)

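Removing an HTML block in Python

I would like to know whether there is a Python library, or some other way, to extract elements from an HTML document. For example:

I have this document: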

<html>
      <head>
          ...
      </head>
      <body>
          <div>
           ...
          </div>
      </body>
</html>


I want to remove the <div></div> tag block, along with its contents, from the document, so that it looks like this:

<html>
    <head>
     ...
    </head>
    <body>
    </body>
</html>
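
One way to approach the removal described above, sketched with only the standard library's html.parser module (Python 3 is assumed). The names BlockRemover and remove_blocks are hypothetical, chosen for this illustration. The parser re-emits everything it sees except the contents of the chosen tag, and an unclosed <div> would swallow the rest of the document, so treat this as a starting point rather than a robust tool. Third-party parsers such as BeautifulSoup offer the same operation more directly, for example through a tag's decompose() method.

```python
from html.parser import HTMLParser


class BlockRemover(HTMLParser):
    """Re-emit an HTML document while skipping every <tag>...</tag> block."""

    def __init__(self, tag):
        # convert_charrefs=False leaves entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.depth = 0    # greater than 0 while inside a block being removed
        self.parts = []

    def _emit(self, text):
        if self.depth == 0:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing form such as <br/>
        if tag != self.tag:
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth = max(self.depth - 1, 0)
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def remove_blocks(html, tag="div"):
    """Return html with every block of the given tag (and its contents) removed."""
    parser = BlockRemover(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


doc = "<html><head></head><body><div><p>gone</p></div></body></html>"
print(remove_blocks(doc))
```

Nested blocks of the same tag are handled by the depth counter. Whitespace that surrounded the removed block is left in place.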


html python parsing

Jef*_*onM

2016-08-03

3
votes

1
answer

4154
views

Python: remove all HTML tags from a string

I am trying to access the article content of a website using BeautifulSoup, with the following code:

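import urllib2
from bs4 import BeautifulSoup

site = 'http://www.example.com'  # full URL, including the scheme
req = urllib2.Request(site)  # build the request passed to urlopen
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
content = soup.find_all('p')  # list of every <p> element
content = str(content)  # the string form still contains the tags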


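The content object contains all the main text inside the page's "p" tags, but other tags are still present in the output, as shown in the image below. I want to remove the matching <> tag pairs and all characters enclosed in the tags themselves, so that only the text is left.

I tried the following, but it does not seem to work.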

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))


What is the best way to remove substrings from a string that start and end with a given pattern, such as < and >?
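
The whitespace-splitting attempt fails because tokens such as `<p>First` begin with a tag but do not end with one, so they are never filtered out. A minimal sketch of an alternative, assuming the bs4 package is installed: ask each matched element for its text with get_text() instead of converting the result list to a string. The sample HTML is made up for the illustration.

```python
from bs4 import BeautifulSoup

# made-up sample standing in for the downloaded page
html = (
    "<html><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second <a href='#'>link</a>.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text nodes of each <p>, dropping nested tags
paragraphs = [p.get_text() for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text)
```

Calling get_text() per element keeps the paragraph boundaries, which are lost if the whole list is stringified first.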

html python string parsing beautifulsoup

Mus*_*ger

2
votes

2
answers

20k
views

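HTML-to-text conversion using only the Python standard library

I am looking for the best way to convert HTML to text using modules from the Python 2.7.x standard library (i.e., no BeautifulSoup, etc.).

By HTML-to-text conversion, I mean the moral equivalent of lynx -dump. In fact, it would be enough to intelligently strip the HTML tags and convert all HTML entities to ASCII (or to UTF-8-encoded unicode).

No regex-based answers, please. (Regexes are not up to the task.)

Thanks!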
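
A minimal sketch of a tag-stripping converter built on the standard library's HTML parser. It is written for Python 3 (html.parser), which is an assumption that departs from the question: in Python 2.7 the module is named HTMLParser and has no convert_charrefs option, so entities would need separate handling there. TextExtractor, html_to_text, and the choice of block-level tags are hypothetical, and the output is far simpler than what lynx -dump produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br" and not self.skip_depth:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif tag in self.BLOCK and not self.skip_depth:
            # end of a block-level element: start a new output line
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of html, one line per block-level element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # collapse whitespace inside each line and drop empty lines
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "<p>Fish &amp; chips</p><script>var x = 1;</script><p>caf&eacute;</p>"
print(html_to_text(sample))
```

The result is a Unicode string; encode it as UTF-8 when writing it out if bytes are needed.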

python standard-library html-parsing html-to-text

kjo*_*kjo

1
vote

1
answer

1162
views

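How can I extract all sections of a Wikipedia page as plain text?

I have the following Python code, which extracts only the introduction of the article on "Artificial intelligence", whereas I want to extract all of the subsections (History, Goals, ...).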

import requests

def get_wikipedia_page(page_title):
  endpoint = "https://en.wikipedia.org/w/api.php"
  params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": page_title
  }
  response = requests.get(endpoint, params=params)
  data = response.json()
  pages = data["query"]["pages"]
  page_id = list(pages.keys())[0]
  return pages[page_id]["extract"]

page_title = "Artificial intelligence"
wikipedia_page = get_wikipedia_page(page_title)


Someone suggested another approach: parse the HTML and convert it to text with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract() …
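
A hedged sketch of how the API-based version might be extended to cover the whole article. In the TextExtracts API, "exintro" limits the extract to the content before the first section, so the parameters below simply omit it. The splitting step assumes that section headings in the plain-text extract appear as `== Title ==` lines (the "wiki" section format, which is believed to be the default). The names build_extract_params and split_sections, the "Introduction" label, and the sample text are all hypothetical. The network call is left out so that the block runs offline.

```python
import re


def build_extract_params(page_title):
    """Query parameters for a full-page plain-text extract (no "exintro")."""
    return {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "explaintext": "",
        "titles": page_title,
    }


# a heading line such as "== History ==" or "=== Precursors ==="
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(extract):
    """Split a plain-text extract into (heading, body) pairs."""
    sections = []
    heading, start = "Introduction", 0  # label for the lead text is an assumption
    for match in HEADING.finditer(extract):
        sections.append((heading, extract[start:match.start()].strip()))
        heading, start = match.group(2), match.end()
    sections.append((heading, extract[start:].strip()))
    return sections


# made-up text standing in for pages[page_id]["extract"]
sample = (
    "AI is intelligence shown by machines.\n\n\n"
    "== History ==\nEarly work.\n\n\n"
    "=== Precursors ===\nMyths.\n\n\n"
    "== Goals ==\nReasoning."
)
for heading, body in split_sections(sample):
    print(heading, "->", body)
```

The dictionary returned by build_extract_params can be passed as params to requests.get, exactly as in the question's get_wikipedia_page function.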


python wikipedia-api

bli*_*yes

2022-12-18

0
votes

1
answer

1904
views