HTML to Markdown with html2text

Question

Qro*_*rom 5 html python markdown parsing

I can successfully convert some HTML to Markdown in Python using the html2text library. The code looks like this:

import html2text

def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()

    # Disable line wrapping and resolve relative links against `url`
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url

    md_text = h.handle(html_text)

    return md_text

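This works well so far, but it is limited: I have not found any way to customize the output.

Actually, I don't need much customization. I only need the HTML tag <span class="searched_found">example text</span> to be converted to Markdown as whatever I supply, for example +example text+.

So I am looking for a solution to this problem. html2text is a good library that lets me configure some options, such as the link options shown above, so a solution based on this library would be ideal.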

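Update:

I have a solution using the BeautifulSoup library, but I consider it a temporary patch because it adds another dependency and a lot of unnecessary processing. What I do here is edit the HTML before converting it to Markdown: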

from bs4 import BeautifulSoup

def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser: the input HTML varies a lot, and "lxml"
    # tends to drop content when parsing very large HTML that contains errors
    soup = BeautifulSoup(html_text, "html.parser")

    # Find all <span class="searched_found">...</span> tags
    for tag in soup.find_all('span', class_="searched_found"):
        # Surround the span with the delimiter; unlike tag.string, this
        # also works when the span contains nested tags
        tag.insert_before(delimiter)
        tag.insert_after(delimiter)
        tag.unwrap()  # Remove the tag itself and keep its contents

    html_text = str(soup)  # Python 3 replacement for unicode(soup)

    return mark_down_formatting(html_text, url)

For very long HTML content this proves rather slow, because the HTML is parsed twice (once by BeautifulSoup and then by html2text).
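
If the extra dependency is the main concern, the same pre-processing step can be sketched with the standard library's `html.parser` alone. This is only an illustration under assumptions: the class `SpanDelimiterRewriter` and the helper `rewrite_spans` are hypothetical names, the HTML is still parsed twice (here and again by html2text), and no claim is made about how its speed compares with BeautifulSoup.

```python
from html.parser import HTMLParser


class SpanDelimiterRewriter(HTMLParser):
    """Re-emit HTML as-is, replacing <span class="searched_found"> wrappers
    with a plain-text delimiter."""

    def __init__(self, delimiter="+", css_class="searched_found"):
        # convert_charrefs=False keeps entities such as &amp; verbatim
        super().__init__(convert_charrefs=False)
        self.delimiter = delimiter
        self.css_class = css_class
        self.out = []
        self.span_stack = []  # one bool per open <span>: was it matched?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            matched = self.css_class in classes
            self.span_stack.append(matched)
            if matched:
                self.out.append(self.delimiter)
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.out.append(self.delimiter)
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_spans(html_text, delimiter="+"):
    parser = SpanDelimiterRewriter(delimiter)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The result of `rewrite_spans(html_text, delimiter)` could then be passed to `mark_down_formatting` in place of the BeautifulSoup output. The delimiter is inserted as raw text, so it should not contain characters such as `<` or `&`.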

Answer 1

Mil*_*kus 4

markdownify can help.

markdownify uses BeautifulSoup for parsing:

soup = BeautifulSoup(html, 'html.parser')

The conversion can be customized:

import markdownify

# References:
# https://stackoverflow.com/questions/45034227/html-to-markdown-with-html2text
# https://beautiful-soup-4.readthedocs.io/en/latest/#multi-valued-attributes
# https://beautiful-soup-4.readthedocs.io/en/latest/#contents-and-children

class CustomMarkdownConverter(markdownify.MarkdownConverter):
    # The arguments after `text` differ between markdownify versions,
    # so they are accepted generically and passed through unchanged
    def convert_a(self, el, text, *args, **kwargs):
        class_list = el.get("class")
        if class_list and "searched_found" in class_list:
            # custom transformation:
            # unwrap <a class="searched_found">; `text` already holds
            # the converted child nodes, so it is returned as-is
            return text
        # default transformation
        return super().convert_a(el, text, *args, **kwargs)

# Create shorthand method for conversion
def md4html(html, **options):
    return CustomMarkdownConverter(**options).convert(html)

md = md4html("""<a class="searched_found"><b>hello</b> world</a>""")


Archived:	8 years, 11 months ago
Views:	1636
Last activity:	8 years, 11 months ago