How can I most reliably preserve HTML entities when processing HTML documents with Mojo::DOM?

Question

Dav*_*llo 6 perl movabletype html-entities mojolicious

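I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.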

I'm writing those phrases out to a file so that they can be translated into other languages, as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

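When Mojo::DOM encounters embedded HTML entities (such as `&trade;` and `&nbsp;`), it converts those entities into encoded characters rather than passing them along as written. I want the entities to be passed through as written.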
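The behavior described above can be reproduced with a small script. This is only a sketch: the sample markup is made up for illustration, and it assumes a reasonably recent Mojolicious is installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; the entities are written out literally here.
my $html = '<p>Acme&trade; widgets&nbsp;&amp; gadgets</p>';
my $dom  = Mojo::DOM->new($html);

# text() hands back decoded characters, so &trade; arrives as U+2122
# instead of the entity name that appears in the source.
my $text = $dom->at('p')->text;

printf "contains U+2122:    %s\n", ($text =~ /\x{2122}/           ? 'yes' : 'no');
printf "contains '&trade;': %s\n", (index($text, '&trade;') >= 0 ? 'yes' : 'no');
```

The first check should report the decoded character and the second should not find the literal entity, which is the loss the question is about.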

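I realize I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."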

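If that is the case, I either have to figure out how to test the encoding of the current HTML page before calling `Mojo::Util::decode('UTF-8', $page->text)`, or I have to use some other technique to preserve the encoded HTML entities.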
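One way to sidestep testing the encoding up front is to attempt the decode and fall back to the original string when it returns undef. The sketch below assumes that approach; `decode_or_keep` is a hypothetical helper name, not part of Mojo::Util.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper: decode UTF-8 bytes when possible, otherwise assume
# the string already holds Perl characters and return it unchanged.
sub decode_or_keep {
    my ($raw) = @_;
    my $chars = decode('UTF-8', $raw);   # undef when decoding fails
    return defined $chars ? $chars : $raw;
}

my $bytes = "Acme\xE2\x84\xA2";   # UTF-8 bytes for "Acme" followed by U+2122
my $chars = "Acme\x{2122}";       # the same text as already-decoded characters

# Both inputs should end up as the same character string.
print decode_or_keep($bytes) eq $chars ? "bytes decoded\n"   : "unexpected result\n";
print decode_or_keep($chars) eq $chars ? "characters kept\n" : "unexpected result\n";
```

This is a heuristic rather than a real encoding test: a character string whose code points all fall below 256 and happen to form valid UTF-8 would be decoded a second time.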

What is the most reliable way to preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

Answer 1

Dav*_*llo 0

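Through testing, my colleagues and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, which made it impossible to preserve HTML entities as written. To work around this, we added the following subroutine to double encode ampersands: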

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand characters being
    # passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)),
    # which double encodes any incoming ampersand or &amp; characters.
    #
    ##########

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}

Later in the script, we pass `$page->text` through `encode_amp()` as we instantiate a new Mojo::DOM object.

    $dom = Mojo::DOM->new(encode_amp($page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    # Note that "h2 b" is an important tag combination for capturing major headings on pages
    # in this theme. The tags "span" and "a" are also.
    #
    # We added caption and th to support tables.
    #
    # We added li and li a to support ol (ordered lists) and ul (unordered lists).
    #
    # We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
    # /sf/ask/3859161001/#comment97006305_55131737
    #
    # Original set of selectors in $dom->find() below is as follows:
    #   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')
        ->map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })
        ->map('content')->uniq->each ) {
        print_phrase($phrase);
    }

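The code block above incorporates @Grinnz's earlier suggestion, as shown in the comments on this question. Thanks also to @Robert for his answer, which made some good observations about how Mojo::DOM works.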
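To see the effect of the double encoding in isolation, here is a minimal, self-contained sketch. The sample markup is hypothetical, `encode_amp()` is repeated in condensed form so the script runs on its own, and the node-walking chain follows the one used above.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Same substitution as encode_amp() above, repeated so the sketch runs alone.
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;    # "&trade;" becomes "&amp;trade;"
    return $text;
}

# Hypothetical sample markup.
my $html = '<p>Acme&trade; <strong>widgets&nbsp;only</strong></p>';

my $dom = Mojo::DOM->new(encode_amp($html));

# Collect the content of every text node under each matched element.
my @phrases = $dom->find('p')
    ->map('descendant_nodes')->map('each')
    ->grep(sub { $_->type eq 'text' })
    ->map('content')->uniq->each;

print "[$_]\n" for @phrases;
```

Because the parser decodes `&amp;trade;` only once, the printed phrases should still contain the literal `&trade;` and `&nbsp;` from the source.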

This code definitely works for my application.

Archived:	6 years, 8 months ago
Views:	272
Last activity:	6 years, 7 months ago