What is the most reliable way to preserve HTML entities when processing HTML documents with Mojo::DOM?

Dav*_*llo 6 perl movabletype html-entities mojolicious

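I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.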

I'm writing those phrases out to a file so that they can be translated into other languages, as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

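When Mojo::DOM encounters embedded HTML entities (such as &trade; for ™), it converts them into the corresponding characters instead of passing them along as written. I want the entities to be passed through as written.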

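I realize that I could use Mojo::Util::decode to pass these HTML entities through to the file that I'm writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."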

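If that is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.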
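One way to sidestep the encoding test is to attempt the decode and fall back to the original string when it fails. The sketch below assumes Mojolicious is installed; to_characters is a hypothetical helper name, not part of Mojo::Util. It is only a heuristic: a string that has already been decoded but happens to look like valid UTF-8 bytes would be decoded a second time.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper: return Perl characters whether or not the input
# is still a string of UTF-8 encoded bytes.
sub to_characters {
    my ($input) = @_;
    return '' unless defined $input;

    # Mojo::Util::decode returns undef when the input is not valid UTF-8,
    # for example because it has already been decoded.
    my $chars = decode('UTF-8', $input);
    return defined $chars ? $chars : $input;
}

my $from_bytes = to_characters("na\xC3\xAFve");   # UTF-8 encoded bytes
my $from_chars = to_characters("na\x{ef}ve");     # already Perl characters

# Both results hold the same five characters
print length($from_bytes), ' ', length($from_chars), "\n";
```

With a guard like this, Mojo::DOM->new(to_characters($page->text)) would not receive undef regardless of which form $page->text returns.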

What is the most reliable way to preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

Dav*_*llo 0

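Through testing, my colleagues and I were able to determine that Mojo::DOM->new() automatically decodes HTML entities, including encoded ampersands (&amp;), which makes it impossible to preserve HTML entities as written. To work around this, we added the following subroutine to double-encode ampersands: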

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand characters being passed
    # into Mojo::DOM->new() to keep HTML entities from being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)),
    # which double-encodes any incoming ampersand (& or &amp;) characters.
    #
    ##########

    $text .= '';               # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;      # HTML-encode ampersand characters
    return $text;
}
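The effect of the double-encoding can be checked on a small made-up fragment. This is a minimal sketch, assuming Mojolicious is installed; the sample HTML and the variable names are illustrative only, and encode_amp is repeated in compact form so that the snippet runs on its own.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Compact version of the subroutine above: escape every ampersand once more
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $html = '<p>Acme&trade; &amp; Sons</p>';

# Parsed directly, the entities come back as decoded characters
my $decoded = Mojo::DOM->new($html)->at('p')->text;

# Parsed after encode_amp(), the parser's single unescape pass
# restores the entities exactly as they were written
my $kept = Mojo::DOM->new(encode_amp($html))->at('p')->text;

print "$kept\n";
```

Note that the extra escaping applies to the whole document, so attribute values read from the same DOM would also keep their entities in escaped form.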

Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object:

    $dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# /sf/ask/3859161001/#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

    print FILE "\n\t### Body\n\n";        

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           

        print_phrase($phrase);

    }
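To see what the descendant_nodes chain yields compared with map('text'), here is a small sketch on a made-up fragment (the HTML and variable names are illustrative, and Mojolicious is assumed to be installed). Each text node is returned as its own phrase, and uniq drops the duplicate that arises when a nested element is reached through more than one selector.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>Hello <strong>brave</strong> world</p>');

my @phrases = $dom->find('p, p strong')
    ->map('descendant_nodes')             # one collection of nodes per match
    ->map('each')                         # flatten into a single collection
    ->grep(sub { $_->type eq 'text' })    # keep text nodes only
    ->map('content')                      # text of each node
    ->uniq                                # "brave" is reached via both selectors
    ->each;

print "[$_]\n" for @phrases;
```

Because uniq compares strings, a phrase that legitimately occurs twice on a page is also written out only once, which may or may not be what a translation workflow wants.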

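The block of code above incorporates the earlier suggestion from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which made a good observation about how Mojo::DOM works.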

This code definitely works for my application.