Dav*_*llo 6 perl movabletype html-entities mojolicious
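I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.
I'm writing those phrases out to a file so they can be translated into other languages, as follows: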
$dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
    print_phrase($phrase); # utility function to write out the phrase to a file
}
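When Mojo::DOM encounters embedded HTML entities (such as &trade; and &nbsp;), it converts those entities into encoded characters instead of passing them along as written. I want the entities to be passed through as written.
I recognize that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."
If this is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.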
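One way to sidestep testing the encoding up front is to attempt the decode and fall back to the original string when it returns undef. The following is only a rough sketch of that idea: decode_if_utf8 is a hypothetical helper name, not part of Mojolicious, and it assumes that a failed decode means the text is already Perl characters.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper (not part of Mojolicious): decode UTF-8 bytes when
# possible, otherwise assume the text is already Perl characters.
sub decode_if_utf8 {
    my ($text) = @_;
    my $chars = decode('UTF-8', $text);    # undef when decoding fails
    return defined $chars ? $chars : $text;
}

# UTF-8 bytes for "café" decode to four characters.
my $from_bytes = decode_if_utf8("caf\xc3\xa9");

# A string that already holds a wide character (the trademark sign)
# cannot be decoded, so it is returned unchanged.
my $from_chars = decode_if_utf8("\x{2122}");

printf "%d %d\n", length $from_bytes, length $from_chars;
```

This is a heuristic rather than a real encoding test: a string that is already decoded but happens to look like valid UTF-8 (plain ASCII, for instance) will still pass through decode without complaint.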
How can I most reliably preserve encoded HTML entities when processing HTML documents with Mojo::DOM?
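Through testing, my colleagues and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, which made it impossible to preserve HTML entities as written. To get around this, we added the following subroutine to double-encode ampersands: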
sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    ##########

    $text .= '';            # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;   # HTML encode ampersand characters
    return $text;
}
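To illustrate the effect, here is a small self-contained sketch that parses the same markup with and without the extra encoding step. The sample markup is made up for illustration, and encode_amp() is repeated only so the snippet runs on its own.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Same idea as encode_amp() above, repeated so the sketch runs on its own.
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;    # "&trade;" becomes "&amp;trade;"
    return $text;
}

# Hypothetical sample markup, not taken from the real pages.
my $html = '<p>Acme&trade; Widgets&nbsp;Inc.</p>';

# Parsed directly: the entities come back as decoded characters.
my $decoded = Mojo::DOM->new($html)->at('p')->text;

# Parsed after double encoding: the parser undoes one level of escaping,
# so the entities come back as they were written.
my $preserved = Mojo::DOM->new(encode_amp($html))->at('p')->text;

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
print "$preserved\n";    # Acme&trade; Widgets&nbsp;Inc.
```

Because every ampersand is escaped, an entity that was written as &amp; in the source also comes back as &amp; rather than as a bare ampersand.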
Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.
$dom = Mojo::DOM->new(encode_amp($page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# /sf/ask/3859161001/#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
# 'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
    map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {
    print_phrase($phrase);
}
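For readers wondering what the descendant_nodes chain changes compared with map('text'), the following sketch runs both on a tiny made-up fragment. The markup and the shortened selector list are assumptions for illustration only.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup with an inline element inside a paragraph.
my $dom = Mojo::DOM->new('<p>Hello <strong>bold</strong> world</p>');

# map('text') returns one string per matched element, built only from the
# text nodes that are direct children of that element.
my @by_text = $dom->find('p, p strong')->map('text')->each;

# Walking the descendant nodes and keeping only text nodes yields each run
# of text separately; uniq drops the copy reached through both selectors.
my @by_node = $dom->find('p, p strong')
    ->map('descendant_nodes')->map('each')
    ->grep(sub { $_->type eq 'text' })
    ->map('content')->uniq->each;

print scalar(@by_text), " phrases via text\n";
print scalar(@by_node), " phrases via descendant text nodes\n";
```

Note that uniq compares the strings themselves, so a phrase that legitimately appears twice on a page is written out only once.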
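The code block above incorporates the earlier suggestions from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which made a good observation about how Mojo::DOM works.
This code definitely works for my application.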