我有一个像这样的HTML字符串:
<img src="http://foo"><img src="http://bar">
Run Code Online (Sandbox Code Playgroud)
将此分成两个独立的img标签的正则表达式模式是什么?
如何确定是你,你的字符串是究竟是什么?输入如下:
<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >
Run Code Online (Sandbox Code Playgroud)
这是什么编程语言?您是否有一些理由不使用标准的HTML解析类来处理这个问题?当您拥有一组非常着名的输入时,正则表达式只是一种很好的方法.它们不适用于真正的HTML,仅适用于操纵演示.
即使你必须使用正则表达式,你也应该使用正确的语法.这很容易.我已在万亿网页上测试了以下程序.它照顾我上面概述的案例 - 以及一两个其他案例.
#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;
my $img_rx = qr{
    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )
    # remainder is pure declaration
    (?(DEFINE)
        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )
        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )
        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?"ed_value)
              | (?&unquoted_value)
            )
        )
        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )
        (?<illegal_attribute> \b \w+ \b )
        (?<required_attribute>
            alt
          | src
        )
        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )
        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.
        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )
        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )
        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )
        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )
        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )
        (?<might_white>     \s *   )
        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )
        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    )
}six;
$/ = undef;
$_ = <>;   # read all input
# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 
s{ <script> .*?  </script> }{}gsix; 
s{ <!--     .*?        --> }{}gsx;
my $count = 0;
while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
} 
Run Code Online (Sandbox Code Playgroud)
你去吧 什么都没有!
哎呀,为什么你会永远想使用的HTML解析类,给出了如何轻松地HTML可以在正则表达式来处理.☺