Regex to split HTML tags

arr*_*obe 2 regex

I have an HTML string like this:

<img src="http://foo"><img src="http://bar">

What regex pattern would split this into two separate img tags?

tch*_*ist 7

How sure are you that your string is exactly that? What about input like this:

<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >

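What programming language is this? Is there some reason you are not using a standard HTML-parsing class for this? Regexes are a good approach only when you have an extremely well-known set of inputs. They do not work on real HTML, only on rigged demos.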
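To make the problem concrete, here is a minimal sketch in Python (an assumption, since the question does not name a language). The pattern `<img[^>]*>` is a hypothetical "obvious" attempt, not something from the question. It splits the simple string correctly but cuts the first tricky tag short at the `>` inside the quoted `alt` value.

```python
import re

# Hypothetical naive pattern: "<img", then anything up to the first ">".
NAIVE_IMG = re.compile(r"<img[^>]*>", re.IGNORECASE)

simple = '<img src="http://foo"><img src="http://bar">'
tricky = '<img alt=">"          src="http://foo"  >'

# The simple input yields the two complete tags.
simple_tags = NAIVE_IMG.findall(simple)

# The tricky input is truncated: the match stops at the ">" inside alt=">".
tricky_tags = NAIVE_IMG.findall(tricky)

print(simple_tags)
print(tricky_tags)
```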

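Even if you must use a regex, you should use a proper grammatical one. That is quite easy. I have tested the following program on zillions of web pages. It handles the cases I outlined above, plus one or two others.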

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

$/ = undef;
$_ = <>;   # read all input

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ <script> .*?  </script> }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

my $count = 0;

while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
} 

There you go. Nothing to it!

Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be handled with a regex. ☺


Viv*_*ath 5

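Don't do it with a regex. Use an HTML/XML parser. You can even run the input through Tidy first to clean it up. Most languages have a Tidy library. Which language are you using?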
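As a rough illustration of the parser approach, here is a minimal sketch using `html.parser` from Python's standard library. Python is an assumption, since the question does not say which language is in use, and the names `ImgCollector` and `split_img_tags` are hypothetical. It collects the raw source text and the parsed attributes of each img tag, including tags whose quoted values contain `<` or `>`.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects every <img> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []   # raw source text of each img tag
        self.attrs = []  # attributes of each img tag, as a dict

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing <img ... /> also lands here
        # because the default handle_startendtag delegates to this method.
        if tag == "img":
            self.tags.append(self.get_starttag_text())
            self.attrs.append(dict(attrs))


def split_img_tags(html):
    """Return (raw_tags, attribute_dicts) for all img tags in html."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.attrs


sample = (
    '<img alt=">"          src="http://foo"  >\n'
    "<img src='http://bar' alt='<'           >"
)
raw_tags, attr_dicts = split_img_tags(sample)
for raw, attrs in zip(raw_tags, attr_dicts):
    print(raw, attrs)
```

The parser tracks quoting itself, so no pattern has to anticipate where a `>` may legally appear. Running the input through Tidy beforehand, as suggested above, would be a separate step that this sketch does not cover.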