Regular expression pattern not matching anywhere in string

Sal*_*man 176 html regex parsing

I am trying to match <input> fields of type "hidden" using this pattern:

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

This is the sample form data:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

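But I am not sure that the type, name, and value attributes will always appear in the same order. If the type attribute comes last, the match will fail because in my pattern it is at the start.

Question:
How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?

P.S.: By the way, I am using the Adobe AIR-based RegEx Desktop Tool to test regular expressions.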

tch*_*ist 675

Oh Yes You Can Use Regexes to Parse HTML!

For the task you are attempting, regexes are perfectly fine!

It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.

But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don’t you believe them.

So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn’t mean it should be.

You must decide for yourself whether you’re up to the task of writing what amounts to a dedicated, special-purpose HTML parser out of regexes. Most people are not.

But I am. :)


General Regex-Based HTML Parsing Solutions

First I’ll show how easy it is to parse arbitrary HTML with regexes. The full program’s at the end of this posting, but the heart of the parser is:

for (;;) {
  given ($html) {
    last                    when (pos || 0) >= length;
    printf "\@%d=",              (pos || 0);
    print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
    print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
    print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
    print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
    print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
    print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
    print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
    print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
    print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
    print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
    print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
    default {
      die "UNCLASSIFIED: " .
        substr($_, pos || 0, (length > 65) ? 65 : length);
    }
  }
}

See how easy that is to read?

As written, it identifies each piece of HTML and tells where it found that piece. You could easily modify it to do whatever else you want with any given type of piece, or for more particular types than these.
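
For readers who do not use Perl, the same loop shape can be sketched in Python. This is only an illustration under loud assumptions: the token patterns below are hypothetical and far cruder than the answer's `$RX_SUBS` definitions, and `chunk_html` is an invented name. The point is the technique: try each pattern in priority order at the current offset (`pattern.match(text, pos)` plays the role of Perl's `\G`), report what matched and where, and fail loudly on anything unclassified.

```python
import re

# Hypothetical, heavily simplified token patterns, tried in this order.
TOKEN_PATTERNS = [
    ("doctype", re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)),
    ("comment", re.compile(r"<!--.*?-->", re.DOTALL)),
    ("untag",   re.compile(r"</[A-Za-z][\w:-]*\s*>")),
    ("tag",     re.compile(r"<[A-Za-z][^<>]*>")),
    ("text",    re.compile(r"[^<]+")),
]

def chunk_html(html):
    """Yield (offset, kind) for each lexical chunk of html."""
    pos = 0
    while pos < len(html):
        for kind, pattern in TOKEN_PATTERNS:
            # match() with a pos argument only succeeds if the match starts at pos.
            m = pattern.match(html, pos)
            if m:
                yield pos, kind
                pos = m.end()
                break
        else:
            # No pattern applied at this offset: refuse to guess.
            raise ValueError("UNCLASSIFIED: " + html[pos:pos + 65])
```

Calling `list(chunk_html('<!DOCTYPE html><p class="x">hi</p><!-- c -->'))` yields one `(offset, kind)` pair per chunk. Unlike the Perl original, this sketch knows nothing about scripts, styles, CDATA, or quoted `>` characters inside attribute values.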

I have no failing test cases (left :): I’ve successfully run this code on more than 100,000 HTML files — every single one I could quickly and easily get my hands on. Beyond those, I’ve also run it on files specifically constructed to break naïve parsers.

This is not a naïve parser.

Oh, I’m sure it isn’t perfect, but I haven’t managed to break it yet. I figure that even if something did, the fix would be easy to fit in because of the program’s clear structure. Even regex-heavy programs should have structure.

Now that that’s out of the way, let me address the OP’s question.

Demo of Solving the OP’s Task Using Regexes

The little html_input_rx program I include below produces the following output, so that you can see that parsing HTML with regexes works just fine for what you wish to do:

% html_input_rx Amazon.com-_Online_Shopping_for_Electronics,_Apparel,_Computers,_Books,_DVDs_\&_more.htm 
input tag #1 at character 9955:
       class => "searchSelect"
          id => "twotabsearchtextbox"
        name => "field-keywords"
        size => "50"
       style => "width:100%; background-color: #FFF;"
       title => "Search for"
        type => "text"
       value => ""

input tag #2 at character 10335:
         alt => "Go"
         src => "http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
        type => "image"

Parse Input Tags, See No Evil Input

Here’s the source for the program that produced the output above.

#!/usr/bin/env perl
#
# html_input_rx - pull out all <input> tags from (X)HTML src
#                  via simple regex processing
#
# Tom Christiansen <tchrist@perl.com>
# Sat Nov 20 10:17:31 MST 2010
#
################################################################

use 5.012;

use strict;
use autodie;
use warnings FATAL => "all";    
use subs qw{
    see_no_evil
    parse_input_tags
    input descape dequote
    load_patterns
};    
use open        ":std",
          IN => ":bytes",
         OUT => ":utf8";    
use Encode qw< encode decode >;

    ###########################################################

                        parse_input_tags 
                           see_no_evil 
                              input  

    ###########################################################

until eof(); sub parse_input_tags {
    my $_ = shift();
    our($Input_Tag_Rx, $Pull_Attr_Rx);
    my $count = 0;
    while (/$Input_Tag_Rx/pig) {
        my $input_tag = $+{TAG};
        my $place     = pos() - length ${^MATCH};
        printf "input tag #%d at character %d:\n", ++$count, $place;
        my %attr = ();
        while ($input_tag =~ /$Pull_Attr_Rx/g) {
            my ($name, $value) = @+{ qw< NAME VALUE > };
            $value = dequote($value);
            if (exists $attr{$name}) {
                printf "Discarding dup attr value '%s' on %s attr\n",
                    $attr{$name} // "<undef>", $name;
            } 
            $attr{$name} = $value;
        } 
        for my $name (sort keys %attr) {
            printf "  %10s => ", $name;
            my $value = descape $attr{$name};
            my  @Q; given ($value) {
                @Q = qw[  " "  ]  when !/'/ && !/"/;
                @Q = qw[  " "  ]  when  /'/ && !/"/;
                @Q = qw[  ' '  ]  when !/'/ &&  /"/;
                @Q = qw[ q( )  ]  when  /'/ &&  /"/;
                default { die "NOTREACHED" }
            } 
            say $Q[0], $value, $Q[1];
        } 
        print "\n";
    } 

}

sub dequote {
    my $_ = $_[0];
    s{
        (?<quote>   ["']      )
        (?<BODY>    
          (?s: (?! \k<quote> ) . ) * 
        )
        \k<quote> 
    }{$+{BODY}}six;
    return $_;
} 

sub descape {
    my $string = $_[0];
    for my $_ ($string) {
        s{
            (?<! % )
            % ( \p{Hex_Digit} {2} )
        }{
            chr hex $1;
        }gsex;
        s{
            & \043 
            ( [0-9]+ )
            (?: ; 
              | (?= [^0-9] )
            )
        }{
            chr     $1;
        }gsex;
        s{
            & \043 x
            ( \p{ASCII_HexDigit} + )
            (?: ; 
              | (?= \P{ASCII_HexDigit} )
            )
        }{
            chr hex $1;
        }gsex;

    }
    return $string;
} 

sub input { 
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <> };  
    my $encoding = "iso-8859-1";  # web default; wish we had the HTTP headers :(
    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};
        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv ) 
            (?&name) 
            (?&equals) 
            (?= (?&quote)? content-type )
            (?&value)    
        }six;
        next unless $meta =~ m{             $RX_SUBS
            (?= content ) (?&name) 
                          (?&equals) 
            (?<CONTENT>   (?&value)    )
        }six;
        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset ) (?&name) 
                          (?&equals) 
            (?<CHARSET>   (?&value)    )
        }six;
        if (lc $encoding ne lc $+{CHARSET}) {
            say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    } 
    return decode($encoding, $_);
}

sub see_no_evil {
    my $_ = shift();

    s{ <!    DOCTYPE  .*?         > }{}sx; 
    s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

    s{ <script> .*?  </script> }{}gsix; 
    s{ <!--     .*?        --> }{}gsx;

    return $_;
}

sub load_patterns { 

    our $RX_SUBS = qr{ (?(DEFINE)
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<start_tag>  < (?&might_white) )
        (?<end_tag>          
            (?&might_white)
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
             )
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    ) }six; 

    our $Meta_Tag_Rx = qr{                          $RX_SUBS 
        (?<META> 
            (?&start_tag) meta \b
            (?:
                (?&might_white) (?&nv_pair) 
            ) +
            (?&end_tag)
        )
    }six;

    our $Pull_Attr_Rx = qr{                         $RX_SUBS
        (?<NAME>  (?&name)      )
                  (?&equals) 
        (?<VALUE> (?&value)     )
    }six;

    our $Input_Tag_Rx = qr{                         $RX_SUBS 

        (?<TAG> (?&input_tag) )

        (?(DEFINE)

            (?<input_tag>
                (?&start_tag)
                input
                (?&might_white) 
                (?&attributes) 
                (?&might_white) 
                (?&end_tag)
            )

            (?<attributes>
                (?: 
                    (?&might_white) 
                    (?&one_attribute) 
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white) 
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute> 
                (?: (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
            # for LEGAL parse only, comment out next line 
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute>  (?&name) )

            (?<required_attribute> (?#no required attributes) )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                  accept
                | alt
                | bottom
                | check box
                | checked
                | disabled
                | file
                | hidden
                | image
                | max length
                | middle
                | name
                | password
                | radio
                | read only
                | reset
                | right
                | size
                | src
                | submit
                | text
                | top
                | type
                | value
            )

            (?<deprecated_attribute>
                  align
            )

            (?<standard_attribute>
                  access key
                | class
                | dir
                | ltr
                | id
                | lang
                | style
                | tab index
                | title
                | xml:lang
            )

            (?<event_attribute>
                  on blur
                | on change
                | on click
                | on dbl   click
                | on focus
                | on mouse down
                | on mouse move
                | on mouse out
                | on mouse over
                | on mouse up
                | on key   down
                | on key   press
                | on key   up
                | on select
            )
        )
    }six;

}

UNITCHECK {
    load_patterns();
} 

END {
    close(STDOUT) 
        || die "can't close stdout: $!";
} 

There you go! Nothing to it! :)

Only you can judge whether your skill with regexes is up to any particular parsing task. Everyone’s level of skill is different, and every new task is different. For jobs where you have a well-defined input set, regexes are obviously the right choice, because it is trivial to put some together when you have a restricted subset of HTML to deal with. Even regex beginners should be able to handle those jobs with regexes. Anything else is overkill.

However, once the HTML starts becoming less nailed down, once it starts to ramify in ways you cannot predict but which are perfectly legal, once you have to match more different sorts of things or with more complex dependencies, you will eventually reach a point where you have to work harder to effect a solution that uses regexes than you would have to using a parsing class. Where that break-even point falls depends again on your own comfort level with regexes.

So What Should I Do?

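I’m not going to tell you what you must do or what you cannot do. I think that’s wrong. I just want to present you with possibilities and open your eyes a bit. You get to choose what you want to do and how you want to do it. There are no absolutes, and nobody else knows your own situation as well as you yourself do. If something seems like it’s too much work, well, maybe it is. Programming should be fun, you know. If it isn’t, you may be doing it wrong.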

One can look at my html_input_rx program in any number of valid ways. One such is that you indeed can parse HTML with regular expressions. But another is that it is much, much, much harder than almost anyone ever thinks it is. This can easily lead to the conclusion that my program is a testament to what you should not do, because it really is too hard.

I won’t disagree with that. Certainly if everything I do in my program doesn’t make sense to you after some study, then you should not be attempting to use regexes for this kind of task. For specific HTML, regexes are great, but for generic HTML, they’re tantamount to madness. I use parsing classes all the time, especially if it’s HTML I haven’t generated myself.

Regexes optimal for small HTML parsing problems, pessimal for large ones

Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML (which is OK, because I kinda meant for it to be that), it still should be an eye-opener so more people break the terribly common and nasty, nasty habit of writing unreadable, unstructured, and unmaintainable patterns.

Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them.
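
The same discipline carries over to other regex flavors. As a rough sketch (not the author's code, and with every name below invented for illustration), Python's `re.VERBOSE` flag allows whitespace and comments inside a pattern, and named groups plus small named building blocks give a pattern the kind of structure the Perl program gets from `(?(DEFINE)...)`. The attribute grammar here is a simplifying assumption, not the HTML specification.

```python
import re

# Hypothetical building blocks, named for what they match.
NAME   = r"[A-Za-z][\w:-]*"
QUOTED = r"\"[^\"]*\" | '[^']*'"
BARE   = r"[^\s\"'=<>`]+"

ATTRIBUTE = re.compile(rf"""
    (?P<name>  {NAME} )                 # attribute name
    \s* = \s*                           # equals sign with optional whitespace
    (?P<value> {QUOTED} | {BARE} )      # quoted or unquoted value
""", re.VERBOSE)

def unquote(value):
    """Strip one matching pair of surrounding quotes, if present."""
    if len(value) >= 2 and value[0] in "\"'" and value[0] == value[-1]:
        return value[1:-1]
    return value

def parse_attributes(tag):
    """Return {name: value} for the attributes of a single tag string."""
    return {m["name"].lower(): unquote(m["value"])
            for m in ATTRIBUTE.finditer(tag)}
```

Because each attribute is matched on its own, the order of `type`, `name`, and `value` inside the tag stops mattering. Later duplicates silently overwrite earlier ones in this sketch, whereas the Perl program reports them.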

Phenomenally Exquisite Regex Language

I’ve been asked to point out that my proffered solution to your problem has been written in Perl. Are you surprised? Did you not notice? Is this revelation a bombshell?

I must confess that I find this request bizarre in the extreme, since anybody who can’t figure that out from looking at the very first line of my program surely has other mental disabilities as well.

It is true that not all other tools and programming languages are quite as convenient, expressive, and powerful when it comes to regexes as Perl is. There’s a big spectrum out there, with some being more suitable than others. In general, the languages that have expressed regexes as part of the core language instead of as a library are easier to work with. I’ve done nothing with regexes that you couldn’t do in, say, PCRE, although you would structure the program differently if you were using C.

Eventually other languages will catch up with where Perl is now in terms of regexes. I say this because back when Perl started, nobody else had anything like Perl’s regexes. Say anything you like, but this is where Perl clearly won: everybody copied Perl’s regexes albeit at varying stages of their development. Perl pioneered almost (not quite all, but almost) everything that you have come to rely on in modern patterns today, no matter what tool or language you use. So eventually the others will catch up.

But they’ll only catch up to where Perl was sometime in the past, just as it is now. Everything advances. In regexes if nothing else, where Perl leads, others follow. Where will Perl be once everybody else finally catches up to where Perl is now? I have no idea, but I know we too will have moved. Probably we’ll be closer to Perl 6’s style of crafting patterns.

If you like that kind of thing but would like to use it in Perl 5, you might be interested in Damian Conway’s wonderful Regexp::Grammars module. It’s completely awesome, and makes what I’ve done here in my program seem just as primitive as mine makes the patterns that people cram together without whitespace or alphabetic identifiers. Check it out!


Simple HTML Chunker

Here is the complete source to the parser I showed the centerpiece from at the beginning of this posting.

I am not suggesting that you should use this over a rigorously tested parsing class. But I am tired of people pretending that nobody can parse HTML with regexes just because they can’t. You clearly can, and this program is proof of that assertion.

Sure, it isn’t easy, but it is possible!

And trying to do so is a terrible waste of time, because good parsing classes exist which you should use for this task. The right answer to people trying to parse arbitrary HTML is not that it is impossible. That is a facile and disingenuous answer. The correct and honest answer is that they shouldn’t attempt it because it is too much of a bother to figure out from scratch; they should not break their back striving to reïnvent a wheel that works perfectly well.

On the other hand, HTML that falls within a predictable subset is ultra-easy to parse with regexes. It’s no wonder people try to use them, because for small problems, toy problems perhaps, nothing could be easier. That’s why it’s so important to distinguish the two tasks (specific vs generic), as these do not necessarily demand the same approach.

I hope in the future here to see a more fair and honest treatment of questions about HTML and regexes.

Here’s my HTML lexer. It doesn’t try to do a validating parse; it just identifies the lexical elements. You might think of it more as an HTML chunker than an HTML parser. It isn’t very forgiving of broken HTML, although it makes some very small allowances in that direction.

Even if you never parse full HTML yourself (and why should you? it’s a solved problem!), this program has lots of cool regex bits that I believe a lot of people can learn a lot from. Enjoy!

#!/usr/bin/env perl
#
# chunk_HTML - a regex-based HTML chunker
#
# Tom Christiansen <tchrist@perl.com>
#   Sun Nov 21 19:16:02 MST 2010
########################################

use 5.012;

use strict;
use autodie;
use warnings qw< FATAL all >;
use open     qw< IN :bytes OUT :utf8 :std >;

MAIN: {
  $| = 1;
  lex_html(my $page = slurpy());
  exit();
}

########################################################################
sub lex_html {
    our $RX_SUBS;                                        ###############
    my  $html = shift();                                 # Am I...     #
    for (;;) {                                           # forgiven? :)#
        given ($html) {                                  ###############
            last                when (pos || 0) >= length;
            printf "\@%d=",          (pos || 0);
            print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
            print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
            print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
            print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
            print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
            print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
            print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
            print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
            print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
            print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
            print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
            default {
                die "UNCLASSIFIED: " .
                  substr($_, pos || 0, (length > 65) ? 65 : length);
            }
        }
    }
    say ".";
}
#####################
# Return correctly decoded contents of next complete
# file slurped in from the <ARGV> stream.
#
sub slurpy {
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <ARGV> };   # read all input

    return unless length;

    use Encode   qw< decode >;

    my $bom = "";
    given ($_) {
        $bom = "UTF-32LE" when / ^ \xFf \xFe \0   \0   /x;  # LE
        $bom = "UTF-32BE" when / ^ \0   \0   \xFe \xFf /x;  #   BE
        $bom = "UTF-16LE" when / ^ \xFf \xFe           /x;  # le
        $bom = "UTF-16BE" when / ^ \xFe \xFf           /x;  #   be
        $bom = "UTF-8"    when / ^ \xEF \xBB \xBF      /x;  # st00pid
    }
    if ($bom) {
        say "[BOM $bom]";
        s/^...// if $bom eq "UTF-8";                        # st00pid

        # Must use UTF-(16|32) w/o -[BL]E to strip BOM.
        $bom =~ s/-[LB]E//;

        return decode($bom, $_);

        # if BOM found, don't fall through to look
        #  for embedded encoding spec
    }

    # Latin1 is web default if not otherwise specified.
    # No way to do this correctly if it was overridden
    # in the HTTP header, since we assume stream contains
    # HTML only, not also the HTTP header.
    my $encoding = "iso-8859-1";
    while (/ (?&xml) $RX_SUBS /pgx) {
        my $xml = ${^MATCH};
        next unless $xml =~ m{              $RX_SUBS
            (?= encoding )  (?&name)
                            (?&equals)
                            (?&quote) ?
            (?<ENCODING>    (?&value)       )
        }sx;
        if (lc $encoding ne lc $+{ENCODING}) {
            say "[XML ENCODING $encoding => $+{ENCODING}]";
            $encoding = $+{ENCODING};
        }
    }

    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};

        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv )    (?&name)
                                (?&equals)
            (?= (?&quote)? content-type )
                                (?&value)
        }six;

        next unless $meta =~ m{             $RX_SUBS
            (?= content )       (?&name)
                                (?&equals)
            (?<CONTENT>         (?&value)    )
        }six;

        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset )       (?&name)
                                (?&equals)
            (?<CHARSET>         (?&value)    )
        }six;

        if (lc $encoding ne lc $+{CHARSET}) {
            say "[HTTP-EQUIV ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    }

    return decode($encoding, $_);
}
########################################################################
# Make sure this function is called
# as soon as source unit has been compiled.
UNITCHECK { load_rxsubs() }

# useful regex subroutines for HTML parsing
sub load_rxsubs {

    our $RX_SUBS = qr{
      (?(DEFINE)

        (?<WS> \s *  )

        (?<any_nv_pair>     (?&name) (?&equals) (?&value)         )
        (?<name>            \b (?=  \pL ) [\w:\-] +  \b           )
        (?<equals>          (?&WS)  = (?&WS)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )

        (?<unquoted_value>  [\w:\-] *                             )

        (?<any_quote>  ["']      )

        (?<quoted_value>
            (?<quote>   (?&any_quote)  )
            (?: (?! \k<quote> ) . ) *
            \k<quote>
        )

        (?<start_tag>       < (?&WS)      )
        (?<html_end_tag>      >           )
        (?<xhtml_end_tag>   / >           )
        (?<end_tag>
            (?&WS)
            (?: (?&html_end_tag)
              | (?&xhtml_end_tag) )
         )

        (?<tag>
            (?&start_tag)
            (?&name)
            (?:
                (?&WS)
                (?&any_nv_pair)
            ) *
            (?&end_tag)
        )

        (?<untag> </ (?&name) > )

        # starts like a tag, but has screwed up quotes inside it
        (?<nasty>
            (?&start_tag)
            (?&name)
            .*?
            (?&end_tag)
        )

        (?<nontag>    [^<] +            )

        (?<string> (?&quoted_value)     )
        (?<word>   (?&name)             )

        (?<doctype>
            <!DOCTYPE
                # please don't feed me nonHTML
                ### (?&WS) HTML
            [^>]* >
        )

        (?<cdata>   <!\[CDATA\[     .*?     \]\]    > )
        (?<script>  (?= <script ) (?&tag)   .*?     </script> )
        (?<style>   (?= <style  ) (?&tag)   .*?     </style> )
        (?<comment> <!--            .*?           --> )

        (?<xml>
            < \? xml
            (?:
                (?&WS)
                (?&any_nv_pair)
            ) *
            (?&WS)
            \? >
        )

        (?<xhook> < \? .*? \? > )

      )

    }six;

    our $Meta_Tag_Rx = qr{                          $RX_SUBS
        (?<META>
            (?&start_tag) meta \b
            (?:
                (?&WS) (?&any_nv_pair)
            ) +
            (?&end_tag)
        )
    }six;

}

# nobody *ever* remembers to do this!
END { close STDOUT }

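  • For those who don't know, I thought I'd mention that Tom is a co-author of "Programming Perl" (a.k.a. the Camel book) and one of the top Perl authorities. If you doubt that this is the real Tom Christiansen, go back and read the post. (168 upvotes)
  • @tchrist Very impressive. You are obviously a highly skilled and talented Perl programmer, and extremely knowledgeable about modern regular expressions. But I would point out that what you have written is not really a regular expression (modern, regular, or otherwise), but rather a Perl program that uses regular expressions heavily. Does your post really support the claim that regular expressions can parse HTML correctly? Or is it more like evidence that *Perl* can parse HTML correctly? Either way, nice job! (64 upvotes)
  • @tchrist, this does not answer the OP's original question at all. And is *parsing* the proper term here? Afaics the regexes are doing tokenizing/lexical analysis, but the final parsing is done with Perl code, not the regexes themselves. (27 upvotes)
  • Two highlights from your answer: "I use parsing classes all the time, especially if it's HTML I haven't generated myself." and "Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them." I totally agree with what you have said, so I am re-evaluating the problem. Thanks a lot for such a detailed answer. (23 upvotes)
  • To summarize: RegEx is misnamed. I think that's a shame, but it won't change. Compatible "RegEx" engines are not allowed to reject non-regular languages. They therefore cannot be implemented correctly with only finite state machines. The powerful concepts around computational classes do not apply. Using RegEx does not ensure O(n) execution time. The advantages of RegEx are terse syntax and the implied domain of character recognition. To me, this is a slow-moving train wreck, impossible to look away from, but with horrible consequences unfolding. (20 upvotes)
  • A warning to future readers: here be dragons. Please read this, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 etc. (6 upvotes)
  • @Jonathan M: Yes, of course it will. Otherwise it would be broken, stupid, and wrong, the way most people's are. But not mine. :) (6 upvotes)
  • This is the weird zone between traditional Perl regexes and Perl 6 rules. Tom has really written a grammar, although the match operator understands it. :) (5 upvotes)
  • This answer has been added to the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496), under "General Information > When not to use Regex". (5 upvotes)
  • @Salman: Glad to help. My second point is the more important one. I really wish people would stop writing **%@#¡%^¿>±··························································································· . Probably the most important thing is to apply problem decomposition using top-down programming and meaningful *alphabetically named* identifiers. It really changes everything, doesn't it? **ƎƨɐƎ⅂d - ƨuɹəʇʇɐdλlƃnɟəɹoɯo** (4 upvotes)
  • @Tom - I have (finally) studied Friedl's MRE3, but this post makes it clear to me that I have a long way to go before I truly "know" regex in the Neo "I know kung fu!" sense. Is there an equally well-written book or resource you could recommend that might help me get to the next level? And thanks for the excellent post! +1 (4 upvotes)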

med*_*iev 124

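  1. You can write a novel like tchrist did
  2. You can use a DOM library, load the HTML, and use XPath with //input[@type="hidden"]. Or, if you don't want to use XPath, just get all the inputs and filter for the hidden ones with getAttribute.

I prefer #2.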

<?php

$d = new DOMDocument();
$d->loadHTML(
    '
    <p>fsdjl</p>
    <form><div>fdsjl</div></form>
    <input type="hidden" name="blah" value="hide yo kids">
    <input type="text" name="blah" value="hide yo kids">
    <input type="hidden" name="blah" value="hide yo wife">
');
$x = new DOMXpath($d);
$inputs = $x->evaluate('//input[@type="hidden"]');

foreach ( $inputs as $input ) {
    echo $input->getAttribute('value'), '<br>';
}

Result:

hide yo kids<br>hide yo wife<br>
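
The same parser-first approach can be sketched outside PHP. The following is a minimal illustration using Python's standard-library `html.parser` module; the class name `HiddenInputCollector` and the helper `hidden_inputs` are invented for this example, and the standard-library parser is a tolerant tokenizer rather than a full DOM with XPath support.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect the attributes of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.hidden_inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            self.hidden_inputs.append(attributes)

def hidden_inputs(html):
    """Return one attribute dict per hidden input found in html."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.hidden_inputs
```

Attribute order is irrelevant here because the parser hands over each tag's attributes as a whole, and an `<input>` that sits inside an HTML comment is never reported as a tag.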

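  • Actually, that was my point. I wanted to show how hard it is. (71 upvotes)
  • Very good stuff there. I had really hoped people would show how much easier it is to use a parsing class, so thanks! I just wanted a working example of the extreme trouble you have to go through to do it from scratch using regexes. I sure hope most folks conclude that they should use prefab parsers on generic HTML instead of rolling their own. Regexes are still great for simple HTML they made themselves, though, because that gets rid of 99.98% of the complexity. (19 upvotes)
  • After reading these two very interesting approaches, it would be nice to compare the speed, memory usage, and CPU cost of one approach against the other (i.e. regex-based vs. parsing class). (5 upvotes)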

Pla*_*ure 106

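Contrary to all the answers here, for what you're trying to do regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags; THAT would be impossible with regex! But you are only matching what's in one tag, and that's perfectly regular.

Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values contain a > character, so it's not perfect, but it should suffice for sane inputs.

Here's some Perl (pseudo)code to show you what I mean: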

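my $html = readLargeInputFile();    # placeholder: returns the whole HTML as one string

# Braces are used as delimiters because a "/" inside an /x comment
# would otherwise terminate a pattern delimited by slashes.
my @input_tags = $html =~ m{
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        />                          # ...except for the />, which is grabbed here
    )}xgm;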

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Put $hash_ref in a list or something, or otherwise process it
}

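The basic principle here is: don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.

EDIT: However, I will agree that in general, using an HTML parser is probably easier and better, and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.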
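
The context-then-submatch idea is not tied to Perl. Here is a rough Python sketch of it, with the caveats stated up front: the helper names `attribute` and `hidden_fields` are invented, attribute values are assumed to be double-quoted and free of `>`, and a name such as `data-name` would also satisfy the `\bname` lookup. It is an illustration of the two-step principle, not a general HTML solution.

```python
import re

# Step 1 pattern: the context, i.e. one whole <input ...> tag.
INPUT_TAG = re.compile(r"<input\b[^>]*>", re.IGNORECASE)

def attribute(tag, name):
    """Return the double-quoted value of one attribute, or None if absent."""
    pattern = r"\b" + re.escape(name) + r'="([^"]*)"'
    m = re.search(pattern, tag, re.IGNORECASE)
    return m.group(1) if m else None

def hidden_fields(html):
    """Return (name, value) pairs for every hidden input, in document order."""
    fields = []
    for tag in INPUT_TAG.findall(html):
        # Step 2: independent submatches, so attribute order does not matter.
        if (attribute(tag, "type") or "").lower() == "hidden":
            fields.append((attribute(tag, "name"), attribute(tag, "value")))
    return fields
```

Each attribute is looked up with its own small search inside the already-isolated tag, which is why `type` may appear first, last, or in the middle.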

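  • Not contrary to *all* the answers here. :) (13 upvotes)
  • Obligatory link to the best SO answer on the subject (possibly the best SO answer, period): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 (13 upvotes)
  • Well, for some reason my typing took longer than yours did. I think my keyboard must need greasing. :) (7 upvotes)
  • @tchrist: Your answer wasn't here when I posted mine. ;-) (6 upvotes)
  • That's invalid HTML; it should be value="&lt;Are you really sure about this?&gt;". If the place he's scraping does a poor job of escaping things like this, then he'll need a more sophisticated solution, but if they do it right (and if he has control over it, he should make sure it's right) then he's fine. (6 upvotes)
  • @DanielRibeiro - Except it isn't an answer. It is still there because over the years enough people have found it entertaining to keep it from being deleted. (5 upvotes)
  • <input type="hidden" name="question" value="<Are you really sure about this?>" /> (4 upvotes)
  • @RossSnyder: [No, it's not.](http://stackoverflow.com/a/5320217) Also, a bigger problem with this attempt to parse HTML with regexps is `<!-- <input type="hidden" name="This is not an input tag" value="It's just a comment" /> -->`. (Yes, [that's valid HTML too](http://www.w3.org/TR/html-markup/syntax.html#comments).) (4 upvotes)
  • Or, for that matter, `<input value="How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?" name=question type=hidden />`. ([Yes, that's valid too.](http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#attributes-0)) (3 upvotes)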

Dav*_*vid 21

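In the spirit of Tom Christiansen's lexer solution, here is a link to Robert Cameron's seemingly forgotten 1998 article, REX: XML Shallow Parsing with Regular Expressions.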

http://www.cs.sfu.ca/~cameron/REX.html

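Abstract

The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used as a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.

If you enjoy reading about regular expressions, Cameron's paper is fascinating. His writing is concise, thorough, and very detailed. He is not simply showing you how to construct the REX regular expression but also an approach for building up any complex regex from smaller parts.

I have been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster asked about (how do I match this particular tag but not some other very similar tag?). I have found the regex he developed to be completely reliable.

REX is particularly useful when you are focusing on the lexical details of a document, for example when transforming one kind of text document (e.g., plain text, XML, SGML, HTML) into another, where the document may not be valid, well formed, or even parsable for most of the transformation. It lets you target islands of markup anywhere within a document without disturbing the rest of the document.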
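
To make "shallow parsing" concrete, here is a deliberately tiny Python sketch. It is not the REX expression from Cameron's paper; the `MARKUP` pattern below is a crude stand-in invented for illustration, and it ignores CDATA sections, processing instructions, DOCTYPE internals, and `>` inside quoted attribute values. It only shows the shape of the result: a flat list of markup and text items that reassembles into the original document.

```python
import re

# Crude stand-in for REX: a markup item is a comment, or "<" up to the next ">".
# The capturing group makes re.split keep the markup items in its result.
MARKUP = re.compile(r"(<!--.*?-->|<[^<>]*>)", re.DOTALL)

def shallow_parse(document):
    """Split document into a flat list of markup and text items."""
    return [item for item in MARKUP.split(document) if item]
```

Because nothing is discarded, `"".join(shallow_parse(doc)) == doc` holds for any input, which is what allows one island of markup to be edited while the rest of the document passes through untouched.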


Sua*_*ere 7

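While I love the content of the other answers, they don't really answer the question directly or correctly. Even Platinum's answer is overly complicated, and also less efficient. So I was forced to post this.

I'm a huge proponent of regex when it is used correctly. But because of the stigma (and performance), I always state that well-formed XML or HTML should be handled with an XML parser. Plain string parsing would perform even better, though readability suffers if it gets too out of hand. However, that isn't the question. The question is how to match an input tag of type hidden. The answer is: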

<input[^>]*type="hidden"[^>]*>

Depending on your regex flavor, the only option you need to include is the ignorecase option.
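
As a small, hedged illustration in Python (one flavor among many; the names `HIDDEN_INPUT` and `find_hidden_inputs` are invented here), the pattern can be compiled with `re.IGNORECASE`. The sketch also makes the pattern's known limit easy to try out: because `[^>]*` stops at the first `>`, a `>` inside an attribute value cuts the match short.

```python
import re

# The pattern from the answer, compiled with the ignorecase option.
HIDDEN_INPUT = re.compile(r'<input[^>]*type="hidden"[^>]*>', re.IGNORECASE)

def find_hidden_inputs(html):
    """Return the text of every tag the pattern considers a hidden input."""
    return HIDDEN_INPUT.findall(html)

# Attribute order and letter case do not matter:
#   find_hidden_inputs('<INPUT name="a" TYPE="hidden" value="1" />')
# A ">" inside a value truncates the match at that character:
#   find_hidden_inputs('<input type="hidden" name="q" value="a > b">')
```

Whether that truncation matters depends on the input: for markup you generate yourself it may never occur, while for arbitrary pages an HTML parser avoids the question.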

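  • `<input type='hidden' name='Oh, <really>?' value='Try a real HTML parser instead.'>` (4 upvotes)
  • Your example is self-closing; it should end with />. Also, while the chance of having a `>` in the name field is almost none, it is indeed possible for there to be a `>` in an action handler, e.g. an inline JavaScript call in the OnClick property. That being said, I have an XML parser for those cases, but I also have a regex for when the document I'm given is too messed up for XML parsers to handle but a regex can cope. In addition, this is not what the question was. You will never run into these situations with a hidden input, and my answer is the best. "Ya, <really>!" (4 upvotes)
  • `/>` is an XML-ism; it isn't required in any version of HTML except XHTML (which never really gained much traction, and has been all but superseded by HTML5). You're right that there is a lot of messy, not-really-valid HTML out there, but a good HTML (*not* XML) parser should be able to cope with most of it; if it can't, most likely neither can browsers. (3 upvotes)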