Problem parsing table data in Perl

use*_*707 11 html regex perl parsing

I have a long HTML document that repeats a pattern like this:

<td class="MODULE_PRODUCTS_CELL " align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a><\br>
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>

From the HTML above I want to extract:

  1. collections.php?prod_id=50
  2. files/products_categories50_t.txt
  3. Buffer

I have tried this code:

#!/usr/local/bin/perl

use strict;
use warnings;
my $filename =  'sr.txt';

open(FILENAME,$filename);
my @str = <FILENAME>;
chomp(@str);
#print "@str";

foreach my  $str(@str){    
     if ($str =~/<td class(.*)<a href(.*?)><\/td>/) {
         print "*****$2\n";
     }    
}

This code is only a first attempt. However, it fetches only the last occurrence instead of every occurrence. Why?

tch*_*ist 56

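Summary

Using patterns on small, limited, reasonably well-defined pieces of HTML can be quick and easy. Using them on an entire document that contains fully general, open-ended HTML with unforeseeable quirks is theoretically doable, but in practice it is far too hard compared with using someone else's parser that was written for exactly that purpose. For a more general discussion of using patterns on XML or HTML, see this answer.

Naïve Regex Solution

You asked for a regex solution, so I will give you one.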

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

while (m{ < \s* img [^>]* src \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "IMG SRC=$1\n";
}

while (m{ < \s* a [^>]* href \s* = \s* ['"]? ([^<>'"]+) }gsix) {
    print "A HREF=$1\n";
}

while (m{ < \s* strong [^>]* > (.*?) < \s* / \s* strong \s* > }gsix) {
    print "STRONG=$1\n";
}

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

When run, this program produces the following output:

IMG SRC=files/products_categories50_t.txt
A HREF=/collections.php?prod_id=50
A HREF=/collections.php?prod_id=50
STRONG=Buffer

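If you are quite sure that it works on the particular specimen of HTML you care about, then by all means use it. Notice several things I do that you did not. One of them is not processing the HTML a line at a time. That almost never works.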
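To make the line-at-a-time point concrete, here is a minimal sketch. The sample string and the `hrefs_in` helper are illustrative names of my own, not part of the program above. A tag that is split across a line break is never seen whole by a per-line match, while a single `/g` match over the entire string finds it.

```perl
use strict;
use warnings;

# Hypothetical sample: the second <a> tag is split across two lines.
my $html = <<'END_HTML';
<td><a href="/collections.php?prod_id=50">one</a></td>
<td><a
   href="/collections.php?prod_id=51">two</a></td>
END_HTML

# Collect every href value found in one string (illustrative helper).
sub hrefs_in {
    my ($text) = @_;
    my @found;
    while ($text =~ m{ < \s* a \b [^>]* \b href \s* = \s* ['"]? ([^<>'"\s]+) }gsix) {
        push @found, $1;
    }
    return @found;
}

# Line at a time: the split tag is never matched.
my @by_line = map { hrefs_in($_) } split /\n/, $html;

# Whole document at once: both tags are matched.
my @slurped = hrefs_in($html);

printf "by line: %d, slurped: %d\n", scalar @by_line, scalar @slurped;
```

The same reasoning applies to the original attempt: its pattern needs the opening `<td class`, the `<a href`, and the closing `</td>` to sit on one line, which the sample HTML does not guarantee.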

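However, this sort of solution works only on extremely limited forms of valid HTML. You can use it only when you can guarantee that the HTML you are working with really looks the way you expect it to.

The problem is that it quite often does not look this tidy. For those situations, you are strongly advised to use an HTML parsing class. However, nobody seems to have shown you any code for doing that. That is not very helpful.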

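Wizard-Level Regex Solution

And I am going to be one of those people myself. I am going to show you a more general solution that gets closer to what I think you want, but unlike anyone else who posts on Stack Overflow, I am going to do it with regexes, just to show you that it can be done, and that you do not want to do it this way: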

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

$/ = undef;
$_ = <DATA>;   # read all input

our(
    $RX_SUBS,
    $tag_template_rx,
    $script_tag_rx,
    $style_tag_rx,
    $strong_tag_rx,
    $a_tag_rx,
    $img_tag_rx,
);

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ $style_tag_rx  .*?  < (?&WS) / (?&WS) style  (?&WS) > }{}gsix; 
s{ $script_tag_rx .*?  < (?&WS) / (?&WS) script (?&WS) > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

while (/$img_tag_rx/g) {
    my $tag = $+{TAG};
    printf "IMG tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b src (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tSRC is $value\n";
    } 

} 

while (/$a_tag_rx/g) {
    my $tag = $+{TAG};
    printf "A tag at %d: %s\n", pos(), $tag;
    while ($tag =~ 
        m{ 
            $RX_SUBS  
            \b href (?&WS) = (?&WS) 
            (?<VALUE> 
                (?: (?&quoted_value) | (?&unquoted_value) ) 
            )
        }gsix) 
    {
        my $value = dequote($+{VALUE});
        print "\tHREF is $value\n";
    } 
} 

while (m{
            $strong_tag_rx  (?&WS) 
            (?<BODY> .*? )  (?&WS) 
            < (?&WS) / (?&WS) strong (?&WS) > 
        }gsix) 
{
    my ($tag, $body) = @+{ qw< TAG BODY > };
    printf "STRONG tag at %d: %s\n\tBODY=%s\n", 
            pos(), $+{TAG}, $+{BODY};
} 

exit;

sub dequote { 
    my $string = shift();
    $string =~ s{
        ^
        (?<quote>   ["']      )
        (?<BODY> 
            (?: (?! \k<quote> ) . ) *
        )
        \k<quote> 
        $
    }{$+{BODY}}gsx;
    return $string;
}

sub load_patterns { 

    $RX_SUBS = qr{ (?(DEFINE)

        (?<any_attribute> 
            \b \w+
            (?&WS) = (?&WS) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<WS>     \s *   )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

      ) # end DEFINE

    }six;

    my $_TAG_SUBS = $RX_SUBS . q{ (?(DEFINE)

        (?<attributes>
            (?: 
                (?&WS) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            (?= (?&legal_attribute) )
            (?&any_attribute) 
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<tag>
            (?&start_tag)
            (?&WS) 
            (?&attributes) 
            (?&WS) 
            (?&end_tag)
        )

      ) # end DEFINE

    };  # this is a q tag, not a qr

    $tag_template_rx = qr{ 

            $_TAG_SUBS

        (?<TAG> (?&XXX_tag) )

        (?(DEFINE)
            (?<XXX_tag>     (?&tag)             )
            (?<start_tag>  < (?&WS) XXX \b      )
            (?<required_attribute>      (*FAIL) )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE
    }six;

    $script_tag_rx = qr{   

            $_TAG_SUBS

        (?<TAG> (?&script_tag) )
        (?(DEFINE)
            (?<script_tag>  (?&tag)                )
            (?<start_tag>  < (?&WS) script \b      )
            (?<required_attribute>      type )
            (?<permitted_attribute>             
                charset     
              | defer
              | src
              | xml:space
            )
            (?<standard_attribute>      (*FAIL) )
            (?<event_attribute>         (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end DEFINE
    }six;

    $style_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&style_tag) )

        (?(DEFINE)

            (?<style_tag>  (?&tag)  )

            (?<start_tag>  < (?&WS) style \b       )

            (?<required_attribute>      type    )
            (?<permitted_attribute>     media   )

            (?<standard_attribute>
                dir
              | lang
              | title
              | xml:lang
            )

            (?<event_attribute>         (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        )  # end define

    }six;

    $strong_tag_rx = qr{    

            $_TAG_SUBS

        (?<TAG> (?&strong_tag) )

        (?(DEFINE)

            (?<strong_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                strong 
                \b       
            )

            (?<standard_attribute>
                class       
              | dir 
              | ltr 
              | id  
              | lang        
              | style       
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on click
              | on dbl click
              | on mouse down
              | on mouse move
              | on mouse out
              | on mouse over
              | on mouse up
              | on key down
              | on key press
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<permitted_attribute>     (*FAIL) )
            (?<optional_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )

        ) # end DEFINE

    }six; 

    $a_tag_rx = qr{         

            $_TAG_SUBS

        (?<TAG> (?&a_tag) )

        (?(DEFINE)
            (?<a_tag>  (?&tag)  )

            (?<start_tag>  
                < (?&WS) 
                a 
                \b       
            )

            (?<permitted_attribute>
                charset     
              | coords      
              | href        
              | href lang   
              | name        
              | rel 
              | rev 
              | shape       
              | rect
              | circle
              | poly        
              | target
            )

            (?<standard_attribute>
                access key  
              | class       
              | dir 
              | ltr 
              | id
              | lang        
              | style       
              | tab index   
              | title       
              | xml:lang
            )

            (?<event_attribute>
                on blur     
              | on click    
              | on dbl click        
              | on focus    
              | on mouse down       
              | on mouse move       
              | on mouse out        
              | on mouse over       
              | on mouse up 
              | on key down 
              | on key press
              | on key up
            )

            (?<required_attribute>      (*FAIL) )
            (?<deprecated_attribute>    (*FAIL) )
        ) # end define
    }xi;

    $img_tag_rx = qr{           
        $_TAG_SUBS
        (?<TAG> (?&image_tag) )
        (?(DEFINE)

            (?<image_tag> (?&tag) )

            (?<start_tag>  
                < (?&WS) 
                img 
                \b       
            )

            (?<required_attribute>
                alt
              | src
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                 align
               | border
               | hspace
               | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

        ###########################

        ) # end DEFINE

    }six;

}

UNITCHECK { load_patterns() } 

__END__

<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products">
    <tr>
        <td align="center" height="75">
            <a href="/collections.php?prod_id=50">
                <img src="files/products_categories50_t.txt" border="0" alt="products" />
            </a>
            <br/>
        </td>
    </tr>
    <tr>
        <td align="center">
            <a href="/collections.php?prod_id=50">
                <strong>Buffer</strong><br />
            </a>
        <td>
    </tr>
</table>
</td>

When run, this program produces the following output:

IMG tag at 304: <img src="files/products_categories50_t.txt" border="0" alt="products" />
        SRC is files/products_categories50_t.txt
A tag at 214: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
A tag at 451: <a href="/collections.php?prod_id=50">
        HREF is /collections.php?prod_id=50
STRONG tag at 491: <strong>
        BODY=Buffer
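
The program above leans heavily on `(?(DEFINE)...)` blocks and `(?&name)` calls, which need Perl 5.10 or later. As a much smaller sketch of that one technique, the following defines three named subpatterns and reuses them to pull a single attribute out of a tag string. The `attribute_value` helper and the simplified subpatterns are my own assumptions for illustration. They are not a drop-in replacement for the grammar above.

```perl
use 5.010;
use strict;
use warnings;

# A small library of named subpatterns. Nothing inside (?(DEFINE)...) matches
# on its own; each group runs only when it is called with (?&name).
my $subs = qr{
    (?(DEFINE)
        (?<WS>             \s*                    )
        (?<quoted_value>   " [^"]* " | ' [^']* '  )
        (?<unquoted_value> [^\s>]+                )
    )
}sx;

# Return the raw (still quoted) value of one attribute in one tag string,
# or undef if the attribute is absent. Illustrative helper only.
sub attribute_value {
    my ($tag, $name) = @_;
    if ($tag =~ m{
            $subs
            \b \Q$name\E (?&WS) = (?&WS)
            (?<VALUE> (?&quoted_value) | (?&unquoted_value) )
        }six)
    {
        return $+{VALUE};
    }
    return;
}

my $tag = '<img src="files/products_categories50_t.txt" border=0 alt="products" />';
for my $name (qw(src border alt href)) {
    printf "%-6s => %s\n", $name, attribute_value($tag, $name) // '(none)';
}
```

Keeping the subpatterns in one `qr` object and interpolating it into each match is the same layout the larger program uses with `$RX_SUBS`.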

The Choice Is Yours, or Is It?

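Both of these solve the problem with your regex. You may well be able to use the first of my two approaches. I cannot say, because, as with seemingly all such questions here, you have not told us enough about the data for us (and perhaps for you) to know whether the naïve approach will suffice.

If it does not, you have two choices.

  1. You can use the more robust and flexible approach offered by my second technique. Just make sure you understand it in every respect, or you will not be able to maintain your code, and neither will anyone else.
  2. Use an HTML parsing class.

I find it very unlikely that even 1 person in 1000 could reasonably make the first of those two choices. In particular, someone who asks for help with regexes as simple as those in my first solution is extremely unlikely to be someone who can manage the regexes given in my second solution.

That really leaves you with only one "choice", if I may use that word loosely.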
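
For completeness, here is a hedged sketch of what that second choice might look like. It assumes the CPAN module `HTML::TokeParser` (part of the HTML-Parser distribution) is installed, and it uses the sample HTML from above as inline data. The variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN; assumed to be installed

my $html = <<'END_HTML';
<td class="MODULE_PRODUCTS_CELL" align="center" valign="top" height="100">
<table width="100" summary="products"><tr>
<td align="center" height="75">
<a href="/collections.php?prod_id=50">
<img src="files/products_categories50_t.txt" border="0" alt="products" /></a><br />
</td>
</tr>
<tr>
<td align="center">
<a href="/collections.php?prod_id=50"><strong>Buffer</strong><br />
</a>
<td>
</tr></table>
</td>
END_HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my (@hrefs, @srcs, @names);

# get_tag() skips ahead to the next start tag with one of the given names
# and returns [ $tag, \%attributes, \@attribute_order, $original_text ].
while (my $token = $parser->get_tag('a', 'img', 'strong')) {
    my ($tag, $attr) = @$token;
    if ($tag eq 'a') {
        push @hrefs, $attr->{href} if defined $attr->{href};
    }
    elsif ($tag eq 'img') {
        push @srcs, $attr->{src} if defined $attr->{src};
    }
    else {
        # Text up to the closing </strong>, with whitespace trimmed.
        push @names, $parser->get_trimmed_text('/strong');
    }
}

print "A HREF=$_\n"  for @hrefs;
print "IMG SRC=$_\n" for @srcs;
print "STRONG=$_\n"  for @names;
```

The parser deals with quoting, attribute order, and line breaks inside tags, so none of that has to be encoded in a pattern.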

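  • I say again: if you disagree with my advising them not to use regexes, please offer an alternative. I suspect that none of you read what I wrote, and that you somehow think my advice is the opposite of what it really is. Please don't be silly. (20 upvotes)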

pto*_*mli 5

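You may find that parsing this with XPath is easier than using regexes. Your data could do with more semantic structure, though I imagine that is out of your hands.

Take a look at XML::XPath.

The 10-minute XPath tutorial and Automating System Administration with Perl may also come in handy.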

  • I find that [HTML::TreeBuilder::XPath](http://p3rl.org/HTML::TreeBuilder::XPath) has a more useful API than XML::XPath. (2 upvotes)
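
As a rough sketch of the XPath route, the following assumes the CPAN module `HTML::TreeBuilder::XPath` is installed. The HTML is a trimmed, well-formed stand-in for the data in the question, and the XPath expressions are my own guesses at what the asker needs.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN; assumed to be installed

# Hypothetical, tidied-up version of the question's markup.
my $html = <<'END_HTML';
<table summary="products">
  <tr><td><a href="/collections.php?prod_id=50"><img src="files/products_categories50_t.txt" alt="products" /></a></td></tr>
  <tr><td><a href="/collections.php?prod_id=50"><strong>Buffer</strong></a></td></tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findnodes() returns the matching elements in list context; attr() and
# as_trimmed_text() are ordinary HTML::Element methods.
my @hrefs = map { $_->attr('href') }    $tree->findnodes('//a[@href]');
my @srcs  = map { $_->attr('src') }     $tree->findnodes('//img[@src]');
my @names = map { $_->as_trimmed_text } $tree->findnodes('//a/strong');

print "A HREF=$_\n"  for @hrefs;
print "IMG SRC=$_\n" for @srcs;
print "STRONG=$_\n"  for @names;

$tree->delete;   # release the tree once the values are extracted
```

Each query names the structure it wants (`//a/strong` is "a strong element directly inside a link"), which tends to survive markup changes better than a pattern tied to the exact text layout.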