C# Regex - find HTML tags (div and anchor)

cze*_*sio 2 html c# regex tags find

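I have to retrieve several div sections (those with the class name "row") together with their content, and additionally find all anchor tags (link URLs) with the class "underline red bold". Long story short, I need to get the following sections: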

<div class = "row ">
 ... (divs, tags ...)
<a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">

and the set of URLs:

string[] urls = {"/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"};

The whole page looks like this:

<html>

... lots of stuff

<div class="row ">

  <div class="photo">
    <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
      <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f0607827.jpg">                 
 </a>
  </div>

  <div class="desc">
    <div class="l1">
      <div class="icons">
      </div>

      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td>
              <div class="fleft">
                <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
                  Culture And Gender   <br>Intimate Relation</a>
              </div>

              <div class="fleft">

              </div>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="l2">

      <div>
      </div>
      <div>
        <div class="but">
        </div>
      </div>
    </div>
    <div class="l3">
      Long description
      <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
        more<img alt="" src="/b/img/arr_red_sm.gif">
  </a>
    </div>
  </div>
</div>

<div class="omit"></div>

<div class="row ">

  <div class="photo">
    <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534899,p">
      <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f06078222.jpg">                    
 </a>
  </div>

  <div class="desc">
    <div class="l1">
      <div class="icons">
      </div>

      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td>
              <div class="fleft">
                <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod5653489225,p">
                  Culture And Gender   <br>Intimate Relation</a>
              </div>

              <div class="fleft">

              </div>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="l2">

      <div>
      </div>
      <div>
        <div class="but">
        </div>
      </div>
    </div>
    <div class="l3">
      Long description
      <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
        more<img alt="" src="/b/img/arr_red_sm.gif">
  </a>
    </div>
  </div>
</div>

Can someone help me create a suitable regex?

Jen*_*ens 15

Regular expressions are not a good fit for this.

Because HTML is nested, a regex that does what you ask would be very (very, very) long and complicated. Use an HTML parser instead.
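
As a rough illustration of the parser route, here is a minimal sketch using HTML Agility Pack (the library suggested in the comments; a third-party NuGet package, not part of .NET). The class name `RowLinkExtractor`, the method `ExtractRowLinks`, and the XPath expressions are my own assumptions for this example, not something taken from the question. It selects every div whose class list contains the token `row`, then collects the `href` of each anchor inside it whose class is exactly `underline red bold`.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package (assumed to be installed)

public static class RowLinkExtractor
{
    // Hypothetical helper: returns the href of every <a class="underline red bold">
    // located inside a <div> whose class list contains the token "row".
    public static List<string> ExtractRowLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Token match on @class: "row " (with trailing space) matches, "rows" does not.
        const string rowXPath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' row ')]";
        // Exact match on the class attribute of the anchors we care about.
        const string linkXPath = ".//a[@class='underline red bold']";

        var urls = new List<string>();

        // SelectNodes may return null when nothing matches, so guard against it.
        var rows = doc.DocumentNode.SelectNodes(rowXPath);
        if (rows == null) return urls;

        foreach (HtmlNode row in rows)
        {
            // row.OuterHtml would give the whole div, nested content included.
            var links = row.SelectNodes(linkXPath);
            if (links == null) continue;

            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                if (href.Length > 0) urls.Add(href);
            }
        }
        return urls;
    }
}
```

Assuming the page source is in a string called `pageHtml` (a hypothetical variable), `string[] urls = RowLinkExtractor.ExtractRowLinks(pageHtml).ToArray();` yields the array from the question. If the `&amp;` entities in the URLs need decoding, `HtmlEntity.DeEntitize` from the same library can be applied to each value.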

  • +1, and by 'HTML Parser' Jens means HTML Agility Pack; for any C# HTML parsing need, nothing else comes close: http://htmlagilitypack.codeplex.com/Wikipage (3 upvotes)
  • Obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 (2 upvotes)