Extracting the contents of a <div> tag with a regular expression

Question

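I'm having a bit of a brain freeze here, so I was hoping for some pointers. Basically, I need to extract the contents of a specific div tag. Yes, I know regular expressions are usually frowned upon for this, but it's a simple web-scraping application and there are no nested divs.

I want to match this: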

    <div class="entry">
      <span class="title">Some company</span>
      <span class="description">
        <strong>Address: </strong>Some address
        <br /><strong>Telephone: </strong> 01908 12345
      </span>
    </div>

The simple VB code is as follows:

    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    'Search for all the words in a string
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next

Any help would be greatly appreciated.

Answer 1

Tim*_*ker 7

Your regex works for your example. However, a few improvements should be made:

    <div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", which ensures that we don't accidentally break out of the tag we're in.

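.*? (note the ?) means "match any number of characters, but only as few as possible". This avoids matching from the first <div class="entry"> tag on the page all the way through to the last one.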
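The difference between the greedy and the lazy quantifier can be seen in a minimal sketch. The input string and the names `html`, `greedyRegex`, and `lazyRegex` are made up for illustration; only the patterns come from this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedyDemo
    Sub Main()
        ' Hypothetical input: two entry divs on one page.
        Dim html As String = "<div class=""entry"">first</div>" & vbLf & "<div class=""entry"">second</div>"

        ' Greedy .* runs on to the LAST </div>, so both entries collapse into one match.
        Dim greedyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*)</div>", RegexOptions.Singleline)
        ' Lazy .*? stops at the FIRST </div> after each opening tag.
        Dim lazyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)

        Console.WriteLine("greedy matches: {0}", greedyRegex.Matches(html).Count)   ' 1
        Console.WriteLine("lazy matches:   {0}", lazyRegex.Matches(html).Count)     ' 2

        For Each m As Match In lazyRegex.Matches(html)
            Console.WriteLine(m.Groups("content").Value)   ' "first", then "second"
        Next
    End Sub
End Module
```

With only one entry div on the page the two patterns behave the same, which is presumably why the original regex still works on the single-div example.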

However, your regex by itself should still have matched something. Perhaps you aren't using it correctly?
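One possible culprit, which the answer does not name, is that the question's loop reads `successfulMatch.Groups(1)` while the pattern contains no capturing parentheses. In .NET, asking for a group that does not exist returns an unsuccessful group with an empty value, so the message box would be blank even though the match succeeded. A minimal sketch, using a shortened hypothetical input:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CaptureGroupDemo
    Sub Main()
        ' Hypothetical, shortened input.
        Dim html As String = "<div class=""entry"">Some company</div>"

        ' The question's pattern: it matches, but has no parentheses, so there is no group 1.
        Dim withoutGroup As Match = Regex.Match(html, "<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
        Console.WriteLine(withoutGroup.Success)                        ' True
        Console.WriteLine(withoutGroup.Groups(1).Success)              ' False
        Console.WriteLine("[" & withoutGroup.Groups(1).Value & "]")    ' []
        Console.WriteLine(withoutGroup.Value)                          ' the whole match, same as Groups(0)

        ' Same pattern with a capturing group around the contents.
        Dim withGroup As Match = Regex.Match(html, "<div.*?class=""entry"".*?>(.*)</div>", RegexOptions.Singleline)
        Console.WriteLine(withGroup.Groups(1).Value)                   ' Some company
    End Sub
End Module
```

Either reading `.Value` (the whole match) or adding a group and reading that group should give a non-empty result.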

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

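    ' Singleline lets "." match newlines, so the lazy .*? can span a multi-line div.
    Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
    ' SubjectString is the downloaded HTML; ResultList is a List(Of String) declared elsewhere.
    Dim MatchResult As Match = RegexObj.Match(SubjectString)
    While MatchResult.Success
        ResultList.Add(MatchResult.Groups("content").Value)
        MatchResult = MatchResult.NextMatch()
    End While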

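I'd advise against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies: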

    <div[^<>]*class="entry"[^<>]*>\s*
    <span[^<>]*class="title"[^<>]*>\s*
    (?<title>.*?)
    \s*</span>\s*
    <span[^<>]*class="description"[^<>]*>\s*
    <strong>\s*Address:\s*</strong>\s*
    (?<address>.*?)
    \s*<strong>\s*Telephone:\s*</strong>\s*
    (?<phone>.*?)
    \s*</span>\s*</div>

Or (behold the joy of multi-line strings in VB.NET):

    Dim RegexObj As New Regex(
        "<div[^<>]*class=""entry""[^<>]*>\s*" & Chr(10) & _
        "<span[^<>]*class=""title""[^<>]*>\s*" & Chr(10) & _
        "(?<title>.*?)" & Chr(10) & _
        "\s*</span>\s*" & Chr(10) & _
        "<span[^<>]*class=""description""[^<>]*>\s*" & Chr(10) & _
        "<strong>\s*Address:\s*</strong>\s*" & Chr(10) & _
        "(?<address>.*?)" & Chr(10) & _
        "\s*<strong>\s*Telephone:\s*</strong>\s*" & Chr(10) & _
        "(?<phone>.*?)" & Chr(10) & _
        "\s*</span>\s*</div>",
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results of MatchResult.Groups("title") and so on...)
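As a rough sketch of what storing those results might look like, the example below runs the long pattern against the sample markup from the question and copies the three named groups into a small container. The `Entry` class and the variable names are hypothetical and not part of the original answer; the pattern is the same one, written without the optional line breaks.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module NamedGroupsDemo
    ' Hypothetical container for one scraped entry.
    Class Entry
        Public Title As String
        Public Address As String
        Public Phone As String
    End Class

    Sub Main()
        ' Sample markup from the question, joined with line feeds.
        Dim html As String = _
            "<div class=""entry"">" & vbLf & _
            "  <span class=""title"">Some company</span>" & vbLf & _
            "  <span class=""description"">" & vbLf & _
            "  <strong>Address: </strong>Some address" & vbLf & _
            "    <br /><strong>Telephone: </strong> 01908 12345" & vbLf & _
            "  </span>" & vbLf & _
            "</div>"

        Dim pattern As String = _
            "<div[^<>]*class=""entry""[^<>]*>\s*" & _
            "<span[^<>]*class=""title""[^<>]*>\s*(?<title>.*?)\s*</span>\s*" & _
            "<span[^<>]*class=""description""[^<>]*>\s*" & _
            "<strong>\s*Address:\s*</strong>\s*(?<address>.*?)" & _
            "\s*<strong>\s*Telephone:\s*</strong>\s*(?<phone>.*?)" & _
            "\s*</span>\s*</div>"
        Dim regexObj As New Regex(pattern, RegexOptions.Singleline)

        ' Walk all matches and copy each named group into an Entry.
        Dim entries As New List(Of Entry)
        Dim m As Match = regexObj.Match(html)
        While m.Success
            Dim found As New Entry()
            found.Title = m.Groups("title").Value
            found.Address = m.Groups("address").Value
            found.Phone = m.Groups("phone").Value
            entries.Add(found)
            m = m.NextMatch()
        End While

        For Each item As Entry In entries
            Console.WriteLine(item.Title & " | " & item.Phone)   ' Some company | 01908 12345
        Next
    End Sub
End Module
```

On this sample the address group also picks up the trailing `<br />`, because the pattern has no slot for it. That illustrates how tightly such a regex is tied to the exact shape of the markup.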
