I'm playing with a toy HTML parser to help me get familiar with Scala's parser combinator library:

    import scala.util.parsing.combinator._

    sealed abstract class Node

    case class TextNode(val contents : String) extends Node

    case class Element(
      val tag : String,
      val attributes : Map[String,Option[String]],
      val children : Seq[Node]
    ) extends Node

    object HTML extends RegexParsers {
      val node: Parser[Node] = text | element

      val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode

      val label: Parser[String] = """(\w[:\w]*)""".r

      val value : Parser[String] = """("[^"]*"|\w+)""".r

      val attribute : Parser[(String,Option[String])] = label ~ (
        "=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
      ) ^^ { case (k ~ v) => k -> v }

      val element: Parser[Element] = (
        ("<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" )
          ~ rep(node) ~
        ("</" ~> label <~ ">")
      ) ^^ {
        case (tag ~ attributes ~ children ~ close) => Element(tag, Map(attributes : _*), children)
      }
    }

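What I want is some way to ensure that my opening and closing tags match.
I think that to do this I need some kind of `flatMap` combinator, roughly `Parser[A] => (A => Parser[B]) => Parser[B]`, so that I can use the opening tag to construct the parser for the closing tag. But I don't see anything matching that signature in the library.
What is the proper way to do this?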
You can write a method that takes a tag name and returns a parser for the closing tag with that name:

    object HTML extends RegexParsers {
      lazy val node: Parser[Node] = text | element

      val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode

      val label: Parser[String] = """(\w[:\w]*)""".r

      val value : Parser[String] = """("[^"]*"|\w+)""".r

      val attribute : Parser[(String, Option[String])] = label ~ (
        "=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
      ) ^^ { case (k ~ v) => k -> v }

      val openTag: Parser[String ~ Seq[(String, Option[String])]] =
        "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">"

      def closeTag(name: String): Parser[String] = "</" ~> name <~ ">"

      val element: Parser[Element] = openTag.flatMap {
        case (tag ~ attrs) =>
          rep(node) <~ closeTag(tag) ^^
            (children => Element(tag, attrs.toMap, children))
      }
    }

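Note that you also need to make `node` lazy, because it refers to `text` and `element`, which are initialized after it. Now you get a clean error message for mismatched tags: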

    scala> HTML.parse(HTML.element, "<a></b>")
    res0: HTML.ParseResult[Element] =
    [1.6] failure: `a' expected but `b' found

    <a></b>
         ^

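To illustrate why making `node` lazy matters, here is a minimal sketch of Scala's initialization order that does not involve the parser library at all. The object names `Eager` and `Deferred` are made up for this example. A strict `val` that reads a later `val` of the same object sees it before it has been assigned, while a `lazy val` is only evaluated on first access, after the object has finished initializing.

```scala
// Hypothetical objects for illustration only; no parser library involved.
object Eager {
  // `second` has not been assigned yet when this line runs, so it is still null.
  val first: String = "first sees: " + second
  val second: String = "ready"
}

object Deferred {
  // Evaluated on first access, by which time `second` has been assigned.
  lazy val first: String = "first sees: " + second
  val second: String = "ready"
}

object InitOrderDemo {
  def main(args: Array[String]): Unit = {
    println(Eager.first)    // first sees: null
    println(Deferred.first) // first sees: ready
  }
}
```

In the parser, the strict version of `node` would call `|` on a `text` that is still `null`, which is why the forward reference has to be deferred.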
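I've been a little more verbose than necessary for the sake of clarity. If you want concision, you can skip the `openTag` and `closeTag` methods and write `element` like this, for example (`>>` is an alias for `flatMap`):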

    val element = "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" >> {
      case (tag ~ attrs) =>
        rep(node) <~ "</" ~> tag <~ ">" ^^
          (children => Element(tag, attrs.toMap, children))
    }

I'm sure an even more concise version is possible, but in my opinion even this one is edging toward unreadability.
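For readers who want to see what `flatMap` contributes independently of the library, here is a small sketch with a hand-rolled parser type. Everything in it (`P`, `lit`, `name`, `MatchedTags`, `emptyElement`) is hypothetical and is not the scala-parser-combinators API. It only shows the shape of the idea: the parser for the closing tag is built from the value produced by the opening tag.

```scala
// A deliberately tiny parser type for illustration; NOT the library's Parser.
// A parser takes the input and returns the parsed value plus the remaining input.
final case class P[A](run: String => Option[(A, String)]) {
  // Run this parser, then use its result to choose the next parser.
  def flatMap[B](f: A => P[B]): P[B] =
    P(in => run(in).flatMap { case (a, rest) => f(a).run(rest) })

  def map[B](f: A => B): P[B] =
    P(in => run(in).map { case (a, rest) => (f(a), rest) })
}

object P {
  // Matches a fixed string at the start of the input.
  def lit(s: String): P[String] =
    P(in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None)

  // Matches one or more letters.
  val name: P[String] = P { in =>
    val n = in.takeWhile(_.isLetter)
    if (n.isEmpty) None else Some((n, in.drop(n.length)))
  }
}

object MatchedTags {
  import P._

  // Accepts "<x></x>" for any name x; the closing-tag parser depends on `tag`.
  val emptyElement: P[String] =
    for {
      _   <- lit("<")
      tag <- name
      _   <- lit(">")
      _   <- lit("</" + tag + ">") // built from the earlier result
    } yield tag
}
```

With this sketch, `MatchedTags.emptyElement.run("<a></a>")` succeeds and `MatchedTags.emptyElement.run("<a></b>")` returns `None`, which mirrors what `closeTag(tag)` does in the answer above.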