Fre*_*ind 4 java parsing scala line
I want to parse a .mht file with Scala, but I found that my code ends up exactly the same as the Java version.
Here is a sample of the mht file:
From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
charset="utf-8"
type="text/html";
boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit
<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
There is one special line, the boundary, which acts as the separator between blocks:
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
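The first part holds some information about the file and can be ignored. After it come 4 blocks: the first is an html file, and the others are jpg images stored as base64-encoded text.
If I use Java, the code looks like this: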
// Needs java.io.* and java.util.*. substringBetween is assumed to be a helper
// such as Apache Commons Lang's StringUtils.substringBetween.
BufferedReader reader = new BufferedReader(new FileReader(new File("test.mht")));
String line = null;
String boundary = null;
// for a block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;
while ((line = reader.readLine()) != null) {
    // first, get the boundary
    if (boundary == null) {
        if (line.trim().startsWith("boundary=\"")) {
            boundary = substringBetween(line, "\"", "\"");
        }
        continue;
    }
    if (line.equals("--" + boundary)) { // new block
        if (contentType != null) {
            // save data to a file
        }
        encoding = null;
        contentType = null;
        location = null;
        data = new ArrayList<String>();
    } else {
        // the original tested an undeclared `id` here; `encoding` is presumably what was meant
        if (encoding == null || contentType == null || location == null) {
            if (line.trim().startsWith("Content-Type:")) { /* get content type */ }
            // else check encoding
            // else check location
        } else {
            data.add(line);
        }
    }
}
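I tried to rewrite the code in Scala, but the structure came out almost identical; I was just using Scala syntax instead of Java.
Is there a Scala way to do the same job?
PS: I don't want to load the whole file into memory, because the file is very large. Instead, I want to read and parse it line by line.
Thanks for your help!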
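I'm going to explain how to use parser combinators to build a general solution that follows the standards. The other solution presented is much faster, but once you understand how to do this one, you can easily adapt it to other tasks.
First of all, what you are showing is an e-mail message. The format of such messages is defined in a bunch of RFCs. RFC-822 defines the basics of headers and body; it goes into the headers in fair detail but says nothing about the body. RFC-1521 and 1522 discuss MIME, and are themselves revisions of RFC 1341 and 1342. There are many more RFCs on the subject.
The interesting thing is that they provide the grammar for these things, so you can write a parser that breaks the message down correctly. Let's start with a simplified version of RFC822 that ignores almost all the known fields and their formats and simply puts everything in a map. I'm doing this because the grammar is rather long, and the few lines I have here are already enough to compare against the ones in the RFC.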
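With Scala parser combinators, the pieces of each rule are joined with ~ (in the RFC, they are separated by just a space), and I sometimes use <~ or ~> to discard the side that is not interesting. I also use ^^ to convert what was parsed into the data structure I want to work with.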
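To make the operators concrete before the full grammar, here is a minimal sketch on a made-up "key=value" rule. The object KeyValue and its rules are hypothetical and not taken from any RFC. It assumes the combinators are on the classpath; from Scala 2.11 onwards they ship as the separate scala-parser-combinators module.

```scala
import scala.util.parsing.combinator._

// Hypothetical mini-grammar, for illustration only (not from any RFC).
object KeyValue extends RegexParsers {
  def key: Parser[String]   = """[A-Za-z-]+""".r
  def value: Parser[String] = """[^\s=\]]+""".r

  // `key <~ "="` parses both sides but keeps only the key;
  // `~` sequences it with `value`; `^^` turns the pair into a tuple.
  def pair: Parser[(String, String)] =
    (key <~ "=") ~ value ^^ { case k ~ v => k -> v }

  // `"[" ~> pair <~ "]"` parses the brackets but keeps only the pair.
  def bracketed: Parser[(String, String)] = "[" ~> pair <~ "]"
}
```

Calling KeyValue.parseAll(KeyValue.pair, "charset=utf-8") returns a ParseResult; on success, its get method yields the tuple built by ^^.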
import scala.util.parsing.combinator._

/** Object companion to RFC822, containing the Message class,
  * and extending the trait so that it can be used as a parser
  */
object RFC822 extends RFC822 {
  case class Message(header: Map[String, String], text: String)
}

/**
  * Parses `message` according to RFC-822 (http://www.w3.org/Protocols/rfc822/),
  * but without breaking up the contents for each field,
  * nor identifying particular fields.
  *
  * Also, introduces "header" to convert all fields into a map.
  */
class RFC822 extends RegexParsers {
  import RFC822.Message

  override def skipWhitespace = false

  def message = (header <~ CRLF) ~ text ^^ {
    case hd ~ txt => Message(hd, txt)
  }

  // this isn't part of the RFC, but we use it to generate a map
  def header = field.* ^^ { _.toMap }

  def field = (fieldName <~ ":") ~ fieldBody <~ CRLF ^^ { case name ~ body => name -> body }

  def fieldName = """[^:\P{Graph}]+""".r

  // Recursive definition needs a type
  // Also, I use .+ on LWSPChar because it's specified for the lexer,
  // which we are not using
  def fieldBody: Parser[String] = fieldBodyContents ~ (CRLF ~> LWSPChar.+ ~> fieldBody).? ^^ {
    case a ~ Some(b) => a + " " + b // reintroduces a single LWSPChar
    case a ~ None    => a
  }

  def fieldBodyContents = ".*".r

  def CRLF = """\n""".r // this needs to be the regex \n pattern
  def LWSPChar = " " | "\t" // these do not need to be regex
  def text = "(?s).*".r // (?s) makes . match newlines
}
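Now let's handle the content type. The code below implements the specification from RFC-1521. I put the word type between backticks because it is a reserved word in Scala. I'm also making the semicolon optional, because the sample you gave is missing one after the charset definition.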
object ContentType extends ContentType {
  case class Content(`type`: String, subtype: String, parameter: Map[String, String])
}

class ContentType extends RegexParsers {
  import ContentType.Content

  // case-insensitive matching of type and subtype
  def content = ("Content-Type" ~> ":" ~> `type` <~ "/") ~ subtype ~ parameters ^^ {
    case t ~ s ~ p => Content(t, s, p)
  }

  // use this to generate a map
  // *** SEMI-COLON IS NOT OPTIONAL ***
  // I'm making it optional because the example is missing one
  def parameters = (";".? ~> parameter).* ^^ (_.toMap)

  // All values case-insensitive
  def `type` = ( "(?i)application".r | "(?i)audio".r
               | "(?i)image".r       | "(?i)message".r
               | "(?i)multipart".r   | "(?i)text".r
               | "(?i)video".r       | extensionToken
               )

  def extensionToken = xToken | ianaToken
  def ianaToken = failure("IANA token not implemented")
  def xToken = """(?i)x-(?!\s)""".r ~ token ^^ { case a ~ b => a + b }

  def subtype = token
  def parameter = (attribute <~ "=") ~ value ^^ { case a ~ b => a -> b }
  def attribute = token // case-insensitive
  def value = token | quotedString

  def token: Parser[String] = not(tspecials) ~> """\p{Graph}""".r ~ token.? ^^ {
    case a ~ Some(b) => a + b
    case a ~ None    => a
  }

  // Must be in quoted-string,
  // to use within parameter values
  def tspecials = ( "(" | ")" | "<"  | ">" | "@"
                  | "," | ";" | ":"  | "\\" | "\""
                  | "/" | "[" | "]"  | "?" | "="
                  )

  // These are part of RFC822
  def qtext = """[^\\"\n]""".r
  def quotedPair = """\\.""".r
  def quotedString = "\"" ~> (qtext|quotedPair).* <~ "\"" ^^ { _.mkString }
}
We can now use these to parse the text.
object Parser {
  def apply(email: String): Option[(Map[String, String], List[String])] = {
    import RFC822._
    parseAll(message, email) match {
      case Success(result, _) =>
        if (result.header.contains("Content-Type")) Some(getParts(result))
        else Some(result.header -> List(result.text))
      case _ => None
    }
  }

  def getParts(message: RFC822.Message): (Map[String, String], List[String]) = {
    import ContentType._
    parseAll(content, "Content-Type: " + message.header("Content-Type")) match {
      case Success(Content("multipart", _, parameters), _) =>
        // Each delimiter line is "--" followed by the boundary value, so the
        // ^.*? part eats those leading dashes. (?m) lets ^ match at the start
        // of every line rather than only at the start of the text.
        val parts = message.text split ("(?m)^.*?\\Q" + parameters("boundary") + "\\E")
        // parse every piece recursively and collect the resulting bodies
        val bodies = parts flatMap this.apply flatMap (_._2)
        message.header -> bodies.toList
      case _ => message.header -> List(message.text)
    }
  }
}
You can then use it like Parser(email).
Again, I do not recommend this solution for the problem at hand! But learning it may help you later.
This is probably a use case for a very simple state machine.
import scala.collection.mutable.ListBuffer
import scala.io.Source

case class Part(contentType: Option[String], encoding: Option[String],
                location: Option[String], data: ListBuffer[String])

var boundary: String = null
val Boundary = """.*boundary="(.*)"""".r

var state = 0 // 0 means we have not reached the first part yet
val IN_PART = 1
val IN_DATA = 2

var _contentType: Option[String] = None
var _encoding: Option[String] = None
var _location: Option[String] = None
var _data = new ListBuffer[String]()

Source.fromFile("test.mht").getLines().foreach {
  case Boundary(b) => boundary = b
  // a delimiter line is "--" followed by the boundary value
  case line if boundary != null && line == "--" + boundary =>
    _contentType = None
    _encoding = None
    _location = None
    _data = new ListBuffer[String]()
    state = IN_PART
  // a blank line ends the part headers, or ends the data
  case "" => state match {
    case IN_PART => state = IN_DATA
    case IN_DATA =>
      val currentPart = Part(_contentType, _encoding, _location, _data)
      /* deal with currentPart here, e.g. save it to a file */
    case _ =>
  }
  case line => state match {
    case IN_DATA => _data.append(line)
    // split on the first colon only and trim, so "Content-Type: text/html" works too
    case IN_PART => line.split(":", 2).map(_.trim) match {
      case Array("Content-Type", t)              => _contentType = Some(t)
      case Array("Content-Transfer-Encoding", e) => _encoding = Some(e)
      case Array("Content-Location", l)          => _location = Some(l)
      case _ =>
    }
    case _ => // top-level headers before the first part are ignored
  }
}
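The same states can also be written without mutable variables. The following is only a sketch under assumptions, not part of the answer above: the names State, Preamble, InHeaders, InData and MhtStateMachine are hypothetical, the delimiter passed in is the full line ("--" plus the boundary value), and a blank line is assumed to separate each part's headers from its data. Because it is driven by Iterator.scanLeft, only the part currently being read is held in memory.

```scala
// Hypothetical names throughout; a sketch of the state machine as pure transitions.
sealed trait State
case object Preamble extends State
final case class InHeaders(headers: Map[String, String]) extends State
final case class InData(headers: Map[String, String], data: Vector[String]) extends State

final case class Part(headers: Map[String, String], data: Vector[String])

object MhtStateMachine {
  /** One transition: the next state, plus a Part if this line completed one. */
  def step(delimiter: String)(state: State, line: String): (State, Option[Part]) =
    (state, line) match {
      // a delimiter closes the part being read and opens the next one
      case (InData(h, d), `delimiter`) => (InHeaders(Map.empty), Some(Part(h, d)))
      case (_, `delimiter`)            => (InHeaders(Map.empty), None)
      // everything before the first delimiter is ignored
      case (Preamble, _)               => (Preamble, None)
      // a blank line ends the part headers
      case (InHeaders(h), "")          => (InData(h, Vector.empty), None)
      case (InHeaders(h), l) =>
        l.split(":", 2) match {
          case Array(k, v) => (InHeaders(h + (k.trim -> v.trim)), None)
          case _           => (InHeaders(h), None)
        }
      case (InData(h, d), l)           => (InData(h, d :+ l), None)
    }

  /** Lazily yields each part once its closing delimiter has been read.
    * A final part with no delimiter after it is dropped in this sketch. */
  def parts(lines: Iterator[String], delimiter: String): Iterator[Part] =
    lines
      .scanLeft((Preamble: State, Option.empty[Part])) {
        case ((state, _), line) => step(delimiter)(state, line)
      }
      .collect { case (_, Some(part)) => part }
}
```

With the delimiter known, it could be used as MhtStateMachine.parts(Source.fromFile("test.mht").getLines(), "--" + boundary).foreach(...), handling one part at a time.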