Scala: Iterating over a CSV file in a functional way?

Question

Jay*_*ker 6 csv iteration state functional-programming scala

I have CSV files with comment lines that give the column names, and the columns change throughout the file:

#c1,c2,c3
a,b,c
d,e,f
#c4,c5
g,h
i,j

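I want to provide a way to iterate over (only) the data rows of the file as Maps from column name to value (all Strings). So the above would become: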

Map(c1 -> a, c2 -> b, c3 -> c)
Map(c1 -> d, c2 -> e, c3 -> f)
Map(c4 -> g, c5 -> h)
Map(c4 -> i, c5 -> j)

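The files are very large, so reading everything into memory is not an option. Right now I have an Iterator class that keeps some ugly state between hasNext() and next(); I also provide accessors for the current line number and for the actual last line and comment read (in case consumers care about field order). I would like to try doing things in a more functional way.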

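My first idea was a for comprehension: I can iterate over the file's lines, skipping the comment lines with a filter clause, and yield a tuple containing the map, the line number, and so on. The problem is that I need to remember the last column names seen so that I can build the Maps from them. For comprehensions understandably try to discourage keeping state by only letting you define new vals. I learned from this question that I can update member variables in the yield block, but in my case that is exactly the point at which I do not want to update them!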

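I could call a function in the iteration clause that updates the state, but that seems dirty. So what is the best way to do this in a functional style? Abuse for comprehensions? Hack scanLeft? Use a library? Bring out the parser-combinator big guns? Or is a functional style simply not a good match for this problem?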
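
To make the scanLeft option concrete, here is a minimal sketch that is not taken from any of the answers. It assumes the lines arrive as an `Iterator[String]`, that header lines start with `#`, and that fields never contain embedded commas or quotes. The name `rowsAsMaps` is hypothetical.

```scala
// Hypothetical helper (not from the original post): lazily turns lines into maps.
def rowsAsMaps(lines: Iterator[String]): Iterator[Map[String, String]] =
  lines
    // Carry (current column names, optional data row) from one line to the next.
    .scanLeft((Vector.empty[String], Option.empty[Map[String, String]])) {
      case ((columns, _), line) =>
        if (line.startsWith("#"))
          // Header line: remember the new column names, emit no row.
          (line.drop(1).split(",").toVector, None)
        else
          // Data line: keep the column names, emit a row built from them.
          (columns, Some(columns.zip(line.split(",")).toMap))
    }
    // Drop the seed and the header steps, keeping only the emitted rows.
    .collect { case (_, Some(row)) => row }

// Usage with the sample data from the question.
val sample = Iterator("#c1,c2,c3", "a,b,c", "d,e,f", "#c4,c5", "g,h", "i,j")
rowsAsMaps(sample).foreach(println)
```

Because `scanLeft` and `collect` on an `Iterator` are lazy, only the current line and the current column names are held at any time. For a real file, the input could come from `scala.io.Source.fromFile(...).getLines()`. Line numbers could be threaded through the same accumulator if consumers need them.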

Answer 1

Dan*_*ral 5

State Monad FTW!

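Actually, I suck at the State monad. I had a hell of a time writing this, and I have a strong feeling that it could be done much better. In particular, it seems to me that traverse is the way to go, but...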

// Get Scalaz on the job
import scalaz._
import Scalaz._

// Some type aliases to make stuff clearer
type Input         = Stream[String]
type Header        = String
type InternalState = (Input, Header)
type Output        = Option[(Header, String)]
type MyState       = State[InternalState, Output]

// Detect headers
def isHeader(line: String) = line.startsWith("#") // safe on empty lines, unlike line(0)

// From a state, produce an output
def makeLine: (InternalState => Output) = {
    case (head #:: _, _) if isHeader(head) => None
    case (head #:: _, header)              => Some(header -> head)
    case _                                 => None
}

// From a state, produce the next state
def nextLine: (InternalState => InternalState) = {
    case (head #:: tail, _) if isHeader(head) => tail -> head
    case (_ #:: tail, header)                 => tail -> header
    case _                                    => Stream.empty -> ""
}

// My state is defined by the functions producing the next state
// and the output
val myState: MyState = state(s => nextLine(s) -> makeLine(s))    

// Some input to test it. I'm trimming it to avoid problems on REPL
val input = """#c1,c2,c3
a,b,c
d,e,f
#c4,c5
g,h
i,j""".linesIterator.map(_.trim).toStream

// My State/Output Stream -- def to avoid keeping a reference to the head
def stateOutputStream = Stream.iterate(myState(input, "")){ 
        case (s, _) => myState(s) 
    } takeWhile { case ((stream, _), output) => stream.nonEmpty || output.nonEmpty }

// My Output Stream -- flatMap gets rid of the None from the headers
def outputStream = stateOutputStream flatMap { case (_, output) => output }

// Now just get the map
def outputToMap: (Header, String) => Map[String, String] = {
    case (header, line) =>
        val keys = header substring 1 split ","
        val values = line split ","
        (keys zip values).toMap
}

// And this is the result -- note that I'm still avoiding "val" so memory
// won't leak
def result = outputStream map outputToMap.tupled

Answer 2

Did*_*ont 2

Here is a possible solution:

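First, take a look at the answer to "Split up a list at each element satisfying a predicate (Scala)". It gives you a groupPrefix method that splits a list into a list of lists, starting a new sublist whenever an item satisfies the given predicate. That way you can split your lines into sublists that each start with a comment line (the column definitions) followed by the corresponding data.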
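
The referenced groupPrefix is not reproduced in this answer. As a rough sketch of what such a function might look like, assuming the curried signature implied by the call further down, one possible version is shown here. It is a hypothetical implementation, not the code from the linked answer.

```scala
// Hypothetical implementation; the linked answer's actual code may differ.
// Splits xs into sublists, starting a new sublist at each element matching p.
def groupPrefix[T](xs: List[T])(p: T => Boolean): List[List[T]] = xs match {
  case Nil => Nil
  case head :: tail =>
    // A group is its first element plus everything up to the next match.
    val (rest, remaining) = tail.span(x => !p(x))
    (head :: rest) :: groupPrefix(remaining)(p)
}

// Usage with the sample data from the question.
val sampleLines = List("#c1,c2,c3", "a,b,c", "d,e,f", "#c4,c5", "g,h", "i,j")
val groups = groupPrefix(sampleLines)(_.startsWith("#"))
// groups == List(List("#c1,c2,c3", "a,b,c", "d,e,f"), List("#c4,c5", "g,h", "i,j"))
```

Two caveats apply to this sketch: any lines before the first header end up in a group of their own that does not start with a header, and the recursion depth grows with the number of groups because the function is not tail-recursive.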

Then this routine converts one of those sublists (which starts with the column names) into the corresponding list of maps.

import scala.collection.immutable.ListMap 
  // to keep the order of the columns. If not needed, just use Map
def toNamedFields(lines: List[String]) : List[Map[String, String]] = {
  val columns = lines.head.tail.split(",").toList // tail to discard the #
  lines.tail.map{line => ListMap(columns.zip(line.split(",")): _*)}
}

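With that, you can split the lines, build the maps for each group to get a list of lists of maps, and then turn it into a single list with flatten: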

groupPrefix(lines){_.startsWith("#")}.map(toNamedFields).flatten

Asked:	14 years, 5 months ago
Viewed:	3171 times
Last activity:	14 years, 4 months ago