spark streaming fileStream

use*_*993 9 streaming scala apache-spark

我正在使用火花流编程但是在scala上遇到了一些问题.我正在尝试使用StreamingContext.fileStream函数

这个函数的定义是这样的:

def fileStream[K, V, F <: InputFormat[K, V]](directory: String)(implicit arg0: ClassManifest[K], arg1: ClassManifest[V], arg2: ClassManifest[F]): DStream[(K, V)]
Run Code Online (Sandbox Code Playgroud)

创建一个输入流,监视与Hadoop兼容的文件系统以获取新文件,并使用给定的键值类型和输入格式读取它们.文件名以.开头.被忽略了.K用于读取HDFS文件的密钥类型V用于读取HDFS文件的值类型F用于读取HDFS文件目录的输入格式用于监视新文件的HDFS目录

我不知道如何传递Key和Value的类型.火花流中的我的代码:

val ssc = new StreamingContext(args(0), "StreamingReceiver", Seconds(1),
  System.getenv("SPARK_HOME"), Seq("/home/mesos/StreamingReceiver.jar"))

// Create a NetworkInputDStream on target ip:port and count the
val lines = ssc.fileStream("/home/sequenceFile")
Run Code Online (Sandbox Code Playgroud)

用于编写hadoop文件的Java代码:

public class MyDriver {

private static final String[] DATA = { "One, two, buckle my shoe",
        "Three, four, shut the door", "Five, six, pick up sticks",
        "Seven, eight, lay them straight", "Nine, ten, a big fat hen" };

public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
        writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
                value.getClass());
        for (int i = 0; i < 100; i++) {
            key.set(100 - i);
            value.set(DATA[i % DATA.length]);
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key,
                    value);
            writer.append(key, value);
        }
    } finally {
        IOUtils.closeStream(writer);
    }
}
Run Code Online (Sandbox Code Playgroud)

}

cmb*_*ter 6

如果你想使用fileStream,你将不得不在调用它时提供所有3种类型的参数.在调用之前Key,您需要知道您的类型ValueInputFormat类型.如果您的类型是LongWritable,Text并且TextInputFormat,您会这样称呼fileStream:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/sequenceFile")
Run Code Online (Sandbox Code Playgroud)

如果这3种类型恰好是你的类型,那么你可能想要使用,textFileStream因为它不需要任何类型参数和委托fileStream使用我提到的那3种类型.使用它看起来像这样:

val lines = ssc.textFileStream("/home/sequenceFile")
Run Code Online (Sandbox Code Playgroud)