How to traverse/iterate a Dataset in Spark Java?

Abh*_* Vk 5 java iterator apache-spark apache-spark-dataset apache-spark-2.0

I am trying to iterate over a Dataset to do some string-similarity calculations such as Jaro-Winkler or cosine similarity. I convert my Dataset to a list of rows and then traverse it with a for statement, which is not an efficient Spark way of doing it. So I am looking for a better approach in Spark.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Sample {

    public static void main(String[] args) {
        // A single SparkSession replaces the redundant JavaSparkContext/SQLContext.
        SparkSession spark = SparkSession.builder()
                .appName("JavaTokenizerExample")
                .master("local[*]")
                .getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create("Mysore", "Mysuru"),
                RowFactory.create("Name", "FirstName"));
        StructType schema = new StructType(new StructField[] {
                new StructField("Word1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("Word2", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> oldDF = spark.createDataFrame(data, schema);
        oldDF.show();
        List<Row> rowslist = oldDF.collectAsList(); // pulls the whole Dataset to the driver
    }
}

I have found many JavaRDD examples, which are not clear to me. An example for Dataset would help me.

aba*_*hel 20

You can use org.apache.spark.api.java.function.ForeachFunction as below.

oldDF.foreach((ForeachFunction<Row>) row -> System.out.println(row));
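Since the question is about computing string similarity per row without collecting the Dataset, here is a minimal sketch of the idea: a plain-Java similarity function (I use a simple Levenshtein-based score as a stand-in for Jaro-Winkler; the class and method names are mine, not from the question) that can be invoked from a per-row function such as ForeachFunction or MapFunction.

```java
// Sketch: a normalized edit-distance similarity that can be applied per row
// inside Spark functions, instead of collecting the Dataset to the driver.
// Levenshtein is used here as a simple stand-in for Jaro-Winkler.
public class StringSimilarity {

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Normalized similarity in [0, 1]: 1.0 means identical strings.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    public static void main(String[] args) {
        // "Mysore" vs "Mysuru" differ by two substitutions over six characters.
        System.out.println(similarity("Mysore", "Mysuru"));
        System.out.println(similarity("Name", "FirstName"));
    }
}
```

To stay distributed, such a function could be wrapped in a map rather than a foreach, e.g. oldDF.map((MapFunction<Row, Double>) row -> StringSimilarity.similarity(row.getString(0), row.getString(1)), Encoders.DOUBLE()), which yields a new Dataset of scores instead of printing them on the executors.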