DataFrame from List<String> in Java

dev*_*v ツ 1 java apache-spark spark-dataframe

  • Spark version: 1.6.2
  • Java version: 7

I have data in a List<String>. It looks like this:

[[dev, engg, 10000], [karthik, engg, 20000]..]

I know the schema of this data:

name (String)
degree (String)
salary (Integer)

I tried:

JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);

Output:

root
 |-- _corrupt_record: string (nullable = true)


+-----------------------------+
|_corrupt_record              |
+-----------------------------+
|[dev, engg, 10000]           |
|[karthik, engg, 20000]       |
+-----------------------------+

This happens because the strings in the List<String> are not valid JSON.

Do I need to build proper JSON first, or is there another way to do this?

aba*_*hel 11

You can create a DataFrame from the List<String>, then use selectExpr with split to get the DataFrame you want.

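import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // Sample data: one comma-separated record per string
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // Single-column DataFrame; the column is named "value"
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // Split each record on ',' and name the three resulting columns
        DataFrame df1 = df.selectExpr(
                "split(value, ',')[0] as name",
                "split(value, ',')[1] as degree",
                "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}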

You will get the following output.

root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|    dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+

root
 |-- name: string (nullable = true)
 |-- degree: string (nullable = true)
 |-- salary: string (nullable = true)

+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+

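The sample data you provided contains spaces. If you want to strip them and make the salary column an integer, you can use the trim and cast functions as follows.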

// Requires: import static org.apache.spark.sql.functions.*;
df1 = df1.select(trim(col("name")).as("name"), trim(col("degree")).as("degree"), trim(col("salary")).cast("integer").as("salary"));
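
To make the split, trim and cast steps concrete, here is a minimal plain-Java sketch of what they do to a single record, with no Spark dependency and written to compile on Java 7. The class name RowParseSketch and the helper parseLine are hypothetical, introduced only for illustration. The bracket handling assumes the strings may still carry the surrounding [ ] shown in the question's output.

```java
import java.util.Arrays;
import java.util.List;

public class RowParseSketch {

    /** Hypothetical helper: turns one "name, degree, salary" record into typed values. */
    public static Object[] parseLine(String line) {
        String s = line.trim();
        // Assumption: a record may still be wrapped in the brackets shown in the question.
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        // Same idea as split(value, ',') in the selectExpr call
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected 3 fields but got " + parts.length + ": " + line);
        }
        // Same idea as trim(...) on each column
        String name = parts[0].trim();
        String degree = parts[1].trim();
        // Same idea as cast("integer"), except that this throws on bad input
        Integer salary = Integer.valueOf(parts[2].trim());
        return new Object[] { name, degree, salary };
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("[dev, engg, 10000]", "karthik, engg, 20000");
        for (String line : data) {
            System.out.println(Arrays.toString(parseLine(line)));
        }
    }
}
```

One difference to keep in mind: Integer.valueOf throws NumberFormatException on a non-numeric salary, whereas Spark's cast("integer") in this version yields null for values it cannot convert.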