dev*_*v ツ 1 java apache-spark spark-dataframe
I have a List<String> of data. It looks like this:
[[dev, engg, 10000], [karthik, engg, 20000]..]
I know the schema of this data:
name (String)
degree (String)
salary (Integer)
I tried:
JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);
Output:
root
|-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record |
+-----------------------------+
|[dev, engg, 10000] |
|[karthik, engg, 20000] |
+-----------------------------+
This happens because the strings in the List<String> are not valid JSON.
Do I need to build proper JSON, or is there another way to do this?
aba*_*hel 11
You can create a DataFrame from the List<String>, then use selectExpr and split to get the desired DataFrame.
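import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

// Written against the Spark 1.6 Java API; in Spark 2.x and later, DataFrame is Dataset<Row>.
public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // DataFrame with a single string column named "value"
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // Convert: split each line on ',' and name the resulting fields
        DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree", "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}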
You will get the output below.
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
|-- name: string (nullable = true)
|-- degree: string (nullable = true)
|-- salary: string (nullable = true)
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
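The sample data you provided contains spaces. If you want to remove the spaces and make the salary column an integer, you can use the trim and cast functions as follows.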
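// requires: import static org.apache.spark.sql.functions.col;
//           import static org.apache.spark.sql.functions.trim;
df1 = df1.select(trim(col("name")).as("name"), trim(col("degree")).as("degree"), trim(col("salary")).cast("integer").as("salary"));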
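To make the split, trim, and cast steps concrete, here is a minimal plain-Java sketch of what they do to each line, with no Spark involved. The class EmployeeLineParser and its nested Employee type are hypothetical names used only for illustration; they are not part of Spark. One difference to keep in mind: this sketch throws on a malformed line, which is a choice made here and not a description of how Spark handles bad input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Spark API): applies split -> trim -> integer
// conversion to one comma-separated line at a time.
final class EmployeeLineParser {

    // One parsed record matching the schema: name, degree, salary.
    static final class Employee {
        final String name;
        final String degree;
        final int salary;

        Employee(String name, String degree, int salary) {
            this.name = name;
            this.degree = degree;
            this.salary = salary;
        }

        @Override
        public String toString() {
            return name + "|" + degree + "|" + salary;
        }
    }

    // Split on ',', strip surrounding whitespace, and parse the third field as an int.
    static Employee parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        return new Employee(
                parts[0].trim(),
                parts[1].trim(),
                Integer.parseInt(parts[2].trim()));
    }

    static List<Employee> parseAll(List<String> lines) {
        List<Employee> result = new ArrayList<Employee>();
        for (String line : lines) {
            result.add(parse(line));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (Employee e : parseAll(data)) {
            System.out.println(e);
        }
    }
}
```

Without the trim step, the second and third fields would keep their leading space (" engg", " 10000"), and Integer.parseInt would reject " 10000".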