This is on CDH with Spark 1.6.
I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I am using the Databricks spark-csv jar.
val textData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("inferSchema", "true")
  .option("nullValue", "null")
  .load("test.csv")
I use inferSchema to build the schema for the resulting DataFrame. The printSchema() function gives the following output for the code above:
scala> textData.printSchema()
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: string (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = …
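
To look at what the sample rows contain independently of Spark, here is a minimal plain-Python sketch that tries a couple of date patterns on each column. It is not how spark-csv infers types; the `PATTERNS` list and the helper names `classify` and `classify_columns` are hypothetical and exist only for this illustration.

```python
import csv
import io
from datetime import datetime

# The two sample rows from test.csv above.
SAMPLE = """a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
"""

# Hypothetical candidate patterns, tried in order (not spark-csv's own list).
PATTERNS = [
    ("timestamp", "%Y-%m-%d %H:%M:%S.%f"),
    ("date", "%Y-%m-%d"),
]


def classify(value):
    """Return the name of the first pattern that parses the whole value."""
    for name, pattern in PATTERNS:
        try:
            datetime.strptime(value, pattern)
            return name
        except ValueError:
            continue
    return "string"


def classify_columns(text):
    """Classify each column; fall back to 'string' if its rows disagree."""
    rows = list(csv.reader(io.StringIO(text)))
    kinds = []
    for column in zip(*rows):
        found = {classify(value) for value in column}
        kinds.append(found.pop() if len(found) == 1 else "string")
    return kinds


column_kinds = classify_columns(SAMPLE)
print(column_kinds)
```

In this sketch the fourth column holds date-only values and the sixth holds timestamps with a fractional second, so a single pattern such as yyyy-MM-dd HH:mm:ss does not literally describe both columns.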

I have two DataFrames named left and right.
scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)
Then I join them to get the joined DataFrame. It is a left outer join. Anyone interested in the natjoin function can find it here.
scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")
scala> joinedData.printSchema
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)
Because it is a left outer join, the real_labelVal column has null values wherever a user_uid is not present in right.
scala> …
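
To make the null behaviour concrete without Spark, here is a minimal plain-Python sketch of a left outer join over lists of dicts. The sample rows and the helper `left_outer_join` are hypothetical; only the column names come from the schemas above, and this is not the natjoin function.

```python
# Hypothetical sample rows; only the column names come from the schemas above.
left = [
    {"user_uid": 1.0, "labelVal": 0.0, "probability_score": 0.9},
    {"user_uid": 2.0, "labelVal": 1.0, "probability_score": 0.4},
]
right = [
    {"user_uid": 1.0, "real_labelVal": 1.0},
]


def left_outer_join(left_rows, right_rows, key, right_columns):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    index = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(row[key])
        merged = dict(row)
        for column in right_columns:
            merged[column] = match[column] if match is not None else None
        joined.append(merged)
    return joined


joined = left_outer_join(left, right, "user_uid", ["real_labelVal"])
for row in joined:
    print(row)
```

The row with user_uid 2.0 has no partner in right, so its real_labelVal comes out as None, even though that column never contains a missing value in right itself.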

I use the following script to read a file from standard input with numpy.
#!/usr/bin/env python
import numpy as np
import sys
data = np.genfromtxt(sys.stdin, delimiter=",")
print data.shape
print data
This works for files with more than one row, but it does not work for this file:
1,2,2,2,2,2,1,1,1
I run it like this:
$ cat input-file.txt | ./test.py
The output is as follows:
(9,)
[ 1. 2. 2. 2. 2. 2. 1. 1. 1.]
It should have shape (1, 9), that is, one row with nine columns. Does anyone know how to fix it?
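
One possible way to get a two-dimensional result, sketched here under the assumption that the input is always meant to be treated as a table of rows, is to wrap the result in `np.atleast_2d`. The sketch reads from `io.StringIO` instead of `sys.stdin` so that it is self-contained.

```python
import io

import numpy as np

# A single-row input, as in the file above.
single_row = io.StringIO("1,2,2,2,2,2,1,1,1\n")
# genfromtxt returns a 1-D array for one row; atleast_2d promotes it to 2-D.
data = np.atleast_2d(np.genfromtxt(single_row, delimiter=","))
print(data.shape)

# A multi-row input is already 2-D and passes through unchanged.
multi_row = io.StringIO("1,2,3\n4,5,6\n")
data_multi = np.atleast_2d(np.genfromtxt(multi_row, delimiter=","))
print(data_multi.shape)
```

Note that a file containing a single column with several rows also comes back one-dimensional from `genfromtxt`, and `np.atleast_2d` would turn it into one row rather than one column, so that case needs separate handling.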

I use jsonschema2pojo-maven-plugin v0.4.7 to generate POJO classes from a JSON schema. A sample schema follows:
"features": {
    "title": "Feature",
    "description": "Name and type of every feature in the model",
    "type": "array",
    "items": {
        "properties": {
            "columnName": {
                "description": "Name of the table column",
                "type": "string"
            },
            "featureName": {
                "description": "Name of that column's feature for the pipeline",
                "type": "string"
            },
            "type": {
                "description": "Type of the feature",
                "type": "string"
            }
        },
        "required": ["columnName", "type"]
    }
}
The resulting POJO class looks something like this:
public class Feature {
/**
* Name of the table column
*
*/
@JsonProperty("columnName")
private …