I am using Spark 2.3.0.
As part of an Apache Spark project, I am processing this dataset. When I try to read the CSV with Spark, the rows in the Spark DataFrame do not correspond to the correct rows in the CSV file (see the sample CSV here). The code is as follows:
answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True);
answer_df.show(2)
Output
+--------------------+-------------+--------------------+--------+-----+--------------------+
| Id| OwnerUserId| CreationDate|ParentId|Score| Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"| null| null| null| null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows
However, when I use pandas, it works like a charm.
import pandas as pd

df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3)
Output
Index Id OwnerUserId CreationDate ParentId Score Body
0 92 61 2008-08-01T14:45:37Z 90 13 <p><a href="http://svnbook.red-bean.com/">Vers...
1 124 26 2008-08-01T16:09:47Z 80 12 <p>I wound up using this. It is a kind of a ha...
My observation: Apache Spark treats every line of the CSV file as one record of the DataFrame (which is reasonable), whereas pandas somehow (I am not sure which parameter controls it) works out where each record actually ends.
What I want to know is how to tell Spark to load the DataFrame correctly.
The data to be loaded is shown below; the two lines starting with 92 and 124 should become two records.
Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>
<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>
<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i < strSqlSplit.length; i++){
NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
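For reference, Python's built-in csv module (whose quoting rules pandas follows for this case) treats a doubled quote inside a quoted field as a literal quote, so the multi-line Body stays inside a single record. A minimal sketch, using a condensed copy of the first record above:

import csv, io

sample = (
    'Id,OwnerUserId,CreationDate,ParentId,Score,Body\n'
    '92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>\n'
    '<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"\n'
)

# The default dialect (doublequote=True) keeps the embedded newline and the
# "" sequences inside one field, so we get the header plus exactly one record.
rows = list(csv.reader(io.StringIO(sample)))
print(len(rows))    # 2
print(rows[1][:5])  # ['92', '61', '2008-08-01T14:45:37Z', '90', '13']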
I think you should use option("escape", "\""), since " appears to be used as the quote escape character here.
val q = spark.read
.option("multiLine", true)
.option("header", true)
.option("escape", "\"")
.csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01T16:09:47Z| 80| 12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
After several hours of struggling, I finally found the solution.
Analysis:
In the Stack Overflow data dump provided, a quote (") is escaped by another quote ("). Since Spark's default escape character is a backslash (\) and I had not overridden it, the result was the meaningless output shown above.
Updated code
answer_df = sparkSession.read.\
    csv('./stacksample/Answers_sample.csv',
        inferSchema=True, header=True, multiLine=True, escape='"')
answer_df.show(2)
Note the use of the escape parameter in csv().
Output
+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92| 61|2008-08-01 20:15:37| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01 21:39:47| 80| 12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+
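As a quick sanity check, you can compare the Spark row count with what pandas reports for the same file:

import pandas as pd

spark_count = answer_df.count()
pandas_count = len(pd.read_csv('./stacksample/Answers_sample.csv'))
assert spark_count == pandas_count, (spark_count, pandas_count)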
Hope this helps someone else and saves them some time.