Nan*_*ndu 5 apache-spark apache-spark-sql pyspark apache-spark-2.0
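How can I write a DataFrame that has duplicate column names to a CSV file after a join operation? I am currently using the following code:
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/',header = 'true')
This writes the DataFrame "dfFinal" to "/home/user/output", but it does not work when the DataFrame contains duplicate columns. Below is the dfFinal DataFrame.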
+----------+---+-----------------+---+-----------------+
| NUMBER | ID|AMOUNT | ID| AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092| 1| 30| 1| 40|
|9090909093| 2| 30| 2| 50|
|9090909090| 3| 30| 3| 60|
|9090909094| 4| 30| 4| 70|
+----------+---+-----------------+---+-----------------+
The DataFrame above is the result of a join operation. Writing it to a CSV file gives me the following error.
pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
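The names in the error message are lowercased, which is consistent with Spark comparing column names case-insensitively by default. A minimal sketch for checking a list of column names before writing, in plain Python; the helper name `find_duplicate_columns` is hypothetical and not part of PySpark:

```python
from collections import Counter

def find_duplicate_columns(columns):
    """Return the lowercased names that occur more than once, sorted."""
    counts = Counter(name.lower() for name in columns)
    return sorted(name for name, n in counts.items() if n > 1)

# Column names of dfFinal as shown in the table above
cols = ['NUMBER', 'ID', 'AMOUNT', 'ID', 'AMOUNT']
print(find_duplicate_columns(cols))
```

With a real DataFrame, the list to pass in would be `dfFinal.columns`.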
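When you specify the join column as a string or as a list of column names, the result contains only one copy of that column [1]. PySpark example: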
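# Sample data matching the question (assumes an existing SparkSession named `spark`)
l = [('9090909092',1,30),('9090909093',2,30),('9090909090',3,30),('9090909094',4,30)]
r = [(1,40),(2,50),(3,60),(4,70)]
left = spark.createDataFrame(l, ['NUMBER','ID','AMOUNT'])
right = spark.createDataFrame(r, ['ID','AMOUNT'])
# Joining on the column name (a string) keeps a single ID column
df = left.join(right, "ID")
df.show()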
+---+----------+------+------+
| ID| NUMBER |AMOUNT|AMOUNT|
+---+----------+------+------+
| 1 |9090909092| 30 | 40 |
| 3 |9090909090| 30 | 60 |
| 2 |9090909093| 30 | 50 |
| 4 |9090909094| 30 | 70 |
+---+----------+------+------+
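However, this still produces duplicate column names in the DataFrame for every column that is not a join column (the AMOUNT column in this example). For such columns, you should assign new names, either before or after the join, using the toDF DataFrame function [2]: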
# New names are applied by position, so the order must match df.columns
newNames = ['ID','NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()
+---+----------+-------+-------+
| ID| NUMBER |LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
| 1 |9090909092| 30 | 40 |
| 3 |9090909090| 30 | 60 |
| 2 |9090909093| 30 | 50 |
| 4 |9090909094| 30 | 70 |
+---+----------+-------+-------+
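When there are many columns, typing the new names by hand gets tedious. A minimal sketch, in plain Python, of generating unique names that can then be passed to toDF; the helper name `unique_names` and the numeric suffix scheme are assumptions made for illustration, not part of PySpark:

```python
def unique_names(columns):
    """Return the column names with repeats made unique by a numeric suffix.

    Names are compared case-insensitively. The first occurrence keeps its
    name; later ones get _2, _3, and so on. This sketch does not check
    whether a generated name collides with an existing column.
    """
    seen = {}
    result = []
    for name in columns:
        key = name.lower()
        seen[key] = seen.get(key, 0) + 1
        if seen[key] == 1:
            result.append(name)
        else:
            result.append("{}_{}".format(name, seen[key]))
    return result

# Column names of the joined DataFrame from the example above
print(unique_names(['ID', 'NUMBER', 'AMOUNT', 'AMOUNT']))
```

With the joined DataFrame, this would be used as `df = df.toDF(*unique_names(df.columns))`, after which the write call from the question should no longer hit the duplicate column check.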
[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicate-column.html
[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF