Gau*_*sal 3 apache-spark pyspark sparkr
Using pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:
newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
#Intersect works for the entire DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)
name surname
1 George Williams
#Intersect does not work for single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)
How can I make intersect work on a single column?
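The intersect function takes two Spark DataFrames, not two columns. Use the select function to pull the column you want from each DataFrame into a one-column DataFrame, then intersect those.
In SparkR: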
newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))
In pyspark:
newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name'))