How to "negative select" columns in Spark's DataFrame

Bla*_*aer 19 scala dataframe apache-spark apache-spark-sql

I can't figure this out, but I'm guessing it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now say I have an Array containing the names of this df's columns:

val column_names = Array("A", "B", "C")

I want to use df.select() in such a way that I can specify which columns not to select. Example: say I do not want to select column "B". I tried

df.select(column_names.filter(_!="B"))

but this does not work:

org.apache.spark.sql.DataFrame cannot be applied to (Array[String])

So here it says it should work with a Seq instead. However, trying

df.select(column_names.filter(_!="B").toSeq)

results in

org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).

What am I doing wrong?

zer*_*323 36

Since Spark 1.4 you can use the drop method:

Scala:

case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")

Python:

df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
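Note that in these early versions drop takes a single column name per call, so dropping several columns means chaining calls (e.g. folding drop over a list of names). The shape of that fold can be sketched without Spark, using a plain list of column names as a toy stand-in for the DataFrame:

```python
from functools import reduce

# Toy model: a "frame" is just its ordered list of column names,
# and drop(frame, name) removes that single name -- mirroring the
# PySpark pattern reduce(lambda d, c: d.drop(c), names_to_drop, df).
def drop(columns, name):
    return [c for c in columns if c != name]

columns = ["A", "B", "C", "D"]
names_to_drop = ["B", "D"]
remaining = reduce(drop, names_to_drop, columns)
print(remaining)  # ['A', 'C']
```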


Edi*_*ice 8

I ran into the same problem and solved it this way (oaffdf is a DataFrame):

val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)

https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
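The diff step above (computing the kept column names first, then issuing a single select) can be checked without Spark. A Python sketch of the same difference-with-order logic, for distinct column names:

```python
# Mimic Scala's columns.diff(dropColNames) for distinct column names:
# keep every column name not in the drop list, preserving column order.
def diff(columns, drop_names):
    drop_set = set(drop_names)
    return [c for c in columns if c not in drop_set]

all_columns = ["col1", "col7", "col9", "col121"]
drop_col_names = ["col7", "col121"]
feat_col_names = diff(all_columns, drop_col_names)
print(feat_col_names)  # ['col1', 'col9']
```

The kept names would then be mapped to Column objects and passed to select with `: _*`, as in the answer above.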


hui*_*ker 5

OK, it's ugly, but this quick spark-shell session shows something that works:

scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21

scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]

scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]

scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]

scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)

scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)

scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]

Providing varargs to a function that requires one is treated in other SO questions. The signature of select is there to ensure your list of selected columns is never empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
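That Scala signature, select(col: String, cols: String*), requires one explicit first argument precisely so an empty selection cannot be expressed, hence the head/tail dance. A toy Python analogue of such a signature (select here is a hypothetical stand-in, not the PySpark API, whose select accepts a plain list):

```python
# Toy stand-in for Scala's select(col: String, cols: String*):
# the first column is a required positional argument, the rest are
# varargs, so calling it with zero columns is a compile/runtime error.
def select(col, *cols):
    return [col, *cols]

column_names = ["a", "b"]
selected = [c for c in column_names if c != "b"]
# Scala: twocol.select(selectedCols.head, selectedCols.tail: _*)
# Python analogue: unpack head and tail explicitly.
result = select(selected[0], *selected[1:])
print(result)  # ['a']
```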