I have a dataframe with several categorical columns. I'm trying to compute the chi-squared statistic between two of them using the built-in function:
from pyspark.ml.stat import ChiSquareTest
r = ChiSquareTest.test(df, 'feature1', 'feature2')
However, it gives me this error:
IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'
The data type of feature1 is:
feature1: double (nullable = true)
Could you help me resolve this?
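The error is saying that ChiSquareTest.test expects its features column to be a Vector type, not a raw double. The usual workaround is to pack feature1 into a single-element vector with VectorAssembler before running the test. A minimal sketch, shown in Scala to match the rest of this section (the PySpark VectorAssembler and ChiSquareTest APIs are analogous; the column name "features" is just an illustrative choice):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.ChiSquareTest

// Pack the raw double column into the Vector column the test requires.
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1"))
  .setOutputCol("features")
val assembled = assembler.transform(df)

// Test the assembled vector against feature2 used as the label column.
val result = ChiSquareTest.test(assembled, "features", "feature2")
result.show(truncate = false)

The result dataframe carries pValues, degreesOfFreedom, and statistics columns, one entry per assembled feature.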
df.groupBy("col1", "col2", "col3") works perfectly fine.
However, when I try the following:
val dimensions = Seq("col1", "col2", "col3")
df.groupBy(dimensions)
I get this error:
<console>:38: error: overloaded method value groupBy with alternatives:
(col1: String,cols: String*)org.apache.spark.sql.GroupedData <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.GroupedData
cannot be applied to (Seq[String])
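As the compiler output shows, neither groupBy overload accepts a Seq[String] directly: one takes (String, String*) and the other (Column*), so the sequence has to be expanded into varargs. A sketch of the two usual spellings, reusing the dimensions value from above:

// Expand the Seq into the (head, tail) shape of groupBy(col1: String, cols: String*):
df.groupBy(dimensions.head, dimensions.tail: _*)

// Or map the names to Columns and splat them into groupBy(cols: Column*):
import org.apache.spark.sql.functions.col
df.groupBy(dimensions.map(col): _*)

Both produce the same grouping as the literal df.groupBy("col1", "col2", "col3").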
I want to do something like this:
df
.withColumn("newCol", <some formula>)
.filter(s"""newCol > ${(math.min(max("newCol").asInstanceOf[Double],10))}""")
The exception I get:
org.apache.spark.sql.Column cannot be cast to java.lang.Double
Can you suggest a way to achieve what I want?
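The exception comes from max("newCol"): in org.apache.spark.sql.functions it builds a Column expression, not a driver-side Double, so asInstanceOf[Double] cannot succeed. One way out is to run the aggregation first, pull the scalar back to the driver, and only then build the filter string. A hedged sketch, where someFormula is a hypothetical stand-in for the elided <some formula>:

import org.apache.spark.sql.functions.{col, max}

// Hypothetical stand-in for the asker's <some formula>.
val someFormula = col("a") * 2

val withNewCol = df.withColumn("newCol", someFormula)

// agg + first() evaluates the max expression into a plain Double on the driver.
val maxVal = withNewCol.agg(max("newCol")).first().getDouble(0)

// With a real Double in hand, math.min and the interpolated filter work as intended.
val filtered = withNewCol.filter(s"newCol > ${math.min(maxVal, 10.0)}")

Note this triggers an extra Spark job for the aggregation, so caching withNewCol may be worth considering if the formula is expensive.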