Alb*_*nto 35 python dataframe apache-spark apache-spark-sql pyspark
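I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.
More specifically, I have a DataFrame with only one Column, of type ArrayType(StringType()), and I want to filter the DataFrame using the array length as the filter. A snippet is shown below.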
df = sqlContext.read.parquet("letters.parquet")
df.show()
# The output will be
# +------------+
# | tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# | [S, U, S]|
# +------------+
# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show() # But it says that the TypeError: object of type 'Column' has no len(), so the previous statement is obviously incorrect.
I have read the Column documentation, but didn't find any property useful for this. I appreciate any help!
zer*_*323 60
In Spark >= 1.5 you can use the size function:
from pyspark.sql.functions import col, size
df = sqlContext.createDataFrame([
(["L", "S", "Y", "S"], ),
(["L", "V", "I", "S"], ),
(["I", "A", "N", "A"], ),
(["I", "L", "S", "A"], ),
(["E", "N", "N", "Y"], ),
(["E", "I", "M", "A"], ),
(["O", "A", "N", "A"], ),
(["S", "U", "S"], )],
("tokens", ))
df.where(size(col("tokens")) <= 3).show()
## +---------+
## | tokens|
## +---------+
## |[S, U, S]|
## +---------+
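As a rough illustration of why `len(df.tokens)` fails while `size(col("tokens"))` works: Python's built-in `len()` needs an object with `__len__` that returns an integer immediately, whereas a column is an expression evaluated later, once per row. The sketch below is a toy model in plain Python, not PySpark's implementation; `Expr`, `toy_col` and `toy_size` are hypothetical names invented for this example.

```python
class Expr:
    """Toy stand-in for a lazily evaluated column expression (NOT PySpark's Column)."""

    def __init__(self, fn, name):
        self.fn = fn      # function applied to one row when the expression is evaluated
        self.name = name  # human-readable description of the expression

    def __le__(self, other):
        # Comparison builds a new expression instead of returning True/False now.
        return Expr(lambda row: self.fn(row) <= other, f"({self.name} <= {other!r})")

    def eval(self, row):
        return self.fn(row)


def toy_col(name):
    """Hypothetical helper: refer to a field of a row by name."""
    return Expr(lambda row: row[name], name)


def toy_size(expr):
    """Hypothetical helper: an expression for the length of an array-valued expression."""
    return Expr(lambda row: len(expr.fn(row)), f"size({expr.name})")


rows = [{"tokens": ["L", "S", "Y", "S"]}, {"tokens": ["S", "U", "S"]}]

# The predicate is only a description until it is evaluated against each row.
predicate = toy_size(toy_col("tokens")) <= 3
kept = [row for row in rows if predicate.eval(row)]
print(predicate.name)
print(kept)

# The expression object has no __len__, so the built-in len() rejects it.
try:
    len(toy_col("tokens"))
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The same split applies to the question's snippet: `len()` runs on the driver before any row exists, while `size` only adds a step to the expression that Spark evaluates per row.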
In Spark < 1.5, a UDF should do the trick:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, udf
size_ = udf(lambda xs: len(xs), IntegerType())
df.where(size_(col("tokens")) <= 3).show()
## +---------+
## | tokens|
## +---------+
## |[S, U, S]|
## +---------+
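One caveat with the UDF route: a Python UDF receives `None` for a null cell, and `len(None)` raises a `TypeError`. A minimal sketch of a null-tolerant variant follows, assuming the column may contain nulls; `safe_len` is a hypothetical helper name, and the Spark wiring is left in comments so that the function itself can be checked without a cluster.

```python
from typing import Optional, Sequence


def safe_len(xs: Optional[Sequence]) -> Optional[int]:
    """Return len(xs), or None when xs is None (i.e. the cell was null)."""
    return None if xs is None else len(xs)


# Plain-Python check of the helper.
print(safe_len(["S", "U", "S"]))
print(safe_len(None))

# Wrapping it as a UDF would mirror the snippet above (requires pyspark):
# from pyspark.sql.functions import col, udf
# from pyspark.sql.types import IntegerType
# size_ = udf(safe_len, IntegerType())
# df.where(size_(col("tokens")) <= 3).show()
```

With this variant a null array produces a null length, the comparison `<= 3` is then null as well, and `where` drops that row instead of failing the job.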
If you use HiveContext, then the size UDF with raw SQL should work with any version:
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
## +--------------------+
## | tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+
For string columns you can either use the udf defined above or the length function:
from pyspark.sql.functions import length
df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()
## +---+
## | k|
## +---+
## |bar|
## +---+
小智 5
Here is an example for strings in Scala:
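// Assumes a spark-shell session, where `sc` and the `$"..."` / `toDF` implicits are available.
val stringData = Seq("Maheswara", "Mokshith")
val df = sc.parallelize(stringData).toDF
// Keep only the strings of length 8 or less.
df.where(length($"value") <= 8).show
// +--------+
// |   value|
// +--------+
// |Mokshith|
// +--------+
// Add the length as a separate column.
df.withColumn("length", length($"value")).show
// +---------+------+
// |    value|length|
// +---------+------+
// |Maheswara|     9|
// | Mokshith|     8|
// +---------+------+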