use*_*714 11 python sql apache-spark apache-spark-sql pyspark
嗨,我有以下问题:
numeric.registerTempTable("numeric").
Run Code Online (Sandbox Code Playgroud)
我要过滤的所有值都是文字空字符串,而不是N/A或Null值.
我试过这三个选项:
numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')
numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')
sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")
不幸的是,numeric_filtered总是空的.我检查并且数字具有应根据这些条件过滤的数据.
以下是一些示例值:
低高正常
3.5 5.0 null
2.0 14.0 null
null 38.0 null
null null null
1.0 null 4.0
zer*_*323 21
您正在使用逻辑连接(AND).这意味着所有列必须'null'与要包含的行不同.让我们举例说明使用filter版本作为示例:
numeric = sqlContext.createDataFrame([
('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
('null', 'null', 'null'), ('1.0', 'null', '4.0')],
('low', 'high', 'normal'))
numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
numeric_filtered_2 = numeric_filtered_1.where(
numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null| 4.0|
## +---+----+------+
numeric_filtered_3 = numeric_filtered_2.where(
numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
Run Code Online (Sandbox Code Playgroud)
您尝试过的所有剩余方法都遵循完全相同的架构.你需要的是逻辑分离(OR).
from pyspark.sql.functions import col
numeric_filtered = df.where(
(col('LOW') != 'null') |
(col('NORMAL') != 'null') |
(col('HIGH') != 'null'))
numeric_filtered.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
Run Code Online (Sandbox Code Playgroud)
或者使用原始SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
Run Code Online (Sandbox Code Playgroud)
另请参见:Pyspark:when子句中的多个条件