Spark SQL filtering with multiple conditions (selecting with a where clause)

use*_*714 11 python sql apache-spark apache-spark-sql pyspark

Hi, I have the following issue:

numeric.registerTempTable("numeric")

All the values I want to filter on are literal 'null' strings, not N/A or Null values.

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

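Unfortunately, numeric_filtered is always empty. I checked, and numeric does contain data that should be filtered by these conditions.

Here are some sample values: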

LOW   HIGH  NORMAL
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0

zer*_*323 21

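You are using logical conjunction (AND). This means that every column has to differ from 'null' for a row to be included. Let's illustrate that using the filter version as an example: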

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you've tried follow exactly the same pattern. What you need is logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

Or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
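
As a Spark-free sanity check, the difference between the two predicates can be modelled in plain Python. This is only a sketch: the rows are the sample values from the question, the names `rows`, `kept_with_and` and `kept_with_or` are made up for illustration, and `all`/`any` stand in for chaining the per-column conditions with AND or OR.

```python
# Sample rows from the question as (low, high, normal).
# 'null' is a literal string here, not a missing value.
rows = [
    ("3.5", "5.0", "null"),
    ("2.0", "14.0", "null"),
    ("null", "38.0", "null"),
    ("null", "null", "null"),
    ("1.0", "null", "4.0"),
]

# AND: keep a row only if every column differs from 'null'.
kept_with_and = [r for r in rows if all(v != "null" for v in r)]

# OR: keep a row if at least one column differs from 'null'.
kept_with_or = [r for r in rows if any(v != "null" for v in r)]

print(len(kept_with_and))  # 0
print(len(kept_with_or))   # 4
```

Every sample row contains at least one 'null', so the AND version keeps nothing, while the OR version drops only the row where all three columns are 'null'.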

See also: Pyspark: multiple conditions in when clause

  • Thanks! This was very helpful. It is important to enclose each individual condition in parentheses: (col('foo') == 'bar') (2 upvotes)
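
The reason the parentheses matter is Python operator precedence: `|` and `&` bind more tightly than `!=` and `==`. A minimal sketch using the standard `ast` module shows how the two spellings parse. The names `a` and `b` are placeholders for column expressions, and nothing is evaluated here; only the parse tree is inspected.

```python
import ast

# Without parentheses, `|` is applied first, so the expression is read as
# the chained comparison  a != ('null' | b) != 'null'.
unparenthesized = ast.parse("a != 'null' | b != 'null'", mode="eval").body

# With parentheses, each comparison is built first and `|` combines them.
parenthesized = ast.parse("(a != 'null') | (b != 'null')", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

In the unparenthesized form the `|` is applied to the string literal and the right-hand operand rather than to the two comparisons, which is not the intended condition.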