How to efficiently check whether a Spark DataFrame contains a list of words?

Question

Kis*_*tai 3 python dataframe apache-spark pyspark

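I am trying to do the following as efficiently as possible with a PySpark DataFrame. I have a DataFrame in which one column contains text, and a list of words by which I want to filter the rows. So:

The DataFrame looks like this: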

df:
col1    col2   col_with_text
a       b      foo is tasty
12      34     blah blahhh
yeh     0      bar of yums

The list would be list = [foo,bar], so the result would be:

result:
col1    col2   col_with_text
a       b      foo
yeh     0      bar

Afterwards, I will not only do exact string matching but also test for similarity using SequenceMatcher. This is what I have tried so far:

def check_keywords(x):
   words_list = ['foo','bar']

   for word in x
       if word == words_list[0] or word == words_list[1]:
           return x

result = df.map(lambda x: check_keywords(x)).collect()

Unfortunately this did not work. Can anyone help me? Thanks in advance.

Answer 1

MaF*_*aFF 7

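You should consider using the pyspark sql module functions instead of writing a UDF; the module has several regexp-based functions.

First, let's start with a more complete sample DataFrame: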

df = sc.parallelize([["a","b","foo is tasty"],["12","34","blah blahhh"],["yeh","0","bar of yums"], 
                     ['haha', '1', 'foobar none'], ['hehe', '2', 'something bar else']])\
    .toDF(["col1","col2","col_with_text"])

If you want to filter rows based on whether they contain one of the words in words_list, you can use rlike:

import pyspark.sql.functions as psf
words_list = ['foo','bar']
df.filter(psf.col('col_with_text').rlike(r'(^|\s)(' + '|'.join(words_list) + r')(\s|$)')).show()

    +----+----+------------------+
    |col1|col2|     col_with_text|
    +----+----+------------------+
    |   a|   b|      foo is tasty|
    | yeh|   0|       bar of yums|
    |hehe|   2|something bar else|
    +----+----+------------------+
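
If the words may contain regexp metacharacters (for example `c++`), it is probably safer to escape them before joining. The sketch below builds the same whole-word pattern and checks it against the sample strings with Python's `re` module as a stand-in for Spark's Java regexp engine. The helper name `build_word_pattern` is hypothetical, and the assumption is that both engines treat this simple pattern the same way.

```python
import re

def build_word_pattern(words):
    """Hypothetical helper: whole-word alternation pattern for rlike."""
    # Escape each word so characters such as '+' or '.' are matched literally.
    escaped = [re.escape(w) for w in words]
    return r'(^|\s)(' + '|'.join(escaped) + r')(\s|$)'

pattern = build_word_pattern(['foo', 'bar'])

samples = ['foo is tasty', 'blah blahhh', 'bar of yums',
           'foobar none', 'something bar else']

# Keep only the strings that contain one of the words as a whole word.
matched = [s for s in samples if re.search(pattern, s)]
print(matched)
```

The resulting `pattern` string can then be passed to `psf.col('col_with_text').rlike(pattern)` in place of the hand-built expression.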

If you want to extract the string that matches the regular expression, you can use regexp_extract:

# (?<=^|\s) is a lookbehind: the word must follow the start of the string or whitespace.
# A lookahead (?=^|\s) in that position would only ever match at the start of the string.
df.withColumn(
        'extracted_word',
        psf.regexp_extract('col_with_text', r'(?<=^|\s)(' + '|'.join(words_list) + r')(?=\s|$)', 0))\
    .show()

    +----+----+------------------+--------------+
    |col1|col2|     col_with_text|extracted_word|
    +----+----+------------------+--------------+
    |   a|   b|      foo is tasty|           foo|
    |  12|  34|       blah blahhh|              |
    | yeh|   0|       bar of yums|           bar|
    |haha|   1|       foobar none|              |
    |hehe|   2|something bar else|           bar|
    +----+----+------------------+--------------+
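
The question also mentions a later similarity test with SequenceMatcher. That kind of fuzzy comparison cannot be written as a regexp, so a UDF is probably unavoidable for that step. Below is a minimal sketch of the per-row logic in plain Python. The function name `best_match` and the `threshold` of 0.8 are assumptions made for illustration, not something from the question.

```python
from difflib import SequenceMatcher

def best_match(text, words, threshold=0.8):
    """Return the keyword most similar to any token of text, or None.

    Hypothetical helper: the 0.8 threshold is an arbitrary choice.
    """
    best_word, best_ratio = None, 0.0
    for token in text.split():
        for word in words:
            # ratio() is 2*M/T: M matched characters, T total length of both strings.
            ratio = SequenceMatcher(None, token, word).ratio()
            if ratio > best_ratio:
                best_word, best_ratio = word, ratio
    return best_word if best_ratio >= threshold else None

near = best_match('fooo is tasty', ['foo', 'bar'])
none = best_match('blah blahhh', ['foo', 'bar'])
print(near, none)
```

To apply it to the DataFrame, the function could be wrapped with `psf.udf(lambda t: best_match(t, words_list), StringType())` (with `StringType` imported from `pyspark.sql.types`) and used in `withColumn`. Expect this to be noticeably slower than the built-in regexp functions, since every row is passed through Python.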
