use*_*018 9 apache-spark apache-spark-sql pyspark
I have to filter a large file using multiple patterns. The problem is that I am not sure how best to use rlike for this. For example:
df = spark.createDataFrame(
[
('www 17 north gate',),
('aaa 45 north gate',),
('bbb 56 west gate',),
('ccc 56 south gate',),
('Michigan gate',),
('Statue of Liberty',),
('57 adam street',),
('19 west main street',),
('street burger',)
],
[ 'poi']
)
df.show()
+-------------------+
|                poi|
+-------------------+
|  www 17 north gate|
|  aaa 45 north gate|
|   bbb 56 west gate|
|  ccc 56 south gate|
|      Michigan gate|
|  Statue of Liberty|
|     57 adam street|
|19 west main street|
|      street burger|
+-------------------+
If I use the following two patterns on this data, I can do:
# raw strings, so that \s reaches the regex engine without Python escape warnings
pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
|              poi|
+-----------------+
|www 17 north gate|
|    Michigan gate|
|Statue of Liberty|
|    street burger|
+-----------------+
But what if I have 40 different patterns? I suppose I could use a loop like this:
for pat in [pat1, pat2, ..., patn]:
    df = df.filter(~df['poi'].rlike(pat))
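Is this the right approach? The original data is in Chinese, so please ignore whether the patterns themselves are effective. I just want to see how to handle multiple regex patterns.
The two approaches you suggest have the same execution plan:
Using the two patterns in succession: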
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ &&
#         NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]
Using a loop:
from functools import reduce  # reduce is not a builtin in Python 3

# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ &&
# NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
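To see why the reduce call does the same thing as the loop in the question, here is a minimal sketch that needs no Spark session. It assumes a plain Python list as a stand-in for the DataFrame and `re.search` as a stand-in for rlike; `drop_matches` is a hypothetical helper, not part of PySpark. Spark evaluates rlike with Java regular expressions, so this only mirrors the behaviour for simple patterns like these.

```python
import re
from functools import reduce

rows = [
    'www 17 north gate', 'aaa 45 north gate', 'bbb 56 west gate',
    'ccc 56 south gate', 'Michigan gate', 'Statue of Liberty',
    '57 adam street', '19 west main street', 'street burger',
]
pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"
patterns = [pat1, pat2]

def drop_matches(rows, pat):
    """Hypothetical stand-in for df.filter(~df['poi'].rlike(pat))."""
    return [r for r in rows if not re.search(pat, r)]

# Explicit loop: apply one filter per pattern, reassigning each time.
looped = rows
for pat in patterns:
    looped = drop_matches(looped, pat)

# reduce folds the same function over the pattern list, starting from rows.
folded = reduce(drop_matches, patterns, rows)

print(looped == folded)
print(folded)
```

Both variants apply one filter per pattern in list order, which fits the two plans above being identical apart from the order of the conditions.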
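Another approach is to combine all the patterns into one, using "|".join() to chain them together with the regex or operator. The main difference is that this results in only one call to rlike (as opposed to one call per pattern in the other approaches):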
df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
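With 40 patterns, the join can be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original answer: `combine_patterns` is a hypothetical name, and wrapping each pattern in a non-capturing group `(?:...)` is an added precaution. It is not strictly needed for a plain join, since `|` already has the lowest precedence, but it keeps each pattern self-contained if the combined string is later embedded in a larger expression. Python's `re` is used here only as a local sanity check; Spark itself uses Java regular expressions.

```python
import re

def combine_patterns(patterns):
    """Join patterns into a single alternation, one non-capturing group each."""
    return "|".join(f"(?:{p})" for p in patterns)

pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"
combined = combine_patterns([pat1, pat2])

rows = [
    'www 17 north gate', 'aaa 45 north gate', 'bbb 56 west gate',
    'ccc 56 south gate', 'Michigan gate', 'Statue of Liberty',
    '57 adam street', '19 west main street', 'street burger',
]

# Keep only the rows that match none of the patterns.
kept = [r for r in rows if not re.search(combined, r)]
print(combined)
print(kept)
```

The combined string can then be passed to rlike exactly as in the answer, for example `df.filter(~df['poi'].rlike(combined))`.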