Ash*_*y O (10 votes) · tags: python, regex, apache-spark, pyspark, databricks
I have a dataframe like:
ID Notes
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Let's say, for example, that only 3 employees need to be checked for: John, Stacy, or Marsha. I'd like to make a new column like this:
ID Notes Employee
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Would regex or grep be better here? What kind of function should I try? Thanks!
Edit: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns with binary values for each employee? i.e.:
ID Notes John Stacy Marsha
2345 Checked by John 1 0 0
2398 Verified by Stacy 0 1 0
3983 Double Checked on 2/23/17 by Marsha 0 0 1
mrs*_*vas (19 votes)
regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4) extracts the employee name from any position in the text column (col('Notes')) where it appears after the word "by" followed by one or more whitespace characters.
Create a sample dataframe:
# `sc` is the SparkContext provided by Databricks / the pyspark shell
data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
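
On Spark 2.x and later the same frame can also be built through the SparkSession entry point rather than the SparkContext; a minimal equivalent sketch, assuming the usual spark session object is available (as it is in Databricks notebooks and the pyspark shell):

# Equivalent construction via SparkSession (Spark 2.x+)
df = spark.createDataFrame(data, ['ID', 'Notes'])
df.show()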
Do the required imports:
from pyspark.sql.functions import regexp_extract, col
Extract the Employee name from the Notes column of df using regexp_extract(column_name, regex, group_number).
Here the regex '(.)(by)(\s+)(\w+)' means:

(.)   - any single character
(by)  - the literal text "by"
(\s+) - one or more whitespace characters
(\w+) - one or more word characters (the employee name)

and group_number is 4 because the group (\w+) comes 4th in the expression.
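
As a quick sanity check outside Spark, Python's built-in re module shows the same group numbering; this snippet is purely illustrative and not part of the original answer:

import re

# The 4th capturing group, (\w+), holds the employee name.
m = re.search(r'(.)(by)(\s+)(\w+)', 'Checked by John')
print(m.groups())   # (' ', 'by', ' ', 'John')
print(m.group(4))   # John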
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
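
Since the question says only John, Stacy, or Marsha should ever be extracted, a hedged variation (not in the original answer) is to spell out the alternation, so that any other name simply yields an empty string:

# Capture only one of the three known employees;
# regexp_extract returns '' where the pattern does not match.
known = df.withColumn('Employee',
                      regexp_extract(col('Notes'), r'by\s+(John|Stacy|Marsha)', 1))
known.show()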
regexp_extract(col('Notes'), '.by\s+(\w+)', 1) looks like a cleaner version, with a single capturing group so the group number is 1; check the regex being used against your own data.
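
And if the binary columns from the question's edit are still wanted, here is a minimal sketch using when/otherwise; the column names and the \b word boundaries are my own choices, not from the original answer:

from pyspark.sql.functions import when

# One 0/1 indicator column per employee; rlike uses Java regex,
# and \b keeps 'John' from also matching e.g. 'Johnson'.
flags = df
for name in ['John', 'Stacy', 'Marsha']:
    flags = flags.withColumn(name, when(col('Notes').rlike(r'\b' + name + r'\b'), 1).otherwise(0))
flags.show()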