Pandas: split on a regex

Question


I have a pandas df with a column of comma-separated characteristics, like this:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect


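I would like to split this column into multiple dummy-variable columns, but I cannot figure out how to start. I tried splitting the column like this: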

df['incident_characteristics'].str.split(',', expand=True)


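However, this does not work because some of the descriptions contain commas themselves. Instead, I need to split on a regex match of a comma followed by a space and an uppercase letter. Can str.split take a regex? If so, how is this done?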

I think this regex will do what I need:

,\s[A-Z]


Answer 1

Wik*_*żew 14

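Yes, split supports regular expressions. Given your requirement to

split on a regex match of a comma followed by a space and an uppercase letter

you can use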

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)


See the regex demo.

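Details

\s*,\s* - a comma surrounded by 0+ whitespace characters
(?=[A-Z]) - a positive lookahead: match only if an uppercase ASCII letter follows (the letter itself is not consumed, so it stays in the result)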
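To see how this pattern behaves on a single value, here is a minimal sketch using the standard library's re.split. The text variable is a shortened excerpt of the sample in the question, trimmed for illustration; the name is not from the original post.

```python
import re

# Pattern from the answer: a comma with optional surrounding whitespace,
# matched only when an uppercase ASCII letter follows.
pattern = r'\s*,\s*(?=[A-Z])'

# Shortened excerpt of the question's sample value (trimmed for illustration).
text = ("Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), "
        "Suicide - Attempt, Murder/Suicide")

parts = re.split(pattern, text)
for part in parts:
    print(part)
```

The commas inside "(murder, accidental, suicide)" are left alone here only because the words after them are lowercase.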

However, it seems you also do not want to match commas inside parentheses. To exclude them, add a (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) followed by a ):

r'\s*,\s*(?=[A-Z])(?![^()]*\))'


This prevents matching commas before capitalized words inside parentheses, provided the parentheses are not nested.

See another regex demo.
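As a rough sketch of how the parenthesis-aware pattern could feed into the dummy-variable columns the question asks for: the two-row frame below is made-up data (with capitalized words inside the parentheses to exercise the second lookahead), and the explode / pd.get_dummies / groupby chain is one possible approach, not something stated in the answer.

```python
import pandas as pd

# Parenthesis-aware pattern from the answer.
pattern = r'\s*,\s*(?=[A-Z])(?![^()]*\))'

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({
    'incident_characteristics': [
        'Shot - Dead (Murder, Accidental), Suicide - Attempt',
        'Suicide - Attempt, Murder/Suicide',
    ]
})

# One column per position, as in the answer.
split_cols = df['incident_characteristics'].str.split(pattern, expand=True)

# One 0/1 column per distinct characteristic: split into lists,
# put each item on its own row, one-hot encode, then collapse
# back to one row per original index.
items = df['incident_characteristics'].str.split(pattern).explode()
dummies = pd.get_dummies(items, dtype=int).groupby(level=0).max()

print(split_cols)
print(dummies)
```

The comma inside "(Murder, Accidental)" is not treated as a separator, so that label stays in one piece.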
