Split a string at uppercase letters in Python, but only if a lowercase letter follows

-4 python split text-mining uppercase

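I am using pdfminer.six in Python to extract long text data. Unfortunately, the miner does not always work well, especially with paragraphs and line wrapping. For example, I get the following output: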

"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."

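Now I want to insert a space wherever a lowercase letter (or a digit) is followed by an uppercase letter that is in turn followed by a lowercase letter. That way "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", while "...CEO..." stays "...CEO...".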

I only found a solution for splitting a string at uppercase letters (/sf/answers/225134311/), but I was not able to adapt it. Unfortunately, I am completely new to Python.

Tim*_*sen 5

We can try a regex approach here using re.sub:

import re

inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
# Insert a space at every zero-width position matched by the two lookarounds
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)

This prints:

2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.

The regex used here says to insert a space at any point where:

(?<![A-Z\W])  what precedes is a word character EXCEPT
              for capital letters
(?=[A-Z])     and what follows is a capital letter
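Two details of this pattern may be worth checking on your own data. First, a negative lookbehind also succeeds when nothing precedes the position, so a string that begins with a capital letter gets a leading space. Second, the pattern does not require a lowercase letter after the capital, so "theCEO" is split as well. The sketch below compares the pattern from the answer with two variants that are my own assumptions, not part of the answer: `GUARDED` (a positive lookbehind, so the start of the string never matches) and `STRICT` (follows the wording of the question literally). The names `LOOSE`, `GUARDED`, `STRICT`, and `add_spaces` are hypothetical.

```python
import re

# Pattern from the answer: split before a capital letter unless the preceding
# character is a capital letter or a non-word character.
LOOSE = re.compile(r'(?<![A-Z\W])(?=[A-Z])')

# Assumed variant: same rule, but a positive lookbehind requires that some
# word character other than A-Z actually precedes the position.
GUARDED = re.compile(r'(?<=[^\WA-Z])(?=[A-Z])')

# Assumed variant matching the question's wording: a lowercase letter or digit,
# then a capital letter that is itself followed by a lowercase letter.
STRICT = re.compile(r'(?<=[a-z0-9])(?=[A-Z][a-z])')


def add_spaces(text, pattern):
    """Insert a space at every zero-width position the pattern matches."""
    return pattern.sub(' ', text)


sample = "Annual ReportInvesting theCEO"
print(repr(add_spaces(sample, LOOSE)))    # ' Annual Report Investing the CEO'
print(repr(add_spaces(sample, GUARDED)))  # 'Annual Report Investing the CEO'
print(repr(add_spaces(sample, STRICT)))   # 'Annual Report Investing theCEO'
```

On the sample sentence from the question, all three patterns produce the same result, since it starts with a digit and "CEO" is preceded by a space. Which variant fits best depends on whether acronyms glued to a preceding word should be separated.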