-4 python split text-mining uppercase
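I am using pdfminer.six in Python to extract long text data. Unfortunately, the miner does not always work well, especially with paragraphs and line wrapping. For example, I get the following output: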
"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."
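Now I would like to insert a space wherever a lowercase letter (or a digit) is followed by an uppercase letter and then a lowercase letter, so that "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", while "...CEO..." stays "...CEO...".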
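I have only found solutions for splitting a string at uppercase letters (/sf/answers/225134311/), but I was not able to adapt them. Unfortunately, I am completely new to Python.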
We can try a regex approach here using re.sub:
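import re

inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
# Insert a space at every zero-width position between a non-capital word character and a capital letter
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)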
This prints:
2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.
The regex used here says to insert a space at any point where:
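(?<![A-Z\W])  what precedes is NOT a capital letter A-Z and NOT a non-word
              character, i.e. it is a word character other than A-Z, such as
              a lowercase letter or a digit (the start of the string also
              satisfies this, since nothing precedes it)
(?=[A-Z])     and what follows is a capital letter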
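One edge case worth knowing: because the negative lookbehind also succeeds at the very start of the string, an input that begins with a capital letter gets a leading space. As a minimal sketch of one way around this (not part of the original answer), a positive lookbehind can require an ASCII lowercase letter or digit on the left. The helper name `split_fused_words` is hypothetical, and the pattern assumes the text is ASCII; accented lowercase letters would not trigger a split.

```python
import re

# Zero-width boundary: an ASCII lowercase letter or digit on the left,
# an ASCII capital letter on the right. (Assumption: ASCII-only text.)
_BOUNDARY = re.compile(r'(?<=[a-z0-9])(?=[A-Z])')


def split_fused_words(text: str) -> str:
    """Insert a space wherever a lowercase letter or digit meets a capital letter.

    Hypothetical helper for illustration; runs of capitals such as "CEO"
    are left untouched because no lowercase letter or digit precedes them.
    """
    return _BOUNDARY.sub(' ', text)


if __name__ == "__main__":
    print(split_fused_words("2018Annual ReportInvesting for Growth"))
    # A string that starts with a capital letter gets no leading space here
    print(split_fused_words("Annual ReportInvesting"))
```

If the leading space from the original pattern is the only concern, calling `.lstrip()` on its result would be a simpler alternative.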