del*_*nce 3 python regex string text mining
I have a text that contains words and numbers. Here is a representative example of the text:
string = "This is a 1example of the text. But, it only is 2.5 percent of all data"
I would like to convert it into something like this:
"This is a 1 example of the text But it only is 2.5 percent of all data"
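So: remove the punctuation (which can be . or , or any other character in string.punctuation), and put a space between digits and words when they are concatenated, while keeping floats such as the 2.5 in my example intact.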
I used the following code:
import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then collapse runs of whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This a start but not there yet !
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Splitting on every run of digits also breaks 2.5 into "2", ".", "5"
item = ' '.join(re.split(r'(\d+)', item))
print(item)
The result is:
>> "This is a 1 example of the text. But, it only is 2 . 5 percent of all data"
I am almost there, but I can't figure out the last piece.
You can use regex lookarounds, like this:
(?<!\d)[.,;:](?!\d)
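The idea is to have a character class that collects the punctuation you want to remove, and to use a negative lookbehind and a negative lookahead so that it only matches punctuation that is neither preceded nor followed by a digit. That is why the dot in 2.5 is left alone.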
import re

# Match . , ; : only when neither neighbouring character is a digit
regex = r"(?<!\d)[.,;:](?!\d)"
test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"
result = re.sub(regex, "", test_str)
The result is:
This is a 1example of the text But it only is 2.5 percent of all data
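Note that this output still contains "1example", so the digit/word spacing asked for in the question is not handled by the substitution alone. A minimal sketch of how the two steps might be combined follows. The function name normalize is hypothetical, building the character class from string.punctuation is an assumption based on the question's wording, and treating "word" as ASCII letters only is a simplification:

```python
import re
import string

# Character class built from every character in string.punctuation
# (assumption: the question wants all of them treated alike).
PUNCT = "[" + re.escape(string.punctuation) + "]"


def normalize(text):
    """Hypothetical helper: strip punctuation, then split digit/letter runs."""
    # Step 1: remove punctuation that is neither preceded nor followed by a digit.
    text = re.sub(r"(?<!\d)" + PUNCT + r"(?!\d)", "", text)
    # Step 2: insert a space at every zero-width boundary between a digit
    # and an ASCII letter, in either order.
    text = re.sub(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)", " ", text)
    return text


print(normalize("This is a 1example of the text. But, it only is 2.5 percent of all data"))
# This is a 1 example of the text But it only is 2.5 percent of all data
```

One limitation of the lookaround approach: punctuation that touches a digit on only one side, such as the trailing period in "in 2015.", is also kept, because the lookbehind sees the digit.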