How do I prevent some words from running together when replacing unwanted characters?

B. *_*non 1 c# regex replace trim

I want to strip out all characters such as commas, periods, quotation marks, and so on, so that a line like this:

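The infant Hans Patrick received his mammarial balm in the usual way, and not through the instrumentality of a patent bottle. One of his caprices, when yet a child, was to scream with all the force of his little lungs, when he was severely chastised by his parents. This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity.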

...is converted to the following:

The infant Hans Patrick received his mammarial balm in the usual way and not through the instrumentality of a patent bottle One of his caprices when yet a child was to scream with all the force of his little lungs when he was severely chastised by his parents This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity

That way, I can split the line at the spaces to get the individual words, with no punctuation at the ends of the words.

I'm trying to do that with the following code:

Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9 '-]");
. . .
doc1StrArray = File.ReadAllLines(sDoc1Path, Encoding.UTF8);
. . .
foreach (string line in doc1StrArray) 
{
    trimmedLine = line;
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericSpaceApostropheAndHyphen.Replace(trimmedLine, "");
    string[] subWords = trimmedLine.Split();

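...but it doesn't work in all cases. I don't understand why it usually works but sometimes strips out a space character, running two words together, so that after stepping through the second line of the code above the line ends up like this: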

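The infant Hans Patrick received his mammarial balm in theusual way and not through the instrumentality of a patentbottle One of his caprices when yet a child was to screamwith all the force of his little lungs when he was severelychastised by his parents This singular habit was but aforeshadowing of that genius which has rendered him soeminent in his maturity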

As a result, some words get merged into a single word (with no space between them):

theusual
patentbottle
screamwith
severelychastised
aforeshadowing
soeminent

Why does this happen, and how can I prevent it?

Nic*_*ick 7

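The whitespace between those words does not appear to be a space character. Consider how the text looks in a fixed-width font when it is wrapped at the first problem (the usual):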

The infant Hans Patrick received his mammarial balm in the 
usual way, and not through the instrumentality of a patent 
bottle. One of his caprices, when yet a child, was to scream 
with all the force of his little lungs, when he was severely 
chastised by his parents. This singular habit was but a 
foreshadowing of that genius which has rendered him so 
eminent in his maturity.

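This shows that all of the problems occur at line breaks, so the characters in question look like newlines. Because the character class only keeps a literal space, any other whitespace character is removed along with the punctuation. You can fix this by changing the space in the regex to \s, which preserves all forms of whitespace (note that the \ must be escaped in a regular C# string literal):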

Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9\\s'-]"); 
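To see the difference side by side, here is a minimal sketch that runs both patterns over the same input. The sample string, the "\n" standing in for whatever non-space whitespace is really in the file, and the SplitWords helper are all assumptions made for illustration; they are not taken from the question's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Program
{
    // Hypothetical helper: remove every character matched by `unwanted`,
    // then split the result on whitespace.
    public static string[] SplitWords(string line, Regex unwanted)
    {
        string cleaned = unwanted.Replace(line, "");
        // Split() with no arguments splits on all whitespace characters.
        return cleaned.Split();
    }

    public static void Main()
    {
        // Assumed sample: "\n" plays the role of the non-space whitespace.
        string line = "in the\nusual way, and not through a patent\nbottle.";

        // Pattern from the question: a literal space is the only whitespace kept.
        var spaceOnly = new Regex("[^a-zA-Z0-9 '-]");
        // Pattern from the answer (verbatim string, so \s needs no doubling).
        var anyWhitespace = new Regex(@"[^a-zA-Z0-9\s'-]");

        Console.WriteLine(string.Join("|", SplitWords(line, spaceOnly)));
        // in|theusual|way|and|not|through|a|patentbottle

        Console.WriteLine(string.Join("|", SplitWords(line, anyWhitespace)));
        // in|the|usual|way|and|not|through|a|patent|bottle
    }
}
```

With the first pattern the newline is deleted like any other unwanted character, so the words on either side of it fuse. With \s it survives the replacement and Split() still separates the words.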