How can I beat this regex replace?

spe*_*der 7 c# regex string optimization

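After a lot of measuring, I've identified a hotspot in one of our Windows services that I'd like to optimize. We are processing strings that may contain multiple consecutive spaces, and we want to reduce each run to a single space. We use a static compiled regex for this task: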

private static readonly Regex 
    regex_select_all_multiple_whitespace_chars = 
        new Regex(@"\s+",RegexOptions.Compiled);

and then use it as follows:

var cleanString=
    regex_select_all_multiple_whitespace_chars.Replace(dirtyString.Trim(), " ");

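This line is called several million times and is proving to be fairly expensive. I've tried to write something better, but I'm stumped. Given how modest the processing requirements of the regex are, surely there is something faster. Could unsafe processing with pointers speed things up further?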

Edit:

Thanks for the amazing set of responses to this question... most unexpected!

Guf*_*ffa 8

This is about three times faster:

private static string RemoveDuplicateSpaces(string text) {
  StringBuilder b = new StringBuilder(text.Length);
  bool space = false;
  foreach (char c in text) {
    if (c == ' ') {
      if (!space) b.Append(c);
      space = true;
    } else {
      b.Append(c);
      space = false;
    }
  }
  return b.ToString();
}
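Note that the loop above only collapses the ' ' character and does not trim, whereas the question's code applies `\s+` to a trimmed string. As a minimal sketch of how the same single-pass idea could cover that case, the hypothetical helper below (the class and method names are illustrative, not from the answer) uses `char.IsWhiteSpace` and skips leading and trailing runs. It assumes `char.IsWhiteSpace` and `\s` agree on what counts as whitespace for your inputs, which is worth verifying, and it returns an empty string for null input, which is also an assumption.

```csharp
using System;
using System.Text;

public static class WhitespaceCollapser
{
    // Hypothetical helper: collapses every run of whitespace into a single ' '
    // and drops leading and trailing whitespace.
    public static string CollapseWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        var b = new StringBuilder(text.Length);
        bool pendingSpace = false; // a whitespace run was seen but not yet emitted

        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember a separator only once some non-whitespace has been
                // written; this is what drops leading whitespace.
                pendingSpace = b.Length > 0;
            }
            else
            {
                if (pendingSpace) b.Append(' ');
                b.Append(c);
                pendingSpace = false;
            }
        }

        // A trailing run leaves pendingSpace set but never emitted,
        // so the end of the string is trimmed as well.
        return b.ToString();
    }
}
```

Deferring the space until the next non-whitespace character avoids a separate `Trim()` call and the extra string it allocates.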


Jen*_*und 7

How about this...

public string RemoveMultiSpace(string test)
{
    var words = test.Split(new char[] { ' ' },
        StringSplitOptions.RemoveEmptyEntries);
    return string.Join(" ", words);
}

Test case run with NUnit.
Test times are in milliseconds (the decimal separator is a comma).

Regex Test time: 338,8885
RemoveMultiSpace Test time: 78,9335

Test code:

private static readonly Regex regex_select_all_multiple_whitespace_chars =
   new Regex(@"\s+", RegexOptions.Compiled);

[Test]
public void Test()
{
    string startString = "A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      ";
    string cleanString;
    Trace.WriteLine("Regex Test start");
    int count = 10000;
    Stopwatch timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = regex_select_all_multiple_whitespace_chars.Replace(startString, " ");
    }
    var elapsed = timer.Elapsed;
    Trace.WriteLine("Regex Test end");
    Trace.WriteLine("Regex Test time: " + elapsed.TotalMilliseconds);

    Trace.WriteLine("RemoveMultiSpace Test start");
    timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = RemoveMultiSpace(startString);
    }
    elapsed = timer.Elapsed;
    Trace.WriteLine("RemoveMultiSpace Test end");
    Trace.WriteLine("RemoveMultiSpace Test time: " + elapsed.TotalMilliseconds);
}

public string RemoveMultiSpace(string test)
{
    var words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join(" ", words);
}

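Edit: I ran some more tests and added Guffa's StringBuilder-based method "RemoveDuplicateSpaces".
My conclusion is that the StringBuilder method is faster when there are many spaces, but with fewer spaces the string split method is slightly faster.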

Cleaning file with about 30000 lines, 10 iterations
RegEx time elapsed: 608,0623
RemoveMultiSpace time elapsed: 239,2049
RemoveDuplicateSpaces time elapsed: 307,2044

Cleaning string, 10000 iterations:
A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      
RegEx time elapsed: 590,3626
RemoveMultiSpace time elapsed: 159,4547
RemoveDuplicateSpaces time elapsed: 137,6816

Cleaning string, 10000 iterations:
A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      
RegEx time elapsed: 290,5666
RemoveMultiSpace time elapsed: 64,6776
RemoveDuplicateSpaces time elapsed: 52,4732

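One caveat with RemoveMultiSpace as written: it splits only on ' ', so tabs and newlines are left in place, unlike the `\s+` regex in the question. A possible variant, sketched here under the assumption that you want all whitespace treated as a delimiter, passes a null separator array, which `String.Split` documents as meaning "split on white-space characters". The class and method names are hypothetical, and this variant was not part of the timings above.

```csharp
using System;

public static class SplitJoinCollapser
{
    // Hypothetical variant of RemoveMultiSpace: a null separator array makes
    // String.Split treat white-space characters as the delimiters.
    public static string RemoveMultiWhitespace(string text)
    {
        // The cast picks the char[] overload; a bare null would be ambiguous.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Empty entries are removed, so leading and trailing whitespace
        // disappears too and no separate Trim() is needed.
        return string.Join(" ", words);
    }
}
```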


Kob*_*obi 6

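Currently, you are replacing a single space with another space. Try matching \s{2,} instead (or something similar, if you also want to replace single newlines and other characters).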
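A minimal sketch of that suggestion, reusing the question's static compiled regex setup (the class, field, and method names here are illustrative): with `\s{2,}` a lone space is never matched, so it is never rewritten. The trade-off is that a lone tab or newline is also left as-is, so the output can differ from the `\s+` version on such input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCollapser
{
    // Matches only runs of two or more whitespace characters.
    private static readonly Regex MultipleWhitespace =
        new Regex(@"\s{2,}", RegexOptions.Compiled);

    public static string Collapse(string dirtyString)
    {
        // Same call shape as in the question: trim first, then replace runs.
        return MultipleWhitespace.Replace(dirtyString.Trim(), " ");
    }
}
```

Whether this is measurably faster than `\s+` depends on how many single spaces the input contains, so it is worth timing against your own data.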