Pau*_*ith 10 c# regex stringbuilder
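I'm building a stress-testing client that hammers servers and analyzes responses using as many threads as the client can muster. I constantly find myself throttled by garbage collection (and/or the lack of it), and in most cases it comes down to strings that I instantiate only to pass them to a Regex or an Xml parsing routine.
If you decompile the Regex class, you'll see that internally it uses StringBuilders to do nearly everything, but you can't pass it a string builder; it dives down into private methods before it starts using them, so extension methods won't solve it either. You're in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.
This is not a case of pedantic over-optimization. I've looked at the question on Regex replacements inside a StringBuilder, and at others. I've also profiled my app to see where the ceilings come from, and Regex.Replace() does introduce significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine the XML responses for errors and embedded diagnostic codes. I've already gotten rid of every other inefficiency that was throttling throughput, and I've even cut a lot of the Regex overhead by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences. But it seems to me that someone would have wrapped up a custom StringBuilder-based (or better yet, Stream-based) Regex and Xml parsing utility by now.
OK, rant over, but am I going to have to do this myself?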
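UPDATE: I found a workaround that lowered peak memory consumption from several gigabytes to a few hundred megabytes, so I'm posting it below. I'm not adding it as an answer because a) I generally don't like doing that, and b) I still want to find out whether anyone has taken the time to customize StringBuilder to do Regexes (or vice versa).
In my case, I could not use XmlReader because the stream I'm ingesting contains some invalid binary content in certain elements. In order to parse the XML, I have to empty out those elements. I was previously using a single static compiled Regex instance to do the replace, and this consumed memory like mad (I'm trying to process ~300 10KB docs/sec). The changes that drastically reduced consumption were: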
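1. An IndexOf method for StringBuilder.
2. A WildcardReplace method that allows one wildcard character (* or ?) per invocation.
3. A WildcardReplace() call in place of the Regex usage, to empty the contents of the offending elements.
This is very unpretty and tested only as far as my own purposes required; I would have made it more elegant and powerful, but YAGNI and all that, and I'm in a hurry. Here's the code: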
    /// <summary>
    /// Performs a basic wildcard find and replace on a string builder, observing one of two
    /// wildcard characters: * matches any number of characters, ? matches a single character.
    /// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
    /// will cause an exception.
    /// All characters in <paramref name="replaceWith"/> are treated as literal parts of
    /// the replacement text.
    /// </summary>
    /// <param name="sb">The builder to modify in place.</param>
    /// <param name="find">The pattern: literal text on both sides of a single * or ?.</param>
    /// <param name="replaceWith">The literal replacement text.</param>
    /// <returns>The same <see cref="StringBuilder"/> instance, to allow chaining.</returns>
    public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
        if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
            throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
        }
        // are we matching one character, or any number?
        bool matchOneCharacter = find.Contains("?");
        string[] parts = matchOneCharacter ?
            find.Split(new char[] { '?' }, StringSplitOptions.RemoveEmptyEntries)
            : find.Split(new char[] { '*' }, StringSplitOptions.RemoveEmptyEntries);
        // The loop below indexes parts[0] and parts[1], so the wildcard must sit between two literals.
        if (parts.Length != 2) {
            throw new ArgumentException("Expected literal text on both sides of exactly one wildcard.", "find");
        }
        int startItemIdx;
        int endItemIdx;
        int newStartIdx = 0;
        int length;
        // sb.IndexOf is a StringBuilder extension method (StringBuilder has none built in).
        // Compare with >= 0 so that a match at index 0 is not skipped.
        while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) >= 0
            && (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) >= 0) {
            length = (endItemIdx + parts[1].Length) - startItemIdx;
            // With "?" wildcard, find parameter length should equal the length of its match;
            // otherwise skip this candidate and keep scanning instead of giving up.
            if (matchOneCharacter && length != find.Length) {
                newStartIdx = startItemIdx + 1;
                continue;
            }
            newStartIdx = startItemIdx + replaceWith.Length;
            sb.Remove(startItemIdx, length);
            sb.Insert(startItemIdx, replaceWith);
        }
        return sb;
    }
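WildcardReplace relies on an IndexOf method that StringBuilder does not provide out of the box, and the version I used is not reproduced here. As a rough stand-in, here is a minimal sketch of what such an extension could look like. The class name StringBuilderSearchExtensions, the ordinal (case-sensitive) comparison and the default start index are all assumptions for illustration, not the original helper.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the StringBuilder IndexOf helper used by WildcardReplace.
public static class StringBuilderSearchExtensions
{
    // Returns the index of the first ordinal occurrence of `value` in `sb`
    // at or after `startIndex`, or -1 if there is none.
    public static int IndexOf(this StringBuilder sb, string value, int startIndex = 0)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (startIndex < 0) throw new ArgumentOutOfRangeException(nameof(startIndex));
        if (value.Length == 0) return startIndex <= sb.Length ? startIndex : -1;

        // Last position at which `value` could still fit inside the builder.
        int lastStart = sb.Length - value.Length;
        for (int i = startIndex; i <= lastStart; i++)
        {
            int j = 0;
            while (j < value.Length && sb[i + j] == value[j]) j++;
            if (j == value.Length) return i; // every character matched
        }
        return -1; // also reached when startIndex is past the end
    }
}
```

This naive scan allocates nothing, which is the point here, but it goes through the StringBuilder indexer for every character. On a builder made up of many internal chunks that indexer can be slow, so it is worth profiling before relying on it for large documents.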
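Try this one. Everything is char-based and relatively low-level for efficiency. You can use any number of your *s or ?s. However, your * is now ✪ and your ? is now ★. It took about three days of work to make it as clean as possible. You can even run multiple queries in a single sweep!
Example usage: wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be") results in "become".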
    ////////////////////////////////////////////////////////////////////////////////////////////////////////
    ///////////// Search for a string/s inside 'text' using the 'find' parameter, and replace with a string/s using the replace parameter
    // ✪ represents multiple wildcard characters (non-greedy)
    // ★ represents a single wildcard character
    public StringBuilder wildcard(StringBuilder text, string find, string replace, bool caseSensitive = false)
    {
        return wildcard(text, new string[] { find }, new string[] { replace }, caseSensitive);
    }
    public StringBuilder wildcard(StringBuilder text, string[] find, string[] replace, bool caseSensitive = false)
    {
        if (text.Length == 0) return text; // Degenerate case

        StringBuilder sb = new StringBuilder(); // The new adjusted string with replacements
        for (int i = 0; i < text.Length; i++) { // Go through every letter of the original large text

            bool foundMatch = false; // Assume match hasn't been found to begin with
            for (int q = 0; q < find.Length; q++) { // Go through each query in turn
                if (find[q].Length == 0) continue; // Ignore empty queries

                int f = 0; int g = 0; // Query cursor and text cursor
                bool multiWild = false; // multiWild is ✪ symbol which represents many wildcard characters
                int multiWildPosition = 0;

                while (true) { // Loop through query characters
                    if (f >= find[q].Length || (i + g) >= text.Length) break; // Bounds checking
                    char cf = find[q][f]; // Character in the query (f is the offset)
                    char cg = text[i + g]; // Character in the text (g is the offset)
                    // Lowercase both sides so that an upper-case query still matches when case is ignored
                    if (!caseSensitive) { cf = char.ToLowerInvariant(cf); cg = char.ToLowerInvariant(cg); }
                    if (cf != '★' && cf != '✪' && cg != cf && !multiWild) break; // Break search, and thus no match is found
                    if (cf == '✪') { multiWild = true; multiWildPosition = f; f++; continue; } // Multi-char wildcard activated. Move query cursor, and reloop
                    if (multiWild && cg != cf && cf != '★') { f = multiWildPosition + 1; g++; continue; } // Match since MultiWild has failed, so return query cursor to MultiWild position
                    f++; g++; // Reaching here means that a single character was matched, so move both query and text cursor along one
                }

                if (f == find[q].Length) { // If true, query cursor has reached the end of the query, so a match has been found!!!
                    sb.Append(replace[q]); // Append replacement
                    foundMatch = true;
                    if (find[q][f - 1] == '✪') { i = text.Length; break; } // If the MultiWild is the last char in the query, then the rest of the string is a match, and so close off
                    i += g - 1; // Move text cursor along by the amount equivalent to its found match
                    break; // Do not try the remaining queries against text that has already been consumed
                }
            }
            if (!foundMatch) sb.Append(text[i]); // If a match wasn't found at that point in the text, then just append the original character
        }
        return sb;
    }