有许多类似的问题,但显然没有完美的匹配,这就是我要问的原因.
我想分裂一个随机字符串(如123xx456yy789
通过字符串分隔符的列表)(例如xx
,yy
),并包括在结果中的分隔符(在这里:123
,xx
,456
,yy
,789
).
良好的表现是一个很好的奖金.如果可能的话,应该避免使用正则表达式.
更新:我做了一些性能检查并比较了结果(虽然懒得正式检查).测试的解决方案是(随机顺序):
其他解决方案未经过测试,因为它们与其他解决方案类似,或者来得太晚.
这是测试代码:
class Program
{
private static readonly List<Func<string, List<string>, List<string>>> Functions;
private static readonly List<string> Sources;
private static readonly List<List<string>> Delimiters;
static Program ()
{
Functions = new List<Func<string, List<string>, List<string>>> ();
Functions.Add ((s, l) => s.SplitIncludeDelimiters_Gabe (l).ToList ());
Functions.Add ((s, l) => s.SplitIncludeDelimiters_Guffa (l).ToList ());
Functions.Add ((s, l) => s.SplitIncludeDelimiters_Naive (l).ToList ());
Functions.Add ((s, l) => s.SplitIncludeDelimiters_Regex (l).ToList ());
Sources = new List<string> ();
Sources.Add ("");
Sources.Add (Guid.NewGuid ().ToString ());
string str = "";
for (int outer = 0; outer < 10; outer++) {
for (int i = 0; i < 10; i++) {
str += i + "**" + DateTime.UtcNow.Ticks;
}
str += "-";
}
Sources.Add (str);
Delimiters = new List<List<string>> ();
Delimiters.Add (new List<string> () { });
Delimiters.Add (new List<string> () { "-" });
Delimiters.Add (new List<string> () { "**" });
Delimiters.Add (new List<string> () { "-", "**" });
}
private class Result
{
public readonly int FuncID;
public readonly int SrcID;
public readonly int DelimID;
public readonly long Milliseconds;
public readonly List<string> Output;
public Result (int funcID, int srcID, int delimID, long milliseconds, List<string> output)
{
FuncID = funcID;
SrcID = srcID;
DelimID = delimID;
Milliseconds = milliseconds;
Output = output;
}
public void Print ()
{
Console.WriteLine ("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
Console.WriteLine (Output.Count + "\t" + string.Join (" ", Output.Take (10).Select (x => x.Length < 15 ? x : x.Substring (0, 15) + "...").ToArray ()));
}
}
static void Main (string[] args)
{
var results = new List<Result> ();
for (int srcID = 0; srcID < 3; srcID++) {
for (int delimID = 0; delimID < 4; delimID++) {
for (int funcId = 3; funcId >= 0; funcId--) { // i tried various orders in my tests
Stopwatch sw = new Stopwatch ();
sw.Start ();
var func = Functions[funcId];
var src = Sources[srcID];
var del = Delimiters[delimID];
for (int i = 0; i < 10000; i++) {
func (src, del);
}
var list = func (src, del);
sw.Stop ();
var res = new Result (funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
results.Add (res);
res.Print ();
}
}
}
}
}
Run Code Online (Sandbox Code Playgroud)
正如您所看到的,它实际上只是一个快速而肮脏的测试,但我多次运行测试并且顺序不同,结果始终非常一致.对于较大的数据集,测量的时间范围在毫秒到秒的范围内.我在下面的评估中忽略了低毫秒范围内的值,因为它们在实践中似乎可以忽略不计.这是我的盒子上的输出:
S 0 D 0 F 3 11ms 1 S 0 D 0 F 2 7ms 1 S 0 D 0 F 1 6ms 1 S 0 D 0 F 0 4ms 0 S 0 D 1 F 3 28ms 1 S 0 D 1 F 2 8ms 1 S 0 D 1 F 1 7ms 1 S 0 D 1 F 0 3ms 0 S 0 D 2 F 3 30ms 1 S 0 D 2 F 2 8ms 1 S 0 D 2 F 1 6ms 1 S 0 D 2 F 0 3ms 0 S 0 D 3 F 3 30ms 1 S 0 D 3 F 2 10ms 1 S 0 D 3 F 1 8ms 1 S 0 D 3 F 0 3ms 0 S 1 D 0 F 3 9ms 1 9e5282ec-e2a2-4... S 1 D 0 F 2 6ms 1 9e5282ec-e2a2-4... S 1 D 0 F 1 5ms 1 9e5282ec-e2a2-4... S 1 D 0 F 0 5ms 1 9e5282ec-e2a2-4... S 1 D 1 F 3 63ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 1 F 2 37ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 1 F 1 29ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 1 F 0 22ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 2 F 3 30ms 1 9e5282ec-e2a2-4... S 1 D 2 F 2 10ms 1 9e5282ec-e2a2-4... S 1 D 2 F 1 10ms 1 9e5282ec-e2a2-4... S 1 D 2 F 0 12ms 1 9e5282ec-e2a2-4... S 1 D 3 F 3 73ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 3 F 2 40ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 3 F 1 33ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 1 D 3 F 0 30ms 9 9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37 S 2 D 0 F 3 10ms 1 0**634226552821... S 2 D 0 F 2 109ms 1 0**634226552821... S 2 D 0 F 1 5ms 1 0**634226552821... S 2 D 0 F 0 127ms 1 0**634226552821... S 2 D 1 F 3 184ms 21 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226 552821... - 0**634226552821... - S 2 D 1 F 2 364ms 21 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226 552821... - 0**634226552821... - S 2 D 1 F 1 134ms 21 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226 552821... - 0**634226552821... - S 2 D 1 F 0 517ms 20 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226 552821... - 0**634226552821... - S 2 D 2 F 3 688ms 201 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 2 F 2 2404ms 201 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 2 F 1 874ms 201 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 2 F 0 717ms 201 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 3 F 3 1205ms 221 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 3 F 2 3471ms 221 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 3 F 1 1008ms 221 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... ** S 2 D 3 F 0 1095ms 220 0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6 34226552821217... **
我比较了结果,这就是我发现的:
总结这个主题,我建议使用速度相当快的正则表达式.如果性能至关重要,我宁愿选择Guffa.
Ahm*_*eed 44
尽管你不愿意使用正则表达式,但它实际上通过使用一个组以及Regex.Split
方法很好地保留了分隔符:
string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);
Run Code Online (Sandbox Code Playgroud)
如果从模式中删除括号,则使用just "xx|yy"
,不会保留分隔符.如果使用在regex中具有特殊含义的任何元字符,请务必在模式上使用Regex.Escape.人物包括\, *, +, ?, |, {, [, (,), ^, $,., #
.例如,.
应该转义分隔符\.
.给定分隔符列表,您需要使用管道|
符号"或"它们,并且这也是一个被转义的字符.要正确构建模式,请使用以下代码(感谢@gabe指出这一点):
var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
.ToArray())
+ ")";
Run Code Online (Sandbox Code Playgroud)
括号是连接而不是包含在模式中,因为它们会因为您的目的而被错误地转义.
编辑:此外,如果delimiters
列表恰好是空的,最终的模式将不正确()
,这将导致空白匹配.为了防止这种情况,可以使用对分隔符的检查.考虑到所有这一切,片段变为:
string input = "123xx456yy789";
// to reach the else branch set delimiters to new List();
var delimiters = new List<string> { ".", "xx", "yy", "()" };
if (delimiters.Count > 0)
{
string pattern = "("
+ String.Join("|", delimiters.Select(d => Regex.Escape(d))
.ToArray())
+ ")";
string[] result = Regex.Split(input, pattern);
foreach (string s in result)
{
Console.WriteLine(s);
}
}
else
{
// nothing to split
Console.WriteLine(input);
}
Run Code Online (Sandbox Code Playgroud)
如果您需要对分隔符进行不区分大小写的匹配,请使用以下RegexOptions.IgnoreCase
选项:Regex.Split(input, pattern, RegexOptions.IgnoreCase)
编辑#2:到目前为止,解决方案匹配可能是较大字符串的子字符串的拆分令牌.如果拆分令牌应该完全匹配,而不是子串的一部分,例如句子中的单词用作分隔符的场景,\b
则应在模式周围添加单词边界元字符.
例如,考虑这句话(是的,它是老生常谈): "Welcome to stackoverflow... where the stack never overflows!"
如果分隔符是{ "stack", "flow" }
当前解决方案将拆分"stackoverflow"并返回3个字符串{ "stack", "over", "flow" }
.如果你需要一个完全匹配,那么这个分裂的唯一地方就是句子后面的"堆栈"而不是"stackoverflow".
要实现完全匹配行为,请更改模式以包括\b
如下\b(delim1|delim2|delimN)\b
:
string pattern = @"\b("
+ String.Join("|", delimiters.Select(d => Regex.Escape(d)))
+ @")\b";
Run Code Online (Sandbox Code Playgroud)
最后,如果需要修剪分隔符之前和之后的空格,请添加\s*
模式,如下所示\s*(delim1|delim2|delimN)\s*
.这可以结合\b
如下:
string pattern = @"\s*\b("
+ String.Join("|", delimiters.Select(d => Regex.Escape(d)))
+ @")\b\s*";
Run Code Online (Sandbox Code Playgroud)
Nag*_*agg 12
好的,抱歉,也许这一个:
string source = "123xx456yy789";
foreach (string delimiter in delimiters)
source = source.Replace(delimiter, ";" + delimiter + ";");
string[] parts = source.Split(';');
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
25618 次 |
最近记录: |