The TextElementEnumerator is very useful and efficient:
private static List<SoundCount> CountSounds(IEnumerable<string> words)
{
    Dictionary<string, SoundCount> soundCounts = new Dictionary<string, SoundCount>();
    foreach (var word in words)
    {
        TextElementEnumerator graphemeEnumerator = StringInfo.GetTextElementEnumerator(word);
        while (graphemeEnumerator.MoveNext())
        {
            string grapheme = graphemeEnumerator.GetTextElement();
            SoundCount count;
            if (!soundCounts.TryGetValue(grapheme, out count))
            {
                count = new SoundCount() { Sound = grapheme };
                soundCounts.Add(grapheme, count);
            }
            count.Count++;
        }
    }
    return new List<SoundCount>(soundCounts.Values);
}
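The method above relies on a SoundCount type that the answer does not show. As a minimal sketch, assuming SoundCount is a plain class with a Sound and a Count property (a hypothetical shape inferred only from how it is used; it has to be a reference type for count.Count++ to update the dictionary entry), the following shows what a text element is: a base character together with the combining marks that follow it. The TextElementDemo class and its Split helper are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical shape of SoundCount, inferred from how CountSounds uses it.
public class SoundCount
{
    public string Sound { get; set; }
    public int Count { get; set; }
}

public static class TextElementDemo
{
    // Splits a string into text elements (base character plus its combining marks).
    public static List<string> Split(string word)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(word);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // "e" + U+0301 (combining acute accent) + "t": three chars, two text elements.
        List<string> parts = Split("e\u0301t");
        Console.WriteLine(parts.Count);      // 2
        Console.WriteLine(parts[0].Length);  // 2 (base letter plus combining mark)
    }
}
```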
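You can also do this with a regular expression, shown below. (According to the documentation, TextElementEnumerator handles some cases that this expression does not, notably supplementary characters, but those are quite rare and in any case my application does not need them.)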
private static List<SoundCount> CountSoundsRegex(IEnumerable<string> words)
{
    var soundCounts = new Dictionary<string, SoundCount>();
    var graphemeExpression = new Regex(@"\P{M}\p{M}*");
    foreach (var word in words)
    {
        Match graphemeMatch = graphemeExpression.Match(word);
        while (graphemeMatch.Success)
        {
            string grapheme = graphemeMatch.Value;
            SoundCount count;
            if (!soundCounts.TryGetValue(grapheme, out count))
            {
                count = new SoundCount() { Sound = grapheme };
                soundCounts.Add(grapheme, count);
            }
            count.Count++;
            graphemeMatch = graphemeMatch.NextMatch();
        }
    }
    return new List<SoundCount>(soundCounts.Values);
}
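To make the supplementary-character caveat concrete, here is a small sketch comparing the two approaches on a single character outside the Basic Multilingual Plane. .NET strings are UTF-16, so such a character is stored as two surrogate chars; the regex works char by char and neither surrogate is a mark, so it reports two matches where TextElementEnumerator reports one element. The SupplementaryDemo class and its method names are made up for this illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SupplementaryDemo
{
    // Counts text elements as TextElementEnumerator sees them.
    public static int CountTextElements(string s)
    {
        int n = 0;
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            n++;
        }
        return n;
    }

    // Counts "graphemes" as the simple regex sees them.
    public static int CountRegexMatches(string s)
    {
        return Regex.Matches(s, @"\P{M}\p{M}*").Count;
    }

    public static void Main()
    {
        // U+1D11E (musical symbol G clef) is a surrogate pair in UTF-16.
        string clef = "\U0001D11E";
        Console.WriteLine(CountTextElements(clef));  // 1
        Console.WriteLine(CountRegexMatches(clef));  // 2
    }
}
```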
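Performance: in my testing, I found TextElementEnumerator to be about 4 times as fast as the regex.
Unfortunately, there is no way to "tweak" how TextElementEnumerator enumerates, so that class is of no use in the real-world scenario.
One solution is to tweak our regex: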
[^\p{M}\p{Lm}]        # Match a character that is NOT a character intended to be combined with another character and NOT a special character that is used like a letter
(?:                   # Start a group for the combining characters:
  (?:                 # Start a group for tied characters:
    [\u035C\u0361]    # Match an under- or over- tie bar...
    \P{M}\p{M}*       # ...followed by another grapheme (in the simplified sense)
  )                   # (End the tied characters group)
  |\p{M}              # OR a character intended to be combined with another character
  |\p{Lm}             # OR a special character that is used like a letter
)*                    # Match the combining characters group zero or more times.
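A pattern with # comments like this one needs RegexOptions.IgnorePatternWhitespace to compile in .NET. The sketch below wires it up that way and shows the effect on a tie bar and a modifier letter. The first character class is written here as [^\p{M}\p{Lm}] (neither a mark nor a modifier letter), which is what the comment on that line describes; treating that as the intent is an assumption. TieBarDemo and Split are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TieBarDemo
{
    // Whitespace and # comments in the pattern are ignored because of
    // RegexOptions.IgnorePatternWhitespace (whitespace inside [...] is not).
    private const string Pattern = @"
        [^\p{M}\p{Lm}]                        # base: neither a mark nor a modifier letter
        (?:
            (?:[\u035C\u0361]\P{M}\p{M}*)     # tie bar followed by another grapheme
            |\p{M}                            # or a combining mark
            |\p{Lm}                           # or a modifier letter
        )*";

    private static readonly Regex Grapheme =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    public static List<string> Split(string word)
    {
        var result = new List<string>();
        foreach (Match m in Grapheme.Matches(word))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // "t" + U+0361 (tie bar) + "s" + "a": the tied pair stays together.
        Console.WriteLine(Split("t\u0361sa").Count);  // 2
        // "t" + U+02B0 (modifier letter small h) + "a": the modifier stays with "t".
        Console.WriteLine(Split("t\u02B0a").Count);   // 2
    }
}
```

For comparison, TextElementEnumerator would split the first string after the tie bar, which is why it cannot be used for this case.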
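We could also use CharUnicodeInfo.GetUnicodeCategory to write our own IEnumerator<string> and regain the performance, but that seems like too much work to me, and it would be extra code to maintain. (Anyone else want to have a go?) Regexes are made for this.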
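For anyone who does want to have a go, here is one possible sketch of the hand-rolled approach. It is not the author's code and has not been benchmarked; it uses an iterator method returning IEnumerable<string> instead of a hand-written IEnumerator<string>, and it tries to follow the same rules as the tweaked regex, with the base character read as "neither a mark nor a modifier letter". GraphemeSplitter and its helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GraphemeSplitter
{
    // Unicode category M: non-spacing, spacing combining, and enclosing marks.
    private static bool IsMark(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Unicode category Lm.
    private static bool IsModifierLetter(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.ModifierLetter;
    }

    private static bool IsTieBar(char c)
    {
        return c == '\u035C' || c == '\u0361';
    }

    public static IEnumerable<string> Split(string word)
    {
        int i = 0;
        while (i < word.Length)
        {
            // A grapheme must start on a base character; skip stray marks.
            if (IsMark(word[i]) || IsModifierLetter(word[i]))
            {
                i++;
                continue;
            }

            int start = i++;
            while (i < word.Length)
            {
                char c = word[i];
                if (IsTieBar(c) && i + 1 < word.Length && !IsMark(word[i + 1]))
                {
                    i += 2;  // tie bar plus the character it ties to
                    while (i < word.Length && IsMark(word[i]))
                    {
                        i++; // marks belonging to the tied character
                    }
                }
                else if (IsMark(c) || IsModifierLetter(c))
                {
                    i++;
                }
                else
                {
                    break;   // next base character starts a new grapheme
                }
            }
            yield return word.Substring(start, i - start);
        }
    }

    public static void Main()
    {
        var parts = new List<string>(Split("t\u0361sa"));
        Console.WriteLine(parts.Count);  // 2
    }
}
```

Like the regex, this works on UTF-16 chars, so supplementary characters are still split into their two surrogates.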