如何从字符串中间执行对文化敏感的"启动"操作？

Jon*_*eet 105 .net string unicode

我有一个相对模糊的要求,但感觉应该可以使用BCL.

对于上下文,我正在Noda Time中解析日期/时间字符串.我为输入字符串中的位置维护一个逻辑光标.因此,虽然完整的字符串可能是"2013年1月3日",但逻辑光标可能位于"J".

现在,我需要解析月份名称,将其与文化的所有已知月份名称进行比较:

文化敏感
不区分大小写
只是从光标的角度来看(不是更晚;我想看看光标是否"看着"候选月份名称)
很快
......之后我需要知道使用了多少个字符

在当前的代码做这个工作通常使用CompareInfo.Compare.它实际上是这样的(仅用于匹配部分 - 在真实的东西中有更多的代码,但它与匹配无关):

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length, 
                               CompareOptions.IgnoreCase) == 0;
}

Run Code Online (Sandbox Code Playgroud)

但是,这取决于候选人和我们比较的区域长度相同.大部分时间都很好,但在某些特殊情况下并不好.假设我们有类似的东西:

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

Run Code Online (Sandbox Code Playgroud)

现在我的比较会失败.我可以用IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

Run Code Online (Sandbox Code Playgroud)

但:

这需要我创建一个子串,我真的宁愿避免.(我正在将Noda Time视为一个有效的系统库;解析性能对于某些客户来说可能很重要.)
它没有告诉我之后前进光标的距离

实际上,我强烈怀疑这种情况不会经常发生......但我真的很想在这里做正确的事情.我也非常希望能够在不成为Unicode专家或自己实现的情况下完成它:)

(在Noda Time中被提升为错误210,以防任何人想要遵循任何最终结论.)

我喜欢规范化的想法.我需要详细检查a)正确性和b)性能.假设我可以使它正常工作,我仍然不确定它是否值得改变所有 - 它可能在实际生活中可能永远不会出现,但可能会损害我所有用户的性能: (

我也检查了BCL - 它似乎也没有正确处理.示例代码:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

将自定义月份名称更改为"床",文本值为"bEd",解析正常.

好的,还有一些数据点:

使用的成本Substring和IsPrefix重要但不可怕.在我的开发笔记本电脑上的"星期五2013年4月12日20:28:42"的示例中,它将我可以执行的解析操作的数量从大约460K更改为大约400K.如果可能的话,我宁愿避免这种放缓,但这并不算太糟糕.
规范化不如我想象的那么可行 - 因为它在Portable Class Libraries中不可用.我可能只将它用于非PCL构建,允许PCL构建不那么正确.测试规范化(string.IsNormalized)的性能损失将性能降低到每秒约445K次呼叫,这是我可以忍受的.我仍然不确定它是否能满足我所需要的一切 - 例如,包含"ß"的月份名称应与许多文化中的"ss"相匹配,我相信......并且正常化并不能做到这一点.

我将首先考虑许多< - >一个/多个案例映射的问题,并与处理不同的规范化表单分开考虑.

例如:

x heiße y
  ^--- cursor

Run Code Online (Sandbox Code Playgroud)

匹配,heisse但然后过多地移动光标1.和:

x heisse y
  ^--- cursor

Run Code Online (Sandbox Code Playgroud)

匹配,heiße但然后移动光标1太少.

这将适用于没有简单的一对一映射的任何字符.

您需要知道实际匹配的子字符串的长度.但是Compare,IndexOf..等把这些信息扔掉了.它可以使用正则表达式,但实现不会进行完整的大小写折叠,因此在不区分大小写的模式下也不匹配 ß,ss/SS即使这样做.Compare也是.IndexOf如此.无论如何,为每个候选人创建新的正则表达可能代价很高.

对此最简单的解决方案是在内部存储字符串以便折叠形式,并与案例折叠候选者进行二进制比较.然后您可以正确移动光标,.Length因为光标是用于内部表示.您还可以从不必使用中获得大部分丢失的性能CompareOptions.IgnoreCase.

不幸的是,没有内置的折叠功能和穷人的案例折叠也不起作用,因为没有完整的案例映射 - 该ToUpper方法没有ß变成SS.

例如,这适用于Java(甚至在Javascript中),给定的格式为普通形式C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

Run Code Online (Sandbox Code Playgroud)

有趣的是,Java的忽略大小写比较并没有像C#那样进行完整的大小写折叠CompareOptions.IgnoreCase.所以他们在这方面是相反的:Java做了完整的casemapping,但简单的案例折叠 - C#做了简单的casemapping,但是完整的case折叠.

因此,在使用它们之前,您可能需要使用第三方库来折叠字符串.

在做任何事情之前,你必须确保你的字符串是正常形式C.你可以使用针对拉丁文脚本优化的初步快速检查:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

Run Code Online (Sandbox Code Playgroud)

这给出了误报但不是假阴性,我不希望它在使用拉丁文字符时完全减慢460k parses/s,即使它需要在每个字符串上执行.如果有误报,您将使用IsNormalized真正的负面/正面,并且只有在必要时才进行正常化.

总而言之,处理是首先确保正常形式C,然后是情况折叠.与处理过的字符串进行二进制比较,并在当前移动光标时移动光标.

@JonSkeet是的,Turkic在casefold映射中应该拥有自己的模式:P请参见[CaseFolding.txt]标题中的格式部分(http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) (2认同)

看看这是否符合要求..:

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)!=startIndex)
            return ~0;
        else
            // source is started with prefix
            // therefore the loop must exit
            for(int length2=0, length1=prefix.Length; ; )
                if(0==compareInfo.Compare(
                        prefix, 0, length1, 
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

Run Code Online (Sandbox Code Playgroud)

compareInfo.Compare只执行一次source开始prefix; 如果没有,则IsPrefix返回-1; 否则,使用的字符长度source.

不过,我除了增量不知道length2通过1与下面的情况:

var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";

var count=
    culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

Run Code Online (Sandbox Code Playgroud)

更新:

我尝试改进一点性能,但未证明以下代码中是否存在错误:

public static partial class GlobalizationExtensions {
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2, 
        CompareOptions options) {
        int length1=prefix.Length, v2, v1;

        if(0==(v1=compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
            ) {
            return 0;
        }
        else {
            if(0==(v2=compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1+length2, options))
                ) {
                ++length2;
                return 0;
            }
            else {
                if(v1<0||v2<0) {
                    length2-=2;
                    return -1;
                }
                else {
                    length2+=2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)
                !=startIndex)
            return ~0;
        else
            for(int length2=
                    Math.Min(prefix.Length, source.Length-(1+startIndex)); ; )
                if(0==compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

Run Code Online (Sandbox Code Playgroud)

我测试了特定的情况,比较到3左右.

确实.Noda Time不希望进入Unicode细节业务:) (3认同)
我曾经这样解决过类似的问题(HTML中的搜索字符串突出显示).我这样做了.您可以通过首先检查可能的案例来快速完成循环和搜索策略,从而调整循环和搜索策略.关于这一点的好处是,它似乎是完全正确的,没有Unicode细节泄漏到您的代码中. (2认同)

这实际上是可能的,没有标准化和不使用IsPrefix.

我们需要比较相同数量的文本元素而不是相同数量的字符,但仍然返回匹配字符的数量.

我已经在Noda Time中MatchCaseInsensitive从ValueCursor.cs创建了该方法的副本,并稍微修改了它,以便它可以在静态上下文中使用:

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

Run Code Online (Sandbox Code Playgroud)

(仅供参考,如您所知,代码无法正确比较)

该方法的以下变体使用框架提供的StringInfo.GetNextTextElement.我们的想法是将text元素与text元素进行比较以查找匹配项,如果找到则返回源字符串中匹配字符的实际数量:

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

Run Code Online (Sandbox Code Playgroud)

这种方法至少可以根据我的测试用例(基本上只是测试你提供的字符串的几个变体:"b\u00e9d"和"be\u0301d")来运行.

但是,GetNextTextElement方法为每个文本元素创建一个子字符串,因此这种实现需要很多子字符串比较 - 这将对性能产生影响.

因此,我创建了另一个不使用GetNextTextElement的变体,而是跳过Unicode组合字符以查找字符中的实际匹配长度:

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

Run Code Online (Sandbox Code Playgroud)

该方法使用以下两个助手:

static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

Run Code Online (Sandbox Code Playgroud)

我没有做任何基准测试,所以我真的不知道更快的方法实际上是否更快.我也没有做过任何扩展测试.

但是这应该回答你关于如何对可能包含Unicode组合字符的字符串执行文化敏感子字符串匹配的问题.

这些是我用过的测试用例:

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

Run Code Online (Sandbox Code Playgroud)

元组值是:

源字符串(haystack)
源头的起始位置.
匹配字符串(针).
预期的匹配长度.

对这三种方法运行这些测试会产生以下结果:

Test #0: Orignal=BAD; New=OK; Faster=OK
Test #1: Orignal=BAD; New=OK; Faster=OK
Test #2: Orignal=BAD; New=OK; Faster=OK
Test #3: Orignal=BAD; New=OK; Faster=OK
Test #4: Orignal=BAD; New=OK; Faster=OK
Test #5: Orignal=BAD; New=OK; Faster=OK
Test #6: Orignal=BAD; New=OK; Faster=OK
Test #7: Orignal=BAD; New=OK; Faster=OK
Test #8: Orignal=OK; New=OK; Faster=OK
Test #9: Orignal=OK; New=OK; Faster=OK

Run Code Online (Sandbox Code Playgroud)

最后两个测试是测试源字符串短于匹配字符串的情况.在这种情况下,原始(Noda时间)方法也将成功.

归档时间：	12 年，6 月前
查看次数：	5856 次
最近记录：	11 年，7 月前

如何检查结构消耗的字节数？ 42

在.config文件中启用自定义部分的智能感知 38

在定位C++ 03时使用std :: basic_string <t>作为连续缓冲区是否合理？ 32

如何在iOS上使用NSLog打印出字符串常量 22

Python使用正则表达式和replace()查找某些字符之间的子字符串 13

在Python中对字符串使用.format时的KeyError 11

在N个字符串中查找公共子字符串的算法 8

使用AngularJS在HTML中显示Unicode？ 8

如何正确显示日语RTF字体 5

如何将项目符号字符打印到控制台？ 5

可以在JSON中使用注释吗？ 7104

如何确定变量是"未定义"还是"空"？ 2000

什么时候使用自我超过$？ 1948

什么是Android上的"上下文"？ 1872

Node.js module.exports的目的是什么,你如何使用它？ 1397

安全地将JSON字符串转换为对象 1298

如何在JavaScript正则表达式中访问匹配的组？ 1277

visibility:hidden和display:none之间有什么区别？ 1121

在拉动期间解决Git合并冲突以支持其更改 1104

如何查找Python中是否存在目录 1048