Distinct on a list of strings, ignoring whitespace, diacritics and case

Luc*_*ira 5 .net c# unicode

Given the following list of strings:

string[] Itens = new string[] { "hi", " hi   ", "HI", "hí", " Hî", "hi hi", " hí hí ", "olá", "OLÁ", " olá   ", "", "ola", "hola", " holà    ", "aaaa", "áâàa", " aâàa     ", "áaàa", "áâaa ", "aaaa ", "áâaa", "áâaa", };

The result of a Distinct operation should be:

hi, hi hi, olá, , hola, aaaa

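The C# Distinct operation available on IEnumerable accepts an IEqualityComparer as a parameter, so the comparison can be customized.

The following implementation does the job: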

class LengthHash : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;

        var xt = x.Trim();
        var yt = y.Trim();

        return xt.Length == yt.Length && Culture.CompareInfo.IndexOf(xt, yt, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) >= 0;
    }

    public int GetHashCode(string obj) => obj?.Trim().Length ?? 1;
}

If the GetHashCode values of two items differ, Equals is not even executed, so a good GetHashCode implementation matters.
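To make that gating visible, here is a minimal sketch. `ProbeComparer` and `HashGateDemo` are hypothetical names introduced only for this illustration; the comparer pairs a case-insensitive Equals with a caller-supplied hash function and counts how often Equals is called.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer for illustration only: case-insensitive Equals,
// a hash function supplied by the caller, and a counter of Equals calls.
class ProbeComparer : IEqualityComparer<string>
{
    private readonly Func<string, int> _hash;

    public int EqualsCalls { get; private set; }

    public ProbeComparer(Func<string, int> hash) => _hash = hash;

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(string obj) => _hash(obj);
}

static class HashGateDemo
{
    // Runs Distinct over { "hi", "HI" } and reports how many items survive
    // and how many times Equals was invoked along the way.
    public static (int Count, int EqualsCalls) Run(Func<string, int> hash)
    {
        var comparer = new ProbeComparer(hash);
        int count = new[] { "hi", "HI" }.Distinct(comparer).Count();
        return (count, comparer.EqualsCalls);
    }
}
```

With `HashGateDemo.Run(s => s.Length)` both strings share a hash, so Equals runs and one item remains. With `HashGateDemo.Run(s => s[0])` the hashes of 'h' and 'H' differ, so both items survive and Equals is never invoked, even though it would have returned true.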

I tried changing GetHashCode to 2 other approaches.

Ignore hash

public int GetHashCode(string obj) => 1;

Normalized hash

public int GetHashCode(string obj) => obj?.Trim().Normalize().ToUpperInvariant().GetHashCode() ?? 1;
// obs: This approach doesn't produce the same output.

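Besides using a custom IEqualityComparer, I also tried trimming the list before running Distinct with StringComparer.InvariantCultureIgnoreCase, but it produced the same output as the Normalize and Upper version.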
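A likely reason the normalized hash keeps 15 items: the parameterless Normalize() uses Form C, which leaves accents composed, so "hí" and "hi" still hash differently and Equals is never reached for them. A minimal sketch of a hash key that drops the accents instead, assuming Form D decomposition followed by removal of non-spacing marks is an acceptable approximation of IgnoreNonSpace (`FoldedKey` is a hypothetical helper, not part of the original test):

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: builds a trimmed, accent-free, upper-cased key.
// Assumption: this folding approximates IgnoreNonSpace | IgnoreCase for
// inputs like the ones in the question; it is not guaranteed to match
// CompareInfo for every string.
static class FoldedKey
{
    public static string Fold(string s)
    {
        if (s == null) return null;

        var sb = new StringBuilder();
        // Form D splits "í" into "i" plus a combining acute accent.
        foreach (char c in s.Trim().Normalize(NormalizationForm.FormD))
        {
            // Skip the combining marks, keep and upper-case everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(char.ToUpperInvariant(c));
        }
        return sb.ToString();
    }

    // Same null convention as the hash functions above: null hashes to 1.
    public static int Hash(string s) => Fold(s)?.GetHashCode() ?? 1;
}
```

Plugged into the comparer as `public int GetHashCode(string obj) => FoldedKey.Hash(obj);`, strings such as " Hî" and "hi" land in the same bucket, so the final decision is left to Equals.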

Benchmarking plain Distinct, StringComparer.InvariantCultureIgnoreCase and the 3 custom approaches gives the following results:

                              Method |       Mean |    StdErr |    StdDev |     Median |
------------------------------------ |----------- |---------- |---------- |----------- |
                          RunDefault |  2.2224 us | 0.0242 us | 0.2391 us |  2.1414 us |
                     RunHashAsLength |  6.0765 us | 0.0515 us | 0.1857 us |  6.1235 us |
                       RunIgnoreHash |  6.4078 us | 0.0640 us | 0.6140 us |  6.1982 us |
                   RunNormalizedHash | 14.5941 us | 0.0742 us | 0.3556 us | 14.4983 us |
 RunTrimAndCompareWithStringComparer | 14.4935 us | 0.0213 us | 0.0768 us | 14.5352 us |

And the output is:

21 Default: hi,  hi   , HI, hí,  Hî, hi hi,  hí hí , olá, OLÁ,  olá   , , ola, hola,  holà    , aaaa, áâàa,  aâàa     , áaàa, áâaa , aaaa , áâaa
6 HashAsLength: hi, hi hi, olá, , hola, aaaa
6 IgnoreHash: hi, hi hi, olá, , hola, aaaa
15 NormalizedHash: hi, hí,  Hî, hi hi,  hí hí , olá, , ola, hola,  holà    , aaaa, áâàa,  aâàa     , áaàa, áâaa
15 RunTrimAndCompareWithStringComparer: hi, hí, Hî, hi hi, hí hí, olá, , ola, hola, holà, aaaa, áâàa, aâàa, áaàa, áâaa

The full test is available at https://gist.github.com/Flash3001/d50a6b43bba7bc61e3d85734e40dbed9

The question is: is there a better way to get the desired final list? It could be a different GetHashCode, a different Equals, or some other predefined IEqualityComparer.

The*_*ias 0

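You could use the CompareInfo class and the dedicated Compare and GetHashCode methods it provides. That way you can be sure the two implementations are consistent with each other. Correctness comes first; performance comes second.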

class StringEqualityComparer : IEqualityComparer<string>
{
    private CultureInfo _cultureInfo;
    private CompareOptions _options;
    private bool _trim;

    public StringEqualityComparer(CultureInfo cultureInfo,
        CompareOptions options, bool trim)
    {
        _cultureInfo = cultureInfo;
        _options = options;
        _trim = trim;
    }

    public bool Equals(string x, string y)
    {
        if (_trim) { x = x?.Trim(); y = y?.Trim(); }
        return _cultureInfo.CompareInfo.Compare(x, y, _options) == 0;
    }

    public int GetHashCode(string obj)
    {
        if (_trim) obj = obj?.Trim();
        return _cultureInfo.CompareInfo.GetHashCode(obj, _options);
    }
}

Usage example:

var comparer = new StringEqualityComparer(CultureInfo.InvariantCulture,
    CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase, true);
var items = new string[] { "hi", " hi   ", "HI", "hí", " Hî", "hi hi", " hí hí ",
    "olá", "OLÁ", " olá   ", "", "ola", "hola", " holà    ", "aaaa", "áâàa",
    " aâàa     ", "áaàa", "áâaa ", "aaaa ", "áâaa", "áâaa", };
Console.WriteLine($"Distinct: {String.Join(", ", items.Distinct(comparer))}");

Output:

Distinct: hi, hi hi, olá, , hola, aaaa
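The consistency claim can also be spot-checked on sample data. The sketch below is a hypothetical helper (`ComparerContract` is not part of any library) that counts pairs a comparer reports as equal while giving them different hash codes; any non-zero count means Distinct may miss duplicates.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration: a brute-force check of the
// "equal items must have equal hash codes" rule on a sample list.
static class ComparerContract
{
    // Counts unordered pairs that compare as equal but hash differently.
    public static int CountViolations<T>(IReadOnlyList<T> items, IEqualityComparer<T> comparer)
    {
        int violations = 0;
        for (int i = 0; i < items.Count; i++)
        {
            for (int j = i + 1; j < items.Count; j++)
            {
                if (comparer.Equals(items[i], items[j]) &&
                    comparer.GetHashCode(items[i]) != comparer.GetHashCode(items[j]))
                {
                    violations++;
                }
            }
        }
        return violations;
    }
}
```

Calling `ComparerContract.CountViolations(items, comparer)` with the list and comparer from the usage example is one way to confirm the pairing of Compare and GetHashCode on real data. The check is quadratic, so it is only meant for small test lists.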