Sorting in Ruby using the Unicode Collation Algorithm

Question

Sch*_*ern 8 ruby postgresql unicode collation

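Ruby and Postgres sort slightly differently, which is causing subtle problems in my project. There are two issues: accented characters and spaces. It looks like Ruby is sorting by ASCII value, while Postgres is sorting using the proper Unicode collation algorithm.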

Heroku Postgres 11.2. The database collation is en_US.UTF-8.

psql (11.3, server 11.2 (Ubuntu 11.2-1.pgdg16.04+1))
...
=> select 'quia et' > 'qui qui';
 ?column? 
----------
 f
(1 row)
=> select 'quib' > 'qüia';
 ?column? 
----------
 t
(1 row)

Ruby 2.4.4 on Heroku.

Loading production environment (Rails 5.2.2.1)
[1] pry(main)> 'quia et' > 'qui qui'
=> true
[2] pry(main)> 'quib' > 'qüia'
=> false
[3] pry(main)> ENV['LANG']
=> "en_US.UTF-8"
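
A small sketch of why the Ruby results above come out the way they do: String comparison in Ruby works on the raw bytes, so the first differing byte decides the result. The helper first_byte_difference is a hypothetical name introduced only for this illustration.

```ruby
# Hypothetical helper: return [index, byte_a, byte_b] for the first position
# where the bytes of a and b differ. Returns nil when a is a prefix of b
# (or the two strings are equal).
def first_byte_difference(a, b)
  a.bytes.zip(b.bytes).each_with_index do |(x, y), i|
    return [i, x, y] if x != y
  end
  nil
end

# 'a' (97) vs ' ' (32): the space sorts first, so 'quia et' > 'qui qui'.
p first_byte_difference('quia et', 'qui qui')  # => [3, 97, 32]

# 'u' (117) vs the first UTF-8 byte of 'ü' (195), so 'quib' < 'qüia'.
p first_byte_difference('quib', 'qüia')        # => [1, 117, 195]
```

Neither comparison looks at collation weights, which is why both the space and the accented character end up in a different place than they do in Postgres.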

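I can fix the handling of accented characters, but I can't get Ruby to handle spaces correctly. For example, here is how each of them sorts the same list.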

Postgres: ["hic et illum", "quia et ipsa", "qui qui non"]
Ruby:     ["hic et illum", "qui qui non", "quia et ipsa"]

I tried the icunicode gem:

array.sort_by {|s| s.unicode_sort_key}

This handles accented characters, but it does not handle spaces correctly.

How can I get Ruby to sort using the Unicode Collation Algorithm?

Update: A more comprehensive example can be found in Unicode® Technical Standard #10. These are in the correct order.

  [
    "di Silva   Fred",
    "diSilva    Fred",
    "disílva    Fred",
    "di Silva   John",
    "diSilva    John",
    "disílva    John"
  ]

Answer 1

Sch*_*ern 5

I got very close using this algorithm with the icunicode gem.

require 'icunicode'

def database_sort_key(key)
  key.gsub(/\s+/,'').unicode_sort_key
end

array.sort_by { |v|
  [database_sort_key(v), v.unicode_sort_key]
}

First we sort using the unicode sort key with whitespace removed. Then if those are the same we sort by the unicode sort key of the original value.

This works around a weakness in unicode_sort_key: it doesn't consider spaces to be weak.

2.4.4 :007 > "fo p".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "4", "47", "1", "8", "1", "8"] 
2.4.4 :008 > "foo".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"]

Note that the space in fo p is as important as any other character. This results in 'fo p' < 'foo' which is incorrect. We work around this by first stripping out spaces before generating the key.

2.4.4 :011 > "fo p".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "47", "1", "7", "1", "7"] 
2.4.4 :012 > "foo".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"]

Now 'foo' < 'fo p' which is correct.

But because of this normalization, some values appear identical once whitespace has been stripped; fo o should be less than foo. So if the database_sort_keys are the same, we compare their plain unicode_sort_keys.

In some edge cases this is wrong. foo should be less than fo o, but this gets it backwards.
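
As a rough, gem-free sketch of the two-level key described above: Ruby compares arrays element by element, so the second element only matters when the first elements tie. Note that approx_sort_key is a hypothetical stand-in for icunicode's unicode_sort_key (it only strips accents and case), so this illustrates the shape of the technique rather than real collation.

```ruby
# Hypothetical stand-in for icunicode's String#unicode_sort_key, used so the
# sketch runs without the gem. It only removes accents and case differences.
def approx_sort_key(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# First-level key: the same stand-in key, computed with all whitespace removed.
def database_sort_key(str)
  approx_sort_key(str.gsub(/\s+/, ''))
end

# Arrays compare element-wise, so the key of the original string is only
# consulted when the whitespace-free keys are equal.
def sort_like_database(strings)
  strings.sort_by { |s| [database_sort_key(s), approx_sort_key(s)] }
end

p sort_like_database(["qui qui non", "hic et illum", "quia et ipsa"])
# => ["hic et illum", "quia et ipsa", "qui qui non"]

# The tie-break puts the spaced form first, which is the edge case noted above.
p sort_like_database(["foo", "fo o"])
# => ["fo o", "foo"]
```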

Here is the approach packaged as Enumerable methods.

require 'icunicode'

module Enumerable
  # Just like `sort`, but tries to sort the same as the database does
  # using the proper Unicode collation algorithm. It's close.
  #
  # Differences in spacing, cases, and accents are less important than
  # character differences.
  #
  # "foo" < "fo p": o vs p is more important than the space difference
  # "Foo" < "fop": o vs p is more important than the case difference
  # "föo" < "fop": o vs p is more important than the accent difference
  #
  # It does not take a block.
  def sort_like_database(&block)
    if block_given?
      raise ArgumentError, "Does not accept a block"
    else
      # Sort by the database sort key. Two different strings can have the
      # same keys, if so sort just by its unicode sort key.
      sort_by { |v| [database_sort_key(v), v.unicode_sort_key] }
    end
  end

  # Just like `sort_by`, but it sorts like `sort_like_database`.
  def sort_by_like_database(&block)
    sort_by { |v|
      field = block.call(v)
      [database_sort_key(field), field.unicode_sort_key]
    }
  end

  # Sort by the unicode sort key after stripping out all spaces. This provides
  # a decent simulation of the Unicode collation algorithm and how it handles
  # spaces.
  private def database_sort_key(key)
    key.gsub(/\s+/,'').unicode_sort_key
  end
end

Answer 2

gwc*_*des 5

Does your use case allow simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?

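Part of the difficulty here is that there is no single correct way to sort: any of the variable elements can produce fairly significant differences in the final sort order. See, for example, the section on variable weighting.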

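For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA, backed by a comprehensive test suite, but it targets the non-ignorable test cases, which differs from the Postgres implementation (Postgres appears to use the shifted-trimmed variant).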
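
To make the difference between the two variable-weighting options concrete, here is a toy model in plain Ruby. It is not the UCA and not what Postgres or twitter-cldr-rb actually run: the helper names toy_non_ignorable_key and toy_shifted_key are invented for this sketch, and simple string edits stand in for real collation weights.

```ruby
# Toy model only: real UCA compares multi-level collation weights.

# "Non-ignorable" (roughly): spaces and hyphens count as ordinary characters
# that happen to sort before letters.
def toy_non_ignorable_key(str)
  str.downcase
end

# "Shifted" (roughly): spaces and hyphens are ignored on the first pass and
# are only used to break ties afterwards.
def toy_shifted_key(str)
  [str.downcase.gsub(/[\s\-]/, ''), str.downcase]
end

terms = ["deluge", "death", "de-luge", "de luge"]

p terms.sort_by { |t| toy_non_ignorable_key(t) }
# => ["de luge", "de-luge", "death", "deluge"]

p terms.sort_by { |t| toy_shifted_key(t) }
# => ["death", "de luge", "de-luge", "deluge"]
```

The point is only that "death" lands in a different place depending on the option, so an implementation tested against one option can still disagree with a database using another.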

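The sheer number of test cases means you have no guarantee that a solution which works for some inputs will match the Postgres sort order in all cases. For example, will it handle en/em dashes correctly, or even emoji? You could fork and modify the twitter-cldr-rb gem, but I suspect that would not be a small task!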

If you need to handle values that do not exist in the database, you can ask Postgres to sort them in a lightweight way using a VALUES list:

sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten

Obviously this incurs a round trip to Postgres, but it should still be very fast.
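
For arbitrary strings it may be safer to build the VALUES list with bind parameters than to interpolate the values into the SQL. The following is a minimal sketch of that idea: values_sort_query is a hypothetical helper name, it only builds the query text, and actually running it assumes a Postgres connection that is not shown here.

```ruby
# Hypothetical helper: build a parameterized VALUES query for a list of
# strings. Returns [sql, params]; nothing here talks to the database.
def values_sort_query(terms)
  raise ArgumentError, "need at least one term" if terms.empty?

  # One placeholder per term, cast to text so Postgres knows the column type.
  placeholders = terms.each_index.map { |i| "($#{i + 1}::text)" }.join(",")
  sql = "SELECT term FROM (VALUES #{placeholders}) AS t(term) ORDER BY term ASC"
  [sql, terms]
end

sql, params = values_sort_query(["de luge", "death", "deluge"])
puts sql

# Assuming conn is a PG::Connection from the pg gem, this could be run as:
#   conn.exec_params(sql, params).column_values(0)
```

The terms travel as parameters, so quotes or other special characters in them never become part of the SQL text.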
