What are R's sorting rules for character vectors?

And*_*rie 16 sorting r

R sorts character vectors in an order that I would describe as alphabetical rather than ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"

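Three questions:

  1. What is the technically correct term for this sort order?
  2. I cannot find any reference to it in the manuals on CRAN. Where can I find a description of the sorting rules in R?
  3. How does this behaviour differ from that of other languages such as C, Java, Perl or PHP?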

Dir*_*tel 21

The Details section of the help for sort() states:

 The sort order for character vectors will depend on the collating
 sequence of the locale in use: see ‘Comparison’.  The sort order
 for factors is the order of their levels (which is particularly
 appropriate for ordered factors).

And help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
 the strings using the collating sequence of the locale in use: see
 ‘locales’.  The collating sequence of locales such as ‘en_US’ is
 normally different from ‘C’ (which should use ASCII) and can be
 surprising.  Beware of making _any_ assumptions about the 
 collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
 and collation is not necessarily character-by-character - in
 Danish ‘aa’ sorts as a single letter, after ‘z’.  In Welsh ‘ng’
 may or may not be a single sorting unit: if it is it follows ‘g’.
 Some platforms may not respect the locale and always sort in
 numerical order of the bytes in an 8-bit locale, or in Unicode
 point order for a UTF-8 locale (and may not sort in the same order
 for the same language in different character sets).  Collation of
 non-letters (spaces, punctuation signs, hyphens, fractions and so
 on) is even more problematic.

So it depends on your locale settings.
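To see the locale dependence in practice, here is a minimal sketch that compares the default sort with a C-locale sort. The variable names (x, c_sorted, c_sorted_again, old) are only illustrative. The default result is not shown because it varies by locale and platform.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Which collation rules is this session using?
Sys.getlocale("LC_COLLATE")

# Default sort: follows the session's collating sequence, so the
# result can differ between locales and platforms.
sort(x)

# method = "radix" sorts character vectors as in the C locale,
# so uppercase ASCII letters come before lowercase ones.
c_sorted <- sort(x, method = "radix")
c_sorted
# [1] "Cat" "Dog" "cat" "dog"

# Alternatively, switch the collation locale temporarily, then restore it.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_sorted_again <- sort(x)
invisible(Sys.setlocale("LC_COLLATE", old))
c_sorted_again
# [1] "Cat" "Dog" "cat" "dog"
```

If a script has to produce the same order on every machine, pinning the collation this way (or using method = "radix") is one option. The trade-off is that the result is byte order, not the order a reader of a given language would expect.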

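  • I won't try to improve on Dirk's description or the help pages, but outside of R people may find this described as dictionary sorting, albeit case-insensitive. Collation rules are a serious consideration, because naive text processing is usually written for English ordering, which works badly for some other languages. A good example is name sorting, which looks very odd to native speakers *or* to people who only consider the 26 letters in strict A-Z order. (3 upvotes)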