Why does sort -u treat U+2161 and U+2162 as the same character?

Yis*_*ang 7 linux sorting unicode gnu-coreutils

I have a file containing two characters, one per line:

$ cat roman
Ⅱ
Ⅲ

When I sort this file with sort -u, only one line is shown:

$ sort -u roman
Ⅱ

Ⅱ is code point U+2161 and Ⅲ is code point U+2162. Why is only one line shown?

Edit

$ xxd -g 1 roman
0000000: e2 85 a1 0a e2 85 a2 0a                          ........


$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

sort is from GNU coreutils.

$ sort --version
sort (GNU coreutils) 8.15
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

Edw*_*per 1

Try setting LC_COLLATE=C; does that fix the problem? It works for me:

$ export LANG=en_US.UTF-8
$ export LANGUAGE=en_US:en
$ export LC_CTYPE="en_US.UTF-8"
$ export LC_NUMERIC=en_US.UTF-8
$ export LC_TIME=en_US.UTF-8
$ export LC_COLLATE="en_US.UTF-8"
$ export LC_MONETARY=en_US.UTF-8
$ export LC_MESSAGES="en_US.UTF-8"
$ export LC_PAPER=en_US.UTF-8
$ export LC_NAME=en_US.UTF-8
$ export LC_ADDRESS=en_US.UTF-8
$ export LC_TELEPHONE=en_US.UTF-8
$ export LC_MEASUREMENT=en_US.UTF-8
$ export LC_IDENTIFICATION=en_US.UTF-8
$ export LC_ALL=
$ sort -u foo.txt |wc -l         # <-- with your env variables
1
$ export LC_COLLATE=C
$ sort -u foo.txt |wc -l         # <-- with LC_COLLATE changed to C
2
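
If you would rather not change the environment for the whole session, the override can be applied to a single command. This is a minimal sketch, assuming a POSIX shell with mktemp available; the temporary file and the variable names tmpfile and c_count are made up for illustration. It recreates the two-line file from the byte dump in the question and counts the unique lines under byte-wise (C) collation. What the same command prints under en_US.UTF-8 depends on the locale data installed on your system.

```shell
# Recreate the file from the question; the octal escapes are the UTF-8 bytes
# e2 85 a1 (U+2161) and e2 85 a2 (U+2162), each followed by a newline.
tmpfile=$(mktemp)
printf '\342\205\241\n\342\205\242\n' > "$tmpfile"

# Set LC_ALL for this one command only. LC_ALL takes precedence over
# LC_COLLATE and LANG, so sort compares the lines byte by byte.
c_count=$(LC_ALL=C sort -u "$tmpfile" | wc -l | tr -d ' ')
echo "unique lines under C collation: $c_count"

rm -f "$tmpfile"
```

Using LC_ALL here rather than LC_COLLATE is a design choice: it also works when LC_ALL is already set to something else in the environment, where setting LC_COLLATE alone would have no effect.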

Looking at my copy of /usr/share/i18n/locales/en_US, I see:

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE

That is presumably where the behavior comes from, but I don't know why it says these two characters should collate together.
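
One way to dig further is to check which locale source files mention the code points at all. The sketch below is only an assumption about how you might look, not an explanation of the cause: find_collation_entries is a hypothetical helper, it relies on the <Uxxxx> symbolic names used in glibc locale sources, and it defaults to the directory shown above, which may be laid out differently on your system.

```shell
# Hypothetical helper: list the locale source files under a directory that
# mention a given code point.
#   $1 = code point in hex, e.g. 2161
#   $2 = directory to search (defaults to the path from this answer)
find_collation_entries() {
    grep -rl -- "<U$1>" "${2:-/usr/share/i18n/locales}" 2>/dev/null
}

# Print the matching files, or a note if nothing mentions the code point.
find_collation_entries 2161 || echo "no file mentions <U2161>"
```

If a file such as iso14651_t1 shows up in the output, opening it and searching for the symbol shows what the collation table says about that character, if anything.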