Sorting UTF-8 strings?

Question

jma*_*erx 4 c++ unicode

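My std::strings are encoded in UTF-8, so std::string's operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it doesn't cut it is accents: é sorts after z, which it shouldn't.

Thanks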

Answer 1

Gre*_*ill 6

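If you don't want a lexicographic sort (which is what sorting UTF-8 encoded strings byte by byte will give you), then you will need to decode your UTF-8 encoded strings to UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.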

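To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you get the same result as if you first decoded the strings into Unicode and compared the numeric values of each code point.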
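A minimal sketch of that property, assuming well-formed UTF-8 input. The helper names `decode_utf8`, `bytes_less`, and `code_points_less` are made up for this illustration, and the decoder does no validation, so it is not a substitute for a real Unicode library.

```cpp
#include <cstddef>
#include <string>

// Decode a UTF-8 string into code points.
// Assumption: the input is well-formed UTF-8; malformed or overlong
// sequences are not detected.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Byte-wise comparison: this is what std::string's operator< does.
bool bytes_less(const std::string &a, const std::string &b) {
    return a < b;
}

// Comparison by the numeric value of the decoded code points.
bool code_points_less(const std::string &a, const std::string &b) {
    return decode_utf8(a) < decode_utf8(b);
}
```

For any pair of well-formed strings the two predicates should agree. For instance, both place "z" before "é" (bytes 0xC3 0xA9, code point U+00E9), which is exactly the ordering the question complains about.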

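Update: Your updated question indicates that you want a more complex comparison function than a purely lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.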
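As a rough illustration of comparing decoded characters, here is a toy comparator that assumes the strings have already been decoded into `std::u32string`. The names `base_letter` and `toy_collate_less` are hypothetical, and the tiny mapping table covers only a handful of accented Latin letters; real collation rules are locale-dependent and far richer than this.

```cpp
#include <algorithm>
#include <string>

// Hypothetical toy mapping: fold a few accented letters onto their base
// letter. Everything else maps to itself. This is NOT real collation data.
char32_t base_letter(char32_t c) {
    switch (c) {
        case U'\u00E0': case U'\u00E1': case U'\u00E2':
            return U'a';
        case U'\u00E8': case U'\u00E9': case U'\u00EA': case U'\u00EB':
            return U'e';
        case U'\u00F2': case U'\u00F3': case U'\u00F4':
            return U'o';
        default:
            return c;
    }
}

// Compare first by the folded "primary" key, then break ties by the raw
// code points so that the result is still a strict weak ordering.
bool toy_collate_less(const std::u32string &a, const std::u32string &b) {
    std::u32string ka(a.size(), U'\0');
    std::u32string kb(b.size(), U'\0');
    std::transform(a.begin(), a.end(), ka.begin(), base_letter);
    std::transform(b.begin(), b.end(), kb.begin(), base_letter);
    if (ka != kb) return ka < kb;
    return a < b;
}
```

Passed as the third argument to `std::sort`, this comparator places "é" between "e" and "f" rather than after "z". For anything beyond a demonstration, a locale-aware facility is likely the better choice.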

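It depends on your locale. In German, ö sorts before p. In Swedish, the same letter sorts at the end of the alphabet. (8 upvotes)
@Milo: In many languages 'é' does not come after 'e'; it sorts together with it, so two words starting with those two letters are ordered by what follows their first letter. In some languages, some accented letters sort differently from their unaccented versions, and some languages have digraphs that sort differently from the two characters that make them up. For example, in Czech 'e' and 'ě' sort the same, but 'č' sorts after 'c' and 'ch' sorts after 'h' (IIRC). See http://userguide.icu-project.org/collation and http://www.unicode.org/reports/tr10/ for details. (6 upvotes)
Collation (sorting) and encoding are two completely separate issues, unless you treat the strings as byte arrays, ANSI style. http://www.joelonsoftware.com/articles/Unicode.html (3 upvotes)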

Answer 2

eph*_*ent 6

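The standard std::locale is intended for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.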

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
// Comparator that orders strings by the collation rules of a given locale.
// (std::binary_function was removed in C++17, so no base class is used.)
class collate_in {
  protected:
    // Keep the locale by value so the collate facet it owns stays alive
    // for as long as this comparator does.
    std::locale loc;
  public:
    explicit collate_in(const std::locale &l) : loc(l) {}
    bool operator()(const std::string &a, const std::string &b) const {
        const std::collate<char> &coll =
            std::use_facet<std::collate<char> >(loc);
        // std::collate::compare() takes (begin, end) pointer pairs and
        // returns values like strcmp or strcoll.  Compare to 0 for results
        // expected for a less<>-style comparator.
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size()) < 0;
    }
};
int main() {
    std::vector<std::string> v;
    // Read whitespace-separated words from standard input.
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // std::locale("") is the locale from the environment.  One could also
    // std::locale::global(std::locale("")) to set up this program's global
    // first, and then use locale() to get the global locale, or choose a
    // specific locale instead of using the environment's.
    std::sort(v.begin(), v.end(), collate_in(std::locale("")));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f

It has come to my attention that std::locale::operator()(a, b) exists, which makes the std::collate<>::compare(a, b) < 0 wrapper I wrote above unnecessary.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
    std::vector<std::string> v;
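    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // A std::locale is itself a comparator: its operator() compares two
    // strings using the locale's collate facet.
    std::sort(v.begin(), v.end(), std::locale(""));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));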
    return 0;
}
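
One practical caveat: constructing a named locale throws std::runtime_error when that locale is not installed on the system. A small sketch of a fallback, where the helper names `locale_or_classic` and `sort_with_locale` are invented for this example and falling back to the classic "C" locale is just one possible policy:

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Try to construct the named locale; if the system does not provide it,
// fall back to the classic "C" locale instead of propagating the exception.
std::locale locale_or_classic(const char *name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error &) {
        return std::locale::classic();
    }
}

// Sort in place, using the locale object itself as the comparator.
void sort_with_locale(std::vector<std::string> &v, const std::locale &loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

A caller might write `sort_with_locale(v, locale_or_classic("en_US.utf8"))`. Whether that name is available, and how accented letters are ordered under it, depends on the platform's installed locale data.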
