How do I read a UCS-2 file?

Question

gos*_*eta 4 c++ unicode encoding character-encoding wofstream

I am writing a program to extract information from *.rc files encoded in UCS-2 Little Endian.

int _tmain(int argc, _TCHAR* argv[]) {
  wstring csvLine(wstring sLine);
  wifstream fin("en.rc");
  wofstream fout("table.csv");
  wofstream fout_rm("temp.txt");
  wstring sLine;
  fout << "en\n";
  while(getline(fin,sLine)) {
    if (sLine.find(L"IDS") == wstring::npos)
      fout_rm << sLine << endl;
    else
      fout << csvLine(sLine);
  }
  fout << flush;
  system("pause");
  return 0;
}


The first line of "en.rc" is #include <windows.h>, but sLine looks like this:

[0]     255 L'ÿ'
[1]     254 L'þ'
[2]     35  L'#'
[3]     0
[4]     105 L'i'
[5]     0
[6]     110 L'n'
[7]     0
[8]     99  L'c'
.       .
.       .
.       .


The program works correctly with UTF-8 files. How can I make it work with UCS-2?

Answer 1

bam*_*s53 8

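Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert those bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., much like mbstowcs() does).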

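You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.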
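To make concrete what such a conversion involves, here is a minimal sketch that decodes little-endian UCS-2 bytes by hand. It is not the facet-based approach, only an illustration of the byte-to-code-unit step. The function name decode_ucs2_le is hypothetical, and the sketch assumes the whole file has already been read into a byte vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): decode raw UCS-2
// little-endian bytes into 16-bit code units.
// A leading FF FE byte-order mark is skipped if present, and a trailing
// odd byte (an incomplete code unit) is ignored.
std::u16string decode_ucs2_le(const std::vector<unsigned char>& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2; // skip the BOM

    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2)
    {
        // In little-endian order the low byte comes first
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

Fed the bytes from the question's dump (255, 254, 35, 0, 105, 0, 110, 0), this yields the three code units for "#in" instead of one wide character per byte.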

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    // You need to open the file in binary mode
    std::wifstream fin("en.rc", std::ios::binary);

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));

    // ^ We set 0xFFFF as the maxcode because that's the largest value that will fit
    //   in a single 16-bit wchar_t (as on Windows)
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
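
As a rough illustration of the BOM check that consume_header relies on, the sketch below inspects the first two bytes of a buffer. It is not how the facet is actually implemented; the names Utf16Bom and detect_utf16_bom are hypothetical.

```cpp
#include <cstddef>

enum class Utf16Bom { None, LittleEndian, BigEndian };

// Hypothetical helper: look for a UTF-16 byte-order mark at the start of a buffer.
// FF FE means little-endian, FE FF means big-endian.
// Caveat: FF FE 00 00 is also the UTF-32 little-endian BOM; this sketch
// does not try to tell the two apart.
Utf16Bom detect_utf16_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Bom::LittleEndian;
        if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Bom::BigEndian;
    }
    return Utf16Bom::None;
}
```

The 255, 254 pair at the start of the question's dump is the little-endian mark, which is why it shows up as 'ÿ' and 'þ' when each byte is widened on its own.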


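Note that there's a problem with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one code point. However, Windows uses UTF-16, which represents some code points as two wchar_ts. This means the standard API doesn't work very well on Windows.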

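The consequence is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single code point value greater than 16 bits, and then have to truncate that value to 16 bits to fit it into a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.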
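
To show why a surrogate pair cannot fit in a 16-bit wchar_t, here is a small sketch of the standard UTF-16 surrogate arithmetic. The helper names are hypothetical and are not part of the codecvt API.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating UTF-16 surrogate arithmetic.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high and a low surrogate into the code point they encode.
// The caller is assumed to have checked both units with the predicates above.
constexpr std::uint32_t combine_surrogates(char16_t high, char16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600, which needs more than 16 bits. Any file containing such pairs is outside what the UCS-2-limited code above can represent.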

There are a number of other problems with wchar_t, and you might want to avoid it entirely: What's "wrong" with C++ wchar_t?

Archived:	13 years, 5 months ago
Views:	4250
Last activity:	12 years, 4 months ago