Java - Unable to read special characters correctly in BufferedReader

Question

NSC*_*NSC 2 java bufferedreader

I have written code that reads data from a CSV file. However, I cannot handle special characters such as £.

For example, My Base Cost (K£) is read as My Base Cost (KÃ‚Â£).

What can I do to correct this?

    public void parseCSVFile(String filename) {

        try {
            br = new BufferedReader(new FileReader(csvDirectory + filename));

            while ((parsedLines = br.readLine()) != null) {

                String[] parsedData = parsedLines.split(csvSplitByComma);

                entireFeed.add(parsedData[0]);
                entireFeed.add(parsedData[1]);

                System.out.println(parsedData[0]);
                System.out.println(parsedData[1]);

                it = entireFeed.iterator();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Answer 1

VGR*_*VGR 5

The code that wrote the CSV is broken. It is triple-encoding the text it writes as UTF-8.

In UTF-8, ASCII characters (code points 0–127) are represented as single bytes; they need no further encoding. That's why only £ is affected.

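£ requires two bytes in UTF-8. Those bytes are 0xc2, 0xa3. If the code that wrote the CSV file had used UTF-8 correctly, the character would appear in the file as exactly those two bytes.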

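But, apparently, some code somewhere read the file using a one-byte charset (like ISO-8859-1), which caused each individual byte to be treated as its own character. It then encoded those individual characters using UTF-8. In other words, it took the bytes { 0xc2, 0xa3 } and encoded each of them in UTF-8. That in turn yielded these bytes: 0xc3, 0x82, 0xc2, 0xa3. (Specifically: the character U+00C2 is represented in UTF-8 as 0xc3 0x82, and the character U+00A3 is represented in UTF-8 as 0xc2 0xa3.)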

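Then, some time later, the same thing happened again. Those four bytes were read using a one-byte charset, each byte was treated as its own character, and each of those four characters was encoded in UTF-8, resulting in eight bytes: 0xc3, 0x83, 0xc2, 0x82, 0xc3, 0x82, 0xc2, 0xa3. (Not every character becomes two bytes when encoded as UTF-8; it just happens that all of these characters do.)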
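The byte sequences above can be reproduced with a minimal sketch. The class and method names (`MisencodingDemo`, `misencode`, `hex`) are made up for illustration, and ISO-8859-1 is assumed as the one-byte charset used by the faulty code:

```java
import java.nio.charset.StandardCharsets;

class MisencodingDemo {
    /** Simulates one faulty round: read UTF-8 bytes as ISO-8859-1, then re-encode as UTF-8. */
    static byte[] misencode(byte[] utf8Bytes) {
        String wronglyDecoded = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return wronglyDecoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated lowercase hex, e.g. "c2 a3". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] once = "\u00a3".getBytes(StandardCharsets.UTF_8); // correct UTF-8 for the pound sign
        byte[] twice = misencode(once);
        byte[] thrice = misencode(twice);

        System.out.println(hex(once));   // c2 a3
        System.out.println(hex(twice));  // c3 82 c2 a3
        System.out.println(hex(thrice)); // c3 83 c2 82 c3 82 c2 a3
    }
}
```

ASCII text passes through `misencode` unchanged, which matches the observation that only £ is damaged.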

That's why, when you read the file using the ISO-8859-1 charset, you get one character per byte:

    Ã   ƒ   Â   ‚   Ã   ‚   Â   £
    c3  83  c2  82  c3  82  c2  a3

(Technically, ‚ is actually U+201A "Single Low-9 Quotation Mark", but many one-byte-per-character Windows fonts historically had that character at position 0x82.)
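To make that distinction concrete, here is a small sketch that decodes the same eight bytes twice: once as strict ISO-8859-1 and once as windows-1252, the Windows code page that places U+201A at 0x82 and ƒ (U+0192) at 0x83. The class name `SingleByteViews` and its helper are invented for this illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SingleByteViews {
    // The eight bytes derived above for a triple-encoded pound sign.
    static final byte[] TRIPLE_ENCODED = {
        (byte) 0xc3, (byte) 0x83, (byte) 0xc2, (byte) 0x82,
        (byte) 0xc3, (byte) 0x82, (byte) 0xc2, (byte) 0xa3
    };

    /** Lists each char as U+XXXX so that invisible control characters show up. */
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String latin1 = new String(TRIPLE_ENCODED, StandardCharsets.ISO_8859_1);
        String cp1252 = new String(TRIPLE_ENCODED, Charset.forName("windows-1252"));

        // Strict ISO-8859-1 maps 0x82 and 0x83 to C1 control characters.
        System.out.println(codePoints(latin1));
        // U+00C3 U+0083 U+00C2 U+0082 U+00C3 U+0082 U+00C2 U+00A3

        // windows-1252 maps them to the printable characters shown in the table.
        System.out.println(codePoints(cp1252));
        // U+00C3 U+0192 U+00C2 U+201A U+00C3 U+201A U+00C2 U+00A3
    }
}
```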

So, now that we know how your file got this way, what do you do about it?

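First, stop making things worse. If you have control over the code that writes the file, make sure that code explicitly specifies a charset for both reading and writing. UTF-8 is almost always the best choice, at least for any file that contains mostly Western characters.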
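As a rough illustration of "explicitly specifies a charset", the sketch below writes and reads a file through `Files.newBufferedWriter` and `Files.newBufferedReader` with `StandardCharsets.UTF_8`, rather than relying on the platform default the way `new FileReader(path)` traditionally does. The class `ExplicitCharsetCsv`, its methods, and the temporary file are assumptions made for the example, not part of the original code:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ExplicitCharsetCsv {
    /** Writes the lines as UTF-8, regardless of the platform default charset. */
    static void writeLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    /** Reads the file as UTF-8, regardless of the platform default charset. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Writes one line to a temporary file and reads it back (illustration only). */
    static String roundTrip(String line) {
        try {
            Path file = Files.createTempFile("feed", ".csv");
            try {
                writeLines(file, Collections.singletonList(line));
                return readLines(file).get(0);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "My Base Cost (K\u00a3),100";
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```

Because both sides name the same charset, the pound sign survives the round trip as the single character U+00A3.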

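Second, how do you fix the file? There is no way to automatically detect this kind of mis-encoding, I'm afraid, but at least in the case of this file, you can decode it three times.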

If the file isn't very large, you can just read it all into memory:

    byte[] bytes = Files.readAllBytes(Paths.get(csvDirectory, filename));
    // First decoding: £ is represented as four characters
    String content = new String(bytes, "UTF-8");

    bytes = new byte[content.length()];
    for (int i = content.length() - 1; i >= 0; i--) {
        bytes[i] = (byte) content.charAt(i);
    }
    // Second decoding: £ is represented as two characters
    content = new String(bytes, "UTF-8");

    bytes = new byte[content.length()];
    for (int i = content.length() - 1; i >= 0; i--) {
        bytes[i] = (byte) content.charAt(i);
    }
    // Third decoding: £ is represented as one character
    content = new String(bytes, "UTF-8");

    br = new BufferedReader(new StringReader(content));

    // ...

If it's a large file, you'll need to read each line as bytes:

    try (InputStream in = new BufferedInputStream(
        Files.newInputStream(Paths.get(csvDirectory, filename)))) {

        ByteBuffer lineBuffer = ByteBuffer.allocate(64 * 1024);

        int b = 0;
        while (b >= 0) {
            lineBuffer.clear();

            for (b = in.read();
                 b >= 0 && b != '\n' && b != '\r';
                 b = in.read()) {

                lineBuffer.put((byte) b);
            }

            if (b == '\r') {
                in.mark(1);
                if (in.read() != '\n') {
                    in.reset();
                }
            }

            lineBuffer.flip();
            byte[] bytes = new byte[lineBuffer.limit()];
            lineBuffer.get(bytes);

            // First decoding: £ is represented as four characters
            String parsedLine = new String(bytes, "UTF-8");

            bytes = new byte[parsedLine.length()];
            for (int i = parsedLine.length() - 1; i >= 0; i--) {
                bytes[i] = (byte) parsedLine.charAt(i);
            }
            // Second decoding: £ is represented as two characters
            parsedLine = new String(bytes, "UTF-8");

            bytes = new byte[parsedLine.length()];
            for (int i = parsedLine.length() - 1; i >= 0; i--) {
                bytes[i] = (byte) parsedLine.charAt(i);
            }
            // Third decoding: £ is represented as one character
            parsedLine = new String(bytes, "UTF-8");

            // ...
        }
    }
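Both snippets repeat the same "characters back to bytes, then decode as UTF-8" step, so it could be factored into a helper. The following is only a sketch: `MojibakeRepair`, `undoRounds`, and `decodeTripleEncoded` are hypothetical names, and it swaps the manual cast loop for `getBytes(StandardCharsets.ISO_8859_1)`. The two agree for characters up to U+00FF, but `getBytes` substitutes `?` for anything above that range, whereas the cast keeps only the low eight bits:

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    /**
     * Undoes the given number of faulty "read as ISO-8859-1, write as UTF-8" rounds.
     * Assumes every character in the text is at most U+00FF at each step.
     */
    static String undoRounds(String text, int rounds) {
        String result = text;
        for (int i = 0; i < rounds; i++) {
            // Turn each character back into the single byte it came from...
            byte[] bytes = result.getBytes(StandardCharsets.ISO_8859_1);
            // ...and decode those bytes as the UTF-8 they originally were.
            result = new String(bytes, StandardCharsets.UTF_8);
        }
        return result;
    }

    /** Decodes raw file bytes that were encoded as UTF-8 three times over. */
    static String decodeTripleEncoded(byte[] fileBytes) {
        String firstDecoding = new String(fileBytes, StandardCharsets.UTF_8);
        return undoRounds(firstDecoding, 2);
    }

    public static void main(String[] args) {
        byte[] damaged = {
            'K',
            (byte) 0xc3, (byte) 0x83, (byte) 0xc2, (byte) 0x82,
            (byte) 0xc3, (byte) 0x82, (byte) 0xc2, (byte) 0xa3
        };
        String repaired = decodeTripleEncoded(damaged);
        System.out.println(repaired.equals("K\u00a3")); // true
    }
}
```

Applying this to a file that is not actually triple-encoded would corrupt it, so the number of rounds has to be established by inspecting the bytes first.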

Archived: 9 years, 6 months ago
Viewed: 7,239 times
Last activity: 9 years, 6 months ago