Java - Read UTF8 bytes from a File into a String in a system-independent manner

Run*_*ion 0 java utf-8

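How do I accurately read a UTF8-encoded file into a String in Java?

When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), it works from within Eclipse but not from the command line. It seems that Eclipse sets the file.encoding parameter when it runs App.

Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once the encoding of a file is known, I must be able to read it into a String regardless of the value of file.encoding.

The contents of the utf8 file are as follows: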

English Hello World.
Korean ?????.
Japanese ????????
Russian ?????? ???.
German Hallo Welt.
Spanish Hola mundo.
Hindi ???? ???????
Gujarati ???? ??????.
Thai ????????????.

- End of file -

The code is below. My observations are in the comments inside it.

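import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {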
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
    File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
            "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded. 
         * Input file created using Notepad++. Text copied from Google translate.
         */
        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */
        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */
        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes. 
         * Open the other output file in notepad++ and check. 
         */
        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work. 
         *  
         * This means that new String(bytes, charset); does not work correctly. 
         * There is no real effect of specifying charset to string.
         */
    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

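public static void close(InputStream inStream) {
    // Guard against null: the stream is null when its constructor threw.
    if (inStream != null) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(OutputStream outStream) {
    // Guard against null: the stream is null when its constructor threw.
    if (outStream != null) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}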

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        writer = null;
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
};

Edit: for @JB Nizet, and everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work. 
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

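I need to specify the encoding of the bytes when I read bytes into a String. I need to specify the encoding of the bytes when I write bytes from a String to a file.

Once I have a String in the JVM, I don't need to remember the source byte encoding, right?

When I write to a file, it should convert the String to my machine's default Charset (be it UTF8 or ASCII or cp1252). That is failing. UTF16 BE also fails. Why do some Charsets fail?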

JB *_*zet 5

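The encoding of the Java source file is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part: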

osw = new OutputStreamWriter(fos);

should be changed to

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you use the default encoding (which doesn't seem to be UTF8 on your system) instead of UTF8.
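As a minimal sketch of the same idea, the read and the write can both name the charset explicitly, so that file.encoding never comes into play. The class name Utf8RoundTrip and its helper methods are made up for illustration; the sample text is written with Unicode escapes so that the encoding of the .java file cannot matter either.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    /** Decodes the whole file as UTF-8, independent of the platform default. */
    static String readUtf8(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes the text as UTF-8, independent of the platform default. */
    static void writeUtf8(Path path, String text) {
        try {
            Files.write(path, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the text to a temporary file and reads it back. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8text", ".txt");
            try {
                writeUtf8(tmp, text);
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // English, Russian and Thai sample text, spelled with Unicode escapes.
        String original = "Hello World. \u041f\u0440\u0438\u0432\u0435\u0442 "
                + "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35";
        System.out.println(original.equals(roundTrip(original))); // prints true
    }
}
```

Because both directions name UTF-8, the result should be the same in Eclipse and on the command line.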

Note that Java accepts forward slashes in file paths, even on Windows. You could simply write

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");

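Edit:

Once I have a String in the JVM, I don't need to remember the source byte encoding, right?

Yes, you're right.

When I write to a file, it should convert the String to my machine's default Charset (be it UTF8 or ASCII or cp1252). That is failing.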

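If you don't specify any encoding, Java will indeed use the platform default encoding to transform the characters to bytes. If you specify an encoding (as shown at the beginning of this answer), it uses the encoding you tell it to use.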
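To see which default applies on a given machine, a small sketch like the following may help. The class name DefaultCharsetDemo and the helper noArgUsesDefault are hypothetical. The printed values depend on the JVM and on how it was started, so no output is shown here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DefaultCharsetDemo {

    /** True when the no-argument getBytes() matches encoding with the default charset. */
    static boolean noArgUsesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Both values are environment-dependent; Eclipse and a plain
        // command-line launch may report different ones.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println(noArgUsesDefault("Hallo Welt."));
    }
}
```

Running this once from Eclipse and once from the command line would be one way to check the suspicion that the two environments use different defaults.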

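But not all encodings can represent every Unicode character the way UTF8 does. ASCII, for example, only supports 128 different characters. Cp1252, AFAIK, only supports 256 characters. So the encoding succeeds, but it replaces each unencodable character with a special one (a question mark, for these charsets), which means: I can't encode this Thai or Russian character because it's not part of my supported character set.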
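The replacement behaviour can be observed directly with a short sketch. The class name UnencodableDemo and its roundTrip helper are made up for illustration. US-ASCII stands in for any charset that cannot represent Cyrillic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnencodableDemo {

    /** Encodes the text with the given charset, then decodes it again. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Privet" in Cyrillic, spelled with Unicode escapes.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Each Cyrillic letter is unmappable in ASCII and is silently replaced.
        System.out.println(roundTrip(russian, StandardCharsets.US_ASCII)); // prints ??????

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8).equals(russian)); // prints true

        // An encoder can be asked up front whether a text is representable.
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode(russian)); // prints false
    }
}
```

Checking canEncode before writing is one way to detect this kind of loss instead of finding question marks in the output file afterwards.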

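UTF16 encoding should be fine. But make sure to configure your text editor to use UTF16 when it reads and displays the content of the file. If it's configured to use another encoding, the displayed content won't be correct.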
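The three UTF-16 charsets in StandardCharsets produce different bytes for the same text, which may be why an editor guesses right for one and wrong for another. The following is only a sketch. The class name Utf16BomDemo and the hex helper are made up, and whether a particular editor relies on the byte order mark is an assumption, not something confirmed here.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16 writes a big-endian byte order mark (FE FF) before the text.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // prints FE FF 00 41
        // UTF_16BE and UTF_16LE write no byte order mark at all.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // prints 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // prints 41 00
    }
}
```

Without a byte order mark the editor has to guess the byte order, so a file written with UTF_16BE or UTF_16LE may need the encoding selected by hand.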