Character corruption going from BufferedReader to BufferedWriter in Java

Mik*_*ail 3 java special-characters html-parsing bufferedwriter bufferedreader

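In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.

I run into a known problem when the text contains a left-facing quotation mark. Text such as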

mutations to particular “hotspot” regions

comes out as

 mutations to particular “hotspot?? regions

I reduced the problem to a simple text-copy method:

public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            //Parsing would happen
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}

Can anyone offer some advice on how to correct this?

★ My solution

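InputStream in = new FileInputStream(myFile);
// Decode the input explicitly as UTF-8 instead of the platform default
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// Read through the buffered wrapper rather than the raw reader
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        // Everything from 126 upward is written as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();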
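
One caveat with escaping character by character: `Reader.read()` returns UTF-16 code units, so a symbol outside the Basic Multilingual Plane arrives as two surrogate values and would be written as two separate references. A minimal sketch of the same idea working on whole code points follows. The class name `EntityEscaper` and the method `escapeNonAscii` are made up for illustration, and the threshold of 128 (plain ASCII) is an assumption that differs slightly from the 126 used above.

```java
class EntityEscaper {

    // Replaces every code point outside ASCII with a decimal numeric
    // character reference such as &#8220; and leaves ASCII untouched.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                // cp is a full Unicode code point, so surrogate pairs
                // are emitted as a single reference.
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Curly quotes U+201C and U+201D around an ASCII word.
        System.out.println(escapeNonAscii("\u201Chotspot\u201D"));
    }
}
```

The output is pure ASCII, so it should survive any ASCII-compatible charset the writer happens to use.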

Thi*_*Roy 6

The file you are reading is in a different encoding (probably UTF-8) from the one used to write the output file (probably ISO-8859-1).
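
To illustrate what such a mismatch does to curly quotes, here is a small sketch. The class and method names (`EncodingMismatchDemo`, `misread`, `lossyWrite`) are hypothetical, and the two charsets are simply the ones guessed above, not something confirmed for the asker's machine.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which is what a reader assuming the wrong charset effectively does.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes the text as ISO-8859-1 and decodes it again. Characters that
    // ISO-8859-1 cannot represent are replaced with '?' during encoding.
    static String lossyWrite(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u201Chotspot\u201D"; // curly quotes around the word
        // Each curly quote is three bytes in UTF-8, so it turns into three characters.
        System.out.println(misread(original).length());
        System.out.println(lossyWrite(original));
    }
}
```

Either failure mode can show up in a copy loop, depending on which side of the pipeline uses the wrong charset.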

Try the following to generate a file with UTF-8 encoding:

BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
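
Putting both sides together, a copy loop that names the charset for the reader as well as the writer might look like the sketch below. It assumes Java 7 or later and that the input really is UTF-8; the class `CharsetAwareCopy` and its `copy` method are illustrative names, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetAwareCopy {

    // Copies a text file line by line, decoding and encoding with the
    // charsets passed in instead of the platform default.
    static void copy(Path source, Charset sourceCharset,
                     Path target, Charset targetCharset) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, targetCharset)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Parsing would happen here.
                out.write(line);
                out.newLine();
            }
        } // try-with-resources flushes and closes both files
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real input and output paths.
        Path source = Files.createTempFile("in", ".html");
        Path target = Files.createTempFile("out", ".html");
        Files.write(source,
                "mutations to particular \u201Chotspot\u201D regions\n".getBytes(StandardCharsets.UTF_8));
        copy(source, StandardCharsets.UTF_8, target, StandardCharsets.UTF_8);
        String copied = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(copied.contains("\u201Chotspot\u201D"));
    }
}
```

Note that `Files.newBufferedReader` throws an exception on malformed input rather than silently substituting characters, which makes a wrong guess about the source charset easier to notice.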

Unfortunately, determining a file's encoding is very difficult. See Java: How to determine the correct charset encoding of a stream
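
Although general detection is hard, one cheap check is possible: a strict decoder can tell you whether a file is at least valid UTF-8. The sketch below is a heuristic only, since valid UTF-8 bytes could in principle also be text in another charset; `Utf8Check` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed
    // sequence. The decoder is set to report errors instead of replacing them.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u201Chotspot\u201D".getBytes(StandardCharsets.UTF_8);
        // The byte 0xE9 (e-acute in ISO-8859-1) is an incomplete sequence in UTF-8.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

If the check fails, falling back to a single-byte charset such as ISO-8859-1 or windows-1252 is a common guess, but it remains a guess.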