Ale*_*les 11 java unicode utf8-decode
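I am trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains symbols like this: /æl?d???ri?/.
When I fetch the page source, I get text containing sequences such as &#250; instead.
So far I have tried the following code: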
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
    url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) { e.printStackTrace(); }
try {
    urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
    input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine())))
    {
        strB.append(str);
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }
Joo*_*gen 13
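The HTML page is UTF-8 and could contain Arabic characters and the like. However, the characters above Unicode code point 127 are still encoded as numeric entities such as &#250;. Loading the page as UTF-8 is entirely correct, so setting an Accept-Charset header will not help.
You have to decode the entities yourself. Something like: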
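import java.util.regex.Matcher;
import java.util.regex.Pattern;

String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    // Match decimal numeric entities such as &#250; and capture the digits.
    Matcher m = Pattern.compile("\\&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        // Copy the text before the match and drop the entity itself.
        m.appendReplacement(sb, "");
        // Append the character(s) for the decoded code point.
        sb.appendCodePoint(uc);
    }
    m.appendTail(sb);
    return sb.toString();
}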
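For completeness, here is a self-contained sketch of the same idea that can be compiled and run on its own. It goes a bit beyond the snippet above: it also accepts hexadecimal entities (&#xFA;) and leaves out-of-range numbers untouched. The class name NumericEntityDecoder and the method name decode are made up for this example; named entities such as &amp; are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class NumericEntityDecoder {
    // Group 1: hex digits of &#x...; entities. Group 2: decimal digits of &#...; entities.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // number does not fit in an int
            }
            // Keep the original text when the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement protects '$' and '\' in the decoded text.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // What gets displayed depends on the console encoding.
        System.out.println(decode("S&#250;d and S&#xFA;d"));
    }
}
```

If the input is full HTML rather than plain text with a few entities, an HTML parser or an existing unescape utility is probably a better fit than a regular expression.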
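By the way, these entities probably stem from processed HTML form input, that is, from the editing side of the web application.
After seeing the code in the question:
I have replaced the DataInputStream with a (buffered) Reader for text. InputStreams read binary data (bytes); Readers read text (Strings). An InputStreamReader takes an InputStream and an encoding as parameters and returns a Reader.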
try {
    BufferedReader input = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), "UTF-8"));
    StringBuilder strB = new StringBuilder();
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n");
    }
    input.close();
} catch (IOException e) {
    e.printStackTrace();
}
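To see why the choice of reader matters, the following sketch decodes the same UTF-8 bytes in both ways, using an in-memory stream as a stand-in for urlConn.getInputStream(). The class and method names (Utf8ReadDemo, readAll, readLineLegacy) are invented for this illustration. The deprecated DataInputStream.readLine() turns every byte into one char, so the two-byte UTF-8 sequence for æ comes out as two wrong characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class Utf8ReadDemo {
    // Reads all lines as text, decoding the bytes explicitly as UTF-8.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reads one line the way the question's code does: each byte becomes one char.
    @SuppressWarnings("deprecation")
    static String readLineLegacy(InputStream in) {
        try (DataInputStream data = new DataInputStream(in)) {
            return data.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "/æl/" encoded as UTF-8; æ (U+00E6) occupies two bytes, 0xC3 0xA6.
        byte[] bytes = "/\u00e6l/".getBytes(StandardCharsets.UTF_8);
        String good = readAll(new ByteArrayInputStream(bytes));
        String bad = readLineLegacy(new ByteArrayInputStream(bytes));
        System.out.println(good.equals("/\u00e6l/\n")); // prints true
        System.out.println(bad.equals("/\u00e6l/"));    // prints false
    }
}
```

The try-with-resources form used here also closes the reader when an exception is thrown, which the explicit input.close() call does not guarantee.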
Try adding a User-Agent header to your URLConnection:
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");
This solved my decoding problem; it worked like a charm.