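I have a text file that was saved with ANSI encoding. It should not have been, because it contains accented characters that ANSI does not support. I would rather work with UTF-8.
Can the data be decoded correctly, or was it lost in transcoding?
What tools could I use?
Here is a sample of what I have: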
ç é
I can tell from context (café should be café) that these should be the following two characters:
ç é
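
The sample above looks like UTF-8 bytes that were decoded as windows-1252 (the code page usually meant by "ANSI" on Western systems). If that assumption holds, the damage can often be reversed by re-encoding the garbled text to the wrong charset and decoding the resulting bytes as UTF-8. A minimal sketch; the class and method names are hypothetical, and the string literals use Unicode escapes so the example does not depend on the source file's own encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the text was UTF-8 that got decoded as windows-1252.
    private static final Charset WRONG = Charset.forName("windows-1252");

    // Turn the garbled characters back into the original bytes,
    // then decode those bytes with the charset they were written in.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(WRONG);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00A9 is how the UTF-8 bytes C3 A9 (e with acute) look in windows-1252.
        String garbled = "caf\u00C3\u00A9";
        System.out.println(repair(garbled));
    }
}
```

This only works while every original byte survived the wrong decoding. A few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no windows-1252 character, so text containing them may not round-trip.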

My pages often show things like Ã, Ã, Ã, Ã, Ã in place of normal characters.
I use utf8 for the page header and for the MySQL encoding. How does this happen?

Can anyone tell me the range of Unicode printable characters? [For example, the ASCII printable character range is \u0020 - \u007f]
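
Unicode has no single contiguous "printable" range, so a check usually goes by general category rather than by code point interval. The sketch below is one possible definition, not a standard one: it treats a code point as printable unless it is a control, format, surrogate, private-use, unassigned, or line/paragraph separator character. The class and method names are hypothetical. Note that under this definition U+007F (DEL) is a control character and is not printable.

```java
final class PrintableCheck {
    // Assumed definition of "printable": any assigned code point that is not
    // a control, format, surrogate, private-use or line/paragraph separator.
    static boolean isPrintable(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false;
        }
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return false;
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPrintable('A'));    // true
        System.out.println(isPrintable(0x00E9)); // true (e with acute)
        System.out.println(isPrintable(0x007F)); // false (DEL)
    }
}
```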
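
I am confused about string encoding in Java and have a few questions. Please help me if you know the answers:
1) What is the native encoding of Java strings in memory? When I write String a = "Hello", in which format is it stored? Since Java is machine independent, I assume the encoding is not done by the system.
2) I read online that "UTF-16" is the default encoding, but I am confused because when I write int a = 'c' I get the character's number from the ASCII table. So are ASCII and UTF-16 the same?
3) I am also not sure what the in-memory storage of a string depends on: the OS, the language?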
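
A small experiment can make these questions concrete. A Java char is a UTF-16 code unit, and the first 128 Unicode code points coincide with ASCII, which is why 'c' yields 99. The number of bytes only becomes relevant once a string is explicitly encoded. The sketch below illustrates this; the class and helper names are hypothetical, and how the JVM stores strings internally is an implementation detail that the language-level API does not expose.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StringEncodingDemo {
    // Number of bytes a string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        int a = 'c';
        System.out.println(a); // 99: same value in ASCII and Unicode

        String e = "\u00E9"; // e with acute, outside ASCII
        System.out.println(e.length());                                    // 1 UTF-16 code unit
        System.out.println(encodedLength(e, StandardCharsets.ISO_8859_1)); // 1 byte
        System.out.println(encodedLength(e, StandardCharsets.UTF_8));      // 2 bytes
        System.out.println(encodedLength(e, StandardCharsets.UTF_16BE));   // 2 bytes

        // A code point above U+FFFF needs two chars (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```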
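
I made a small REST web service using Jersey 1.11. When I call the URL that returns JSON, the character encoding of non-English characters is wrong. The corresponding XML URL ("test.xml") declares utf-8 in its opening xml tag.
How do I make the URL "test.json" return a utf-8 encoded response?
Here is the code for the service: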
    @Stateless
    @Path("/")
    public class RestTest {
        @EJB
        private MyDao myDao;

        @Path("test.xml/")
        @GET
        @Produces(MediaType.APPLICATION_XML)
        public List<Profile> getProfiles() {
            return myDao.getProfilesForWeb();
        }

        @Path("test.json/")
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public List<Profile> getProfilesAsJson() {
            return myDao.getProfilesForWeb();
        }
    }
Here is the pojo that the service uses:
    package se.kc.mimee.profile.model;

    @XmlRootElement
    public class Profile {
        public int id;
        public String name;

        public Profile(int id, String name) {
            this.id = id;
            this.name = name;
        }

        public Profile() {}
    }
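
I use the Javax Mail API to send emails. Input from a contact form has to be sent to a specific email address.
Sending the email works fine, but since I am Danish I need the three Danish characters 'æ', 'ø' and 'å' in both the subject and the body of the email.
I saw that I can use UTF-8 character encoding to provide these characters, but when my mail is sent I only see some strange letters, 'ã¦', 'ã¸' and 'ã¥', instead of the Danish letters 'æ', 'ø' and 'å'.
My method for sending the email looks like this: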
    public void sendEmail(String name, String fromEmail, String subject, String message)
            throws AddressException, MessagingException, UnsupportedEncodingException, SendFailedException
    {
        //Set Mail properties
        Properties props = System.getProperties();
        props.setProperty("mail.smtp.starttls.enable", "true");
        props.setProperty("mail.smtp.host", "smtp.gmail.com");
        props.setProperty("mail.smtp.socketFactory.port", "465");
        props.setProperty("mail.smtp.socketFactory.class", "javax.net.ssl.SSLSocketFactory");
        props.setProperty("mail.smtp.auth", "true");
        props.setProperty("mail.smtp.port", "465");
        Session session = Session.getDefaultInstance(props, new javax.mail.Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("my_username", "my_password");
            }
        });
        //Create the email with variable input
        MimeMessage mimeMessage = new MimeMessage(session);
        mimeMessage.setHeader("Content-Type", "text/plain; charset=UTF-8");
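        mimeMessage.setFrom(new …

I am looking for a way to detect the character set of a document. I have been reading about the Mozilla character set detection implementation here:
I also found a Java implementation of it called jCharDet:
Both are based on research carried out using a set of static data. What I am wondering is whether anyone has successfully used any other implementation, and if so, which one? Did you roll your own approach, and if so, what algorithm did you use to detect the character set?
Any help would be appreciated. I am not looking for a list of existing approaches found via Google, nor for a link to the Joel Spolsky article. Just to clarify :)
UPDATE: I did a lot of research into this and ended up finding a framework called cpdetector that uses a pluggable approach to character detection, see:
This provides BOM, chardet (Mozilla approach) and ASCII detection plugins. It is also very easy to write your own. There is also another framework, which provides much better character detection than the Mozilla approach, jchardet and so on.
It is quite easy to write your own plugin for cpdetector that uses this framework to provide a more accurate character encoding detection algorithm. It works better than the Mozilla approach.

I have .txt and .java files and I do not know how to determine their encoding (Unicode, UTF-8, ISO-8859, ...). Is there any program that can determine a file's encoding or show it?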
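
The cheapest check, and the one the BOM plugin mentioned above presumably performs, is to look for a byte order mark at the start of the file. The sketch below shows that idea only; the class and method names are hypothetical. Most files carry no BOM, in which case the encoding cannot be read off the bytes and has to be guessed with statistical detectors such as the ones discussed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
    // Returns the charset announced by a byte order mark, if one is present.
    // Simplification: UTF-32 BOMs are not handled, so a UTF-32LE file
    // (FF FE 00 00) would be reported as UTF-16LE.
    static Optional<Charset> fromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: the encoding is still unknown
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(fromBom(sample));
    }
}
```

In practice only the first few bytes of the file need to be read before calling fromBom.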
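
I am trying to use Process.Start with redirected I/O to call PowerShell.exe with a string, and to get the output back, all in UTF-8. But I don't seem to be able to make this work.
What I have tried:
- Passing the command to run via the -Command parameter
- Setting Console.OutputEncoding in both my console application and the PowerShell script
- Setting $OutputEncoding in PowerShell
- Setting Process.StartInfo.StandardOutputEncoding
- Doing all of the above with Encoding.Unicode instead of Encoding.UTF8
In every case, when I inspect the bytes I am given, I get different values from my original string. I would really like an explanation of why this does not work.
Here is my code: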
    static void Main(string[] args)
    {
        DumpBytes("Héllo");
        ExecuteCommand("PowerShell.exe", "-Command \"$OutputEncoding = [System.Text.Encoding]::UTF8 ; Write-Output 'Héllo';\"",
            Environment.CurrentDirectory, DumpBytes, DumpBytes);
        Console.ReadLine();
    }

    static void DumpBytes(string text)
    {
        Console.Write(text + " " + string.Join(",", Encoding.UTF8.GetBytes(text).Select(b => b.ToString("X"))));
        Console.WriteLine();
    }

    static int ExecuteCommand(string executable, string arguments, string workingDirectory, Action<string> output, Action<string> error)
    {
        try
        {
            using …

I need to detect corrupted text files that contain invalid (non-ASCII) utf-8, Unicode or binary characters.
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
What I have tried:
    iconv -f utf-8 -t utf-8 -c file.csv
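This converts the file from utf-8 encoding to utf-8 encoding, with -c used to skip invalid utf-8 characters. However, in the end those illegal characters still got printed. Are there other solutions, in bash on Linux or in other languages?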
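
As one option in another language, a strict decoder can be used purely as a validator: instead of silently replacing bad input, it reports the first malformed byte sequence. The following Java sketch takes that approach and assumes the whole file fits in memory as a byte array; the class and method names are hypothetical. It detects byte sequences that are not well-formed UTF-8. It does not flag text that was already damaged earlier and then saved as valid UTF-8, such as literal U+FFFD replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {
    // Returns -1 if the bytes are well-formed UTF-8, otherwise the byte
    // offset at which the first malformed sequence starts.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // decoder stops in front of the bad bytes
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // output buffer full: discard decoded text and continue
        }
    }

    public static void main(String[] args) {
        byte[] valid = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in UTF-8
        System.out.println(firstInvalidOffset(valid));
        System.out.println(firstInvalidOffset(invalid));
    }
}
```

To check a file, its bytes could be loaded with java.nio.file.Files.readAllBytes and passed to firstInvalidOffset.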