Tag: utf-16

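How does Java store UTF-16 characters in its 16-bit char type?

According to the Java SE 7 specification, Java uses the Unicode UTF-16 standard to represent characters. If you imagine a String as a simple array of 16-bit variables, each holding one character, life is simple.

Unfortunately, there are code points for which 16 bits are not enough (I believe they make up 16/17 of all Unicode characters). In a String this poses no direct problem, because storing one of these ~1,048,576 characters takes an additional two bytes, so the String simply uses two array positions for it.

So this works for Strings, because there can always be an additional two bytes. But a single variable, unlike the UTF-16 encoding, has a fixed length of 16 bits. How can these characters be stored there, and in particular, how does Java do it with its 2-byte "char" type?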
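The situation described above can be sketched in a few lines of Java. This is only an illustration, not the accepted answer: the class name `SurrogateDemo` and the helper `toUtf16` are made up for the example, and U+1F600 is an arbitrary code point above U+FFFF. It shows that such a code point does not fit in one `char`; it is held as an `int` and expands to two `char` values (a surrogate pair) inside a String.

```java
public class SurrogateDemo {
    // Hypothetical helper: returns the UTF-16 code units (Java chars)
    // that encode a single Unicode code point.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600;            // a code point above U+FFFF, held in an int
        char[] units = toUtf16(codePoint);  // two chars: high and low surrogate
        String s = new String(units);

        System.out.println(units.length);                          // 2
        System.out.println(Character.isHighSurrogate(units[0]));   // true
        System.out.println(Character.isLowSurrogate(units[1]));    // true
        System.out.println(s.length());                            // 2 (counts chars)
        System.out.println(s.codePointCount(0, s.length()));       // 1 (counts code points)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```

In other words, a single `char` variable can only hold one UTF-16 code unit; APIs that need to handle every Unicode character take or return an `int` code point instead.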

java variables unicode encoding utf-16

Kie*_*row

2012 10-29

23
Votes

2
Answers

8953
Views

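How do I encode/decode UTF-16LE byte arrays with a BOM?

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a byte order mark (BOM), and I need to encode byte arrays with a BOM as well.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work in big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method that encodes a java.lang.String as little-endian UTF-16 with a BOM: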

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: every JVM must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}


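What would be the best way to do this in Java? Ideally, I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but using the java.lang.String constructor …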
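One possible approach, sketched here under stated assumptions rather than taken from the answers: prepend U+FEFF to the text before encoding as UTF-16LE, and decode with the BOM-aware "UTF-16" charset. The class name `Utf16LeBom` is made up, and `StandardCharsets` requires Java 7 or later. The string concatenation still copies characters internally, so this removes the manual `System.arraycopy`, not all copying.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeBom {
    // Encode as UTF-16LE with a leading BOM by prepending U+FEFF,
    // which UTF-16LE encodes as the bytes FF FE.
    static byte[] encode(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    // Decode BOM-prefixed UTF-16. The "UTF-16" charset reads the BOM to
    // choose the byte order and does not include it in the result.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encode("hi");
        for (byte b : encoded) {
            System.out.printf("%02X ", b);   // FF FE 68 00 69 00
        }
        System.out.println();
        System.out.println(decode(encoded)); // hi
    }
}
```

Note that decoding the same bytes with `StandardCharsets.UTF_16LE` would keep the BOM as a leading U+FEFF character in the resulting String, which is usually not what is wanted.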

java unicode byte-order-mark utf-16

Jar*_*aus

2009 05-19

22
Votes

3
Answers

30k
Views

Valid locale names

How do you find valid locale names?

I am currently using Mac OS X.
But information about other platforms would also be useful.

#include <fstream>
#include <iostream>


int main(int argc,char* argv[])
{
    try
    {
        std::wifstream  data;
        data.imbue(std::locale("en_US.UTF-16"));
        data.open("Plop");
    }
    catch(std::exception const& e)
    {
        std::cout << "Exception: " << e.what() << "\n";
        throw;
    }
}

% g++ main.cpp
% ./a.out
Exception: locale::facet::_S_create_c_locale name not valid
Abort


c++ locale utf-16

Mar*_*ork

2009 12-18

22
Votes

1
Answers

10k
Views

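Using JNA to get/set the application identifier

Following up on my previous question about the Windows 7 taskbar, I would like to diagnose why Windows does not recognize my application as independent of javaw.exe. I currently have the following JNA code to get the AppUserModelID: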

public class AppIdTest {

    public static void main(String[] args) {
        NativeLibrary lib;
        try {
            lib = NativeLibrary.getInstance("shell32");
        } catch (Error e) {
            System.err.println("Could not load Shell32 library.");
            return;
        }
        Object[] functionArgs = new Object[1];
        String functionName = null;
        Function function;
        try {
            functionArgs[0] = new String("Vendor.MyJavaApplication")
                    .getBytes("UTF-16");
            functionName = "GetCurrentProcessExplicitAppUserModelID";
            function = lib.getFunction(functionName);
            // Output the current AppId
            System.out.println("1: " + function.getString(0));
            functionName = "SetCurrentProcessExplicitAppUserModelID";
            function = lib.getFunction(functionName);
            // Set the new …
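A side note on the `getBytes("UTF-16")` call in the code above, offered as a sketch and not as a diagnosis of the asker's problem: Java's "UTF-16" encoder writes a BOM followed by big-endian code units, whereas Windows wide-character strings are normally little-endian UTF-16 without a BOM and with a terminating NUL. The example below uses only the JDK (no JNA); the class name `Utf16ByteLayout` and the `hex` helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ByteLayout {
    // Hypothetical helper: renders bytes as uppercase hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A";
        // "UTF-16": BOM (FE FF) followed by big-endian code units.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));            // FEFF0041
        // "UTF-16LE": little-endian code units, no BOM.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));          // 4100
        // With an explicit NUL terminator appended before encoding.
        System.out.println(hex((s + "\0").getBytes(StandardCharsets.UTF_16LE))); // 41000000
    }
}
```

Whether the byte layout is what actually prevents the AppUserModelID call from working cannot be determined from the truncated code shown here.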


java utf-16 jna windows-7

Pau*_*sma

2017 05-23

21
Votes

1
Answers

6554
Views

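Convert utf-16 -> utf-8 and remove the BOM

We have a data entry person who encodes in UTF-16 on Windows, and we would like to have utf-8 without the BOM. The utf-8 conversion works, but the BOM is still there. How do I remove it? This is what I currently have: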

import codecs
import os

batch_3={'src':'/Users/jt/src','dest':'/Users/jt/dest/'}
batches=[batch_3]

for b in batches:
  s_files=os.listdir(b['src'])
  for file_name in s_files:
    ff_name = os.path.join(b['src'], file_name)  
    if (os.path.isfile(ff_name) and ff_name.endswith('.json')):
      print ff_name
      target_file_name=os.path.join(b['dest'], file_name)
      BLOCKSIZE = 1048576
      with codecs.open(ff_name, "r", "utf-16-le") as source_file:
        with codecs.open(target_file_name, "w+", "utf-8") as target_file:
          while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
              break
            target_file.write(contents)


If I run hexdump -C I see:

Wed Jan 11$ hexdump -C svy-m-317.json 
00000000  ef bb bf 7b 0d 0a 20 20  20 20 22 6e 61 6d 65 22  |...{..    "name"|
00000010  3a 22 53 …
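The leading `ef bb bf` in the dump is U+FEFF encoded as UTF-8. Decoding with an explicit little-endian codec such as `utf-16-le` treats the BOM as an ordinary character and passes it through, while a BOM-aware `utf-16` decoder consumes it. The same idea is sketched below in Java rather than Python, as an illustration only: the class name `Utf16ToUtf8` and the buffer size are assumptions, and Java's "UTF-16" decoder falls back to big endian when the input has no BOM.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf16ToUtf8 {
    // Streams a BOM-prefixed UTF-16 file into a UTF-8 file in fixed-size chunks.
    // The "UTF-16" decoder consumes the BOM, so no BOM reaches the output.
    static void convert(Path source, Path target) throws IOException {
        try (Reader in = new InputStreamReader(
                 Files.newInputStream(source), StandardCharsets.UTF_16);
             Writer out = new OutputStreamWriter(
                 Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192]; // arbitrary chunk size
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

A call such as `Utf16ToUtf8.convert(Paths.get("in.json"), Paths.get("out.json"))` (both file names are placeholders) would then produce a UTF-8 file that starts directly with the first real character.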


python unicode utf-8 utf-16

tim*_*one

2019 07-08

21
Votes

2
Answers

30k
Views

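How do I read a file encoded in utf-16 in nodejs?

I have to read a file encoded in UTF-16 using nodejs (in chunks, because it is very large). The data from the file will go into mongodb, so I need to convert it to utf-8. From googling, it seems that Node does not support this and that I will have to resort to converting the raw data from a buffer myself. But I also think there should be a better way and I'm just not finding it. Any suggestions?

Thanks.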

utf-16 node.js

Rya*_*yne

lucky-day

20
Votes

2
Answers

10k
Views

UnicodeDecodeError when performing os.walk

I am getting the error:

'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

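
when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-utf8) character in their names. The files come from a Windows system (hence the utf-16 filenames), but I have copied them to a Linux system and am using python 2.7 (running on Linux) to traverse the directories.

I have tried passing a unicode start path to os.walk, and all the files and dirs it generates have unicode names until it reaches a non-utf8 name. Then, for some reason, it does not convert those names to unicode, and the code chokes on the utf-16 names. Is there any way to solve this short of manually finding and changing all the offending names?

If there is no solution in python2.7, can a script be written in python3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? Note that there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.

UPDATE: The fact that 0x8b is still only a single-byte char (just not valid ascii) makes it even more puzzling. I have verified that converting such a string to unicode fails, but that a unicode version can be created directly. To wit: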

>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in  range(128)
>>> 
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'


Here is the traceback of the error I get: