How do I convert a WideString to a Unicode byte string?

dan*_*tei -4 delphi unicode delphi-6

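When I create a file in Notepad containing the (example) string 1d and save it as a Unicode file, I get a 6-byte file containing the bytes #255#254#49#0#100#0.

Fine. Now I need a Delphi 6 function that takes the (example) input WideString 1d and returns a string containing #255#254#49#0#100#0 (and vice versa).

How? Thanks. d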

Rem*_*eau 5

The bytes are easier to read in hexadecimal. #255#254#49#0#100#0 in hex is

FF FE 31 00 64 00

where

FF FE is the UTF-16LE BOM, which identifies the bytes that follow as UTF-16 encoded with values in little-endian order.

31 00 is the ASCII character '1'.

64 00 is the ASCII character 'd'.
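
To check this byte layout from code, here is a minimal sketch that dumps the in-memory UTF-16LE bytes of a WideString as hex pairs. The helper name WideStringToHex is hypothetical and is not part of the RTL; it assumes only SysUtils.IntToHex.

```pascal
uses
  SysUtils;

// Hypothetical helper (not an RTL function): returns the UTF-16LE bytes of W
// as space-separated hex pairs, low byte first.
function WideStringToHex(const W: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(W) do
  begin
    if I > 1 then
      Result := Result + ' ';
    // low byte, then high byte of each 16-bit code unit
    Result := Result + IntToHex(Word(W[I]) and $FF, 2) + ' ' +
              IntToHex(Word(W[I]) shr 8, 2);
  end;
end;
```

For a WideString variable W holding WideChar($FEFF) followed by '1d', WideStringToHex(W) returns 'FF FE 31 00 64 00'.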

Creating a WideString that contains these bytes is very simple:

var
  W: WideString;
  S: String;
begin
  S := '1d';
  W := WideChar($FEFF) + S;
end;

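When an AnsiString (Delphi 6's default string type) is assigned to a WideString, the RTL automatically converts the AnsiString data from 8-bit to UTF-16LE, using the local machine's default Ansi charset for the conversion.

Going the other way is just as simple: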

var
  W: WideString;
  S: String;
begin
  W := WideChar($FEFF) + '1d';
  S := Copy(W, 2, MaxInt);
end;

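When a WideString is assigned to an AnsiString, the RTL automatically converts the WideString data from UTF-16LE to 8-bit using the default Ansi charset.

If the default Ansi charset does not suit your needs (say the 8-bit data needs to be encoded in a different charset), you have to call the Win32 API MultiByteToWideChar() and WideCharToMultiByte() functions directly (or use a third-party library with equivalent functionality), so that you can specify the charset/codepage you need.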
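
As a rough sketch of what calling those two Win32 functions with an explicit codepage might look like: the wrapper names AnsiToWide and WideToAnsi are hypothetical, and error handling is reduced to returning an empty string.

```pascal
uses
  Windows;

// Hypothetical wrapper: converts 8-bit data in the given codepage to UTF-16.
function AnsiToWide(const S: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  // first call only asks how many WideChars are needed
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

// Hypothetical wrapper: converts UTF-16 to 8-bit data in the given codepage.
function WideToAnsi(const W: WideString; CodePage: UINT): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if W = '' then Exit;
  // first call only asks how many bytes are needed
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(W), Length(W),
    nil, 0, nil, nil);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  WideCharToMultiByte(CodePage, 0, PWideChar(W), Length(W),
    PAnsiChar(Result), Len, nil, nil);
end;
```

For example, AnsiToWide(S, 1251) would treat S as Windows-1251 data regardless of the machine's default Ansi charset, and CP_ACP selects that default explicitly.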

Now, Delphi 6 does not provide any useful helpers for reading Unicode files (Delphi 2009 and later do), so you have to do it manually, for example:

function ReadUnicodeFile(const FileName: string): WideString;
const
  cBOM_UTF8: array[0..2] of Byte = ($EF, $BB, $BF);
  cBOM_UTF16BE: array[0..1] of Byte = ($FE, $FF);
  cBOM_UTF16LE: array[0..1] of Byte = ($FF, $FE); 
  cBOM_UTF32BE: array[0..3] of Byte = ($00, $00, $FE, $FF);
  cBOM_UTF32LE: array[0..3] of Byte = ($FF, $FE, $00, $00);
var
  FS: TFileStream;
  BOM: array[0..3] of Byte;
  NumRead: Integer;
  U8: UTF8String;
  U32: UCS4String;
  I: Integer;
begin
  Result := '';
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    NumRead := FS.Read(BOM, 4);

    // UTF-8
    if (NumRead >= 3) and CompareMem(@BOM, @cBOM_UTF8, 3) then
    begin
      if NumRead > 3 then
        FS.Seek(-(NumRead-3), soCurrent);
      SetLength(U8, FS.Size - FS.Position);
      if Length(U8) > 0 then
      begin
        FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
        Result := UTF8Decode(U8);
      end;
    end

    // the UTF-16LE and UTF-32LE BOMs are ambiguous! Check for UTF-32 first...

    // UTF-32
    else if (NumRead = 4) and (CompareMem(@BOM, @cBOM_UTF32LE, 4) or CompareMem(@BOM, @cBOM_UTF32BE, 4)) then
    begin
      // UCS4String is not a true string type, it is a dynamic array, so
      // it must include room for a null terminator...
      SetLength(U32, ((FS.Size - FS.Position) div SizeOf(UCS4Char)) + 1);
      if Length(U32) > 1 then
      begin
        FS.ReadBuffer(U32[0], (Length(U32) - 1) * SizeOf(UCS4Char));
        if CompareMem(@BOM, @cBOM_UTF32BE, 4) then
        begin
          for I := Low(U32) to High(U32) do
          begin
            U32[I] := ((U32[I] and $000000FF) shl 24) or
                      ((U32[I] and $0000FF00) shl 8) or
                      ((U32[I] and $00FF0000) shr 8) or
                      ((U32[I] and $FF000000) shr 24);
          end;
        end;
        U32[High(U32)] := 0;
        // Note: UCS4StringToWidestring() does not actually support UTF-16,
        // only UCS-2! If you need to handle UTF-16 surrogates, you will
        // have to convert from UTF-32 to UTF-16 manually, there is no RTL
        // or Win32 function that will do it for you...
        Result := UCS4StringToWidestring(U32);
      end;
    end

    // UTF-16
    else if (NumRead >= 2) and (CompareMem(@BOM, @cBOM_UTF16LE, 2) or CompareMem(@BOM, @cBOM_UTF16BE, 2)) then
    begin
      if NumRead > 2 then
        FS.Seek(-(NumRead-2), soCurrent);
      SetLength(Result, (FS.Size - FS.Position) div SizeOf(WideChar));
      if Length(Result) > 0 then
      begin
        FS.ReadBuffer(PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
        if CompareMem(@BOM, @cBOM_UTF16BE, 2) then
        begin
          // big-endian data: swap the two bytes of every code unit
          for I := 1 to Length(Result) do
          begin
            Result[I] := WideChar(
                           ((Word(Result[I]) and $00FF) shl 8) or
                           ((Word(Result[I]) and $FF00) shr 8)
                         );
          end;
        end;
      end;
    end

    // something else, assuming UTF-8
    else
    begin
      if NumRead > 0 then
        FS.Seek(-NumRead, soCurrent);
      SetLength(U8, FS.Size - FS.Position);
      if Length(U8) > 0 then
      begin
        FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
        Result := UTF8Decode(U8);
      end;
    end;
  finally
    FS.Free;
  end;
end;

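Update: if you want to store the UTF-16LE encoded bytes in an AnsiString variable (why?), you can Move() the raw bytes of a WideString's character data into the memory block of an AnsiString, for example: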

function WideStringAsAnsi(const AValue: WideString): AnsiString;
begin
  SetLength(Result, Length(AValue) * SizeOf(WideChar));
  Move(PWideChar(AValue)^, PAnsiChar(Result)^, Length(Result));
end;

var
  W: WideString;
  S: AnsiString;
begin
  W := WideChar($FEFF) + '1d';
  S := WideStringAsAnsi(W);
end;

I would not recommend misusing an AnsiString like this, though. If you need bytes, operate on bytes, for example:

type
  TBytes = array of Byte;

function WideStringAsBytes(const AValue: WideString): TBytes;
begin
  SetLength(Result, Length(AValue) * SizeOf(WideChar));
  Move(PWideChar(AValue)^, PByte(Result)^, Length(Result));
end;

var
  W: WideString;
  B: TBytes;
begin
  W := WideChar($FEFF) + '1d';
  B := WideStringAsBytes(W);
end;
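
For the "vice versa" direction asked about in the question, a minimal sketch along the same lines could reinterpret raw bytes as WideChars. The name BytesAsWideString is hypothetical, and the sketch assumes the bytes are already UTF-16LE (a big-endian buffer would need the byte swap shown in ReadUnicodeFile).

```pascal
type
  TBytes = array of Byte;

// Hypothetical counterpart to WideStringAsBytes: copies raw UTF-16LE bytes
// into a WideString. A trailing odd byte, if present, is ignored.
function BytesAsWideString(const AValue: TBytes): WideString;
begin
  SetLength(Result, Length(AValue) div SizeOf(WideChar));
  if Length(Result) > 0 then
    Move(AValue[0], PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
end;
```

If the buffer starts with the FF FE BOM, the first character of the result is WideChar($FEFF), which can be dropped with Copy(W, 2, MaxInt) as in the earlier example.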