Parsing a UTF-8 string of known length one byte at a time in Common Lisp

Question

And*_*age 6 lisp binary common-lisp stream utf-8

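I am writing a program in Common Lisp for editing binary files generated by Minecraft that use the NBT format, which is documented here: http://minecraft.gamepedia.com/NBT_format?cookieSetup=true (I am aware that such tools already exist, for example NBTEditor and MCEdit, but neither is written in Common Lisp, and I thought this project would make a good learning exercise).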

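So far, the only piece of functionality I have not managed to implement myself is reading a UTF-8 string of known length that contains characters represented by more than one octet (i.e. non-ASCII characters). In the NBT format, every string is UTF-8 encoded and is preceded by a short (two-octet) integer n giving the length of the string. So, assuming that only ASCII characters are present in the string, I can simply read a sequence of n octets from the stream and convert it to a string with something like this: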

(defun read-utf-8-string (string-length byte-stream)
  (let ((seq (make-array string-length :element-type '(unsigned-byte 8)
                                       :fill-pointer t)))
    (setf (fill-pointer seq) (read-sequence seq byte-stream))
    (flexi-streams:octets-to-string seq :external-format :utf-8)))

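However, if one or more of the characters has a character code greater than 127, it is encoded in two or more octets, as in the following example: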

(flexi-streams:string-to-octets "wife" :external-format :utf-8)
==> #(119 105 102 101)

(flexi-streams:string-to-octets "жена" :external-format :utf-8)
==> #(208 182 208 181 208 189 208 176)
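
To make the gap between character count and octet count concrete, here is a minimal sketch in plain Common Lisp that computes how many octets UTF-8 needs for a string. The name `utf-8-octet-count` is made up for this illustration, and the sketch assumes an implementation in which every character holds one full Unicode code point (as in Clozure CL or SBCL), not UTF-16 code units.

```lisp
(defun utf-8-octet-count (string)
  "Return the number of octets needed to encode STRING as UTF-8.
Assumes each character of STRING is a whole Unicode code point."
  (loop for char across string
        for code = (char-code char)
        sum (cond ((< code #x80) 1)       ; ASCII range: one octet
                  ((< code #x800) 2)      ; up to U+07FF: two octets
                  ((< code #x10000) 3)    ; up to U+FFFF: three octets
                  (t 4))))                ; everything above: four octets

;; The Russian word is built from code points so the source stays ASCII.
(let ((english "wife")
      (russian (map 'string #'code-char '(#x436 #x435 #x43D #x430))))
  (list (length english) (utf-8-octet-count english)
        (length russian) (utf-8-octet-count russian)))
;; => (4 4 4 8)
```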

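Both strings have the same length, but each character of the Russian word is encoded in twice as many octets, so the total size of that string is double that of the English one. Knowing the string length therefore does not help if I use read-sequence. Even if the size of the string (i.e. the number of octets needed to encode it) were known, there would still be no way of telling which octets should be converted to characters individually and which should be grouped together for conversion. So, rather than rolling my own function, I tried to find a way to get the implementation (Clozure CL) or an external library to do the work for me. Unfortunately, that has proved problematic too, because my parser relies on using the same file stream for all of its read functions, like this: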

(with-open-file (stream "test.dat" :direction :input
                                   :element-type '(unsigned-byte 8))
  ;; Read entire contents of NBT file from stream here
  )

This restricts me to :element-type '(unsigned-byte 8), and so prevents me from specifying a character encoding and using read-char (or an equivalent), like this:

(with-open-file (stream "test.dat" :external-format :utf-8)
  ...)

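The :element-type has to be '(unsigned-byte 8) so that I can read and write integers and floats of various sizes. To avoid having to convert octet sequences to strings by hand, I first wondered whether there is a way to change the element type to a character type while the file is open, which led me to this discussion (https://groups.google.com/forum/#!searchin/comp.lang.lisp/binary$20write$20read/comp.lang.lisp/N0IESNPSPCU/Qmcvtk0HkC0J). Apparently some CL implementations, such as SBCL, use bivalent streams by default, so that both read-byte and read-char can be used on the same stream. If I took this approach, I would still need to be able to specify an :external-format for the stream (:utf-8), even though that format should apply only when reading characters, not when reading raw bytes.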

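For brevity I used a few functions from flexi-streams in the examples above, but my code so far uses only the built-in stream types, and I have not yet used flexi-streams proper. Is this problem a good candidate for flexi-streams? An extra layer of abstraction that let me read raw bytes and UTF-8 characters interchangeably from the same stream would be ideal.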
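
For reference, the kind of layering described here might look roughly like this with flexi-streams. It is only a sketch: it assumes flexi-streams is already loaded, `read-counted-utf-8-string` and `demo-mixed-reads` are hypothetical names, an in-memory octet vector stands in for the file, and the two-octet prefix is treated as a big-endian character count, as assumed in this question.

```lisp
;; Assumes flexi-streams has been loaded, e.g. with (ql:quickload :flexi-streams).

(defun read-counted-utf-8-string (stream)
  "Read a two-octet big-endian count N from STREAM, then N characters.
STREAM must accept both READ-BYTE and READ-CHAR, as a flexi stream does."
  (let* ((n (logior (ash (read-byte stream) 8) (read-byte stream)))
         (result (make-string n)))
    (dotimes (i n result)
      (setf (char result i) (read-char stream)))))

(defun demo-mixed-reads ()
  "Read a counted string and then one raw octet from the same stream."
  ;; The octets stand in for a file: count 4, four two-octet characters, then 42.
  (flexi-streams:with-input-from-sequence
      (in #(0 4 208 182 208 181 208 189 208 176 42))
    (let ((flexi (flexi-streams:make-flexi-stream in :external-format :utf-8)))
      (list (read-counted-utf-8-string flexi)
            (read-byte flexi)))))
```

With a real file, the same wrapper would presumably go around the stream opened by with-open-file with :element-type '(unsigned-byte 8), and every read function would then take the flexi stream instead of the raw one.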

Any advice from someone familiar with flexi-streams (or other relevant approaches) would be much appreciated.

Thanks.

Answer 1

Rai*_*wig 6

Here is something I wrote:

First, we want to know how long the encoding of a character actually is, given its first byte.

(defun utf-8-number-of-bytes (first-byte)
  "returns the length of the utf-8 code in number of bytes, based on the first byte.
The length can be a number between 1 and 4."
  (declare (fixnum first-byte))
  (cond ((=       0 (ldb (byte 1 7) first-byte)) 1)
        ((=   #b110 (ldb (byte 3 5) first-byte)) 2)
        ((=  #b1110 (ldb (byte 4 4) first-byte)) 3)
        ((= #b11110 (ldb (byte 5 3) first-byte)) 4)
        (t (error "unknown number of utf-8 bytes for ~a" first-byte))))

Then we decode:

(defun utf-8-decode-unicode-character-code-from-stream (stream)
  "Decodes byte values, from a binary byte stream, which describe a character
encoded using UTF-8.
Returns the character code and the number of bytes read."
  (let* ((first-byte (read-byte stream))
         (number-of-bytes (utf-8-number-of-bytes first-byte)))
    (declare (fixnum first-byte number-of-bytes))
    (case number-of-bytes
      (1 (values (ldb (byte 7 0) first-byte)
                 1))
      (2 (values (logior (ash (ldb (byte 5 0) first-byte) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 2))
      (3 (values (logior (ash (ldb (byte 4 0) first-byte) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 3))
      (4 (values (logior (ash (ldb (byte 3 0) first-byte) 18)
                         (ash (ldb (byte 6 0) (read-byte stream)) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 4))
      (t (error "wrong UTF-8 encoding for file position ~a of stream ~s"
                (file-position stream)
                stream)))))
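
As a worked example of the two-octet case, the first pair of octets from the question (208 182) can be decoded by hand. The helper below is a hypothetical illustration, not part of the code above; unlike the decoder above, it also checks that the second octet has the 10xxxxxx continuation pattern.

```lisp
(defun decode-two-octet-sequence (lead continuation)
  "Combine the payload bits of a two-octet UTF-8 sequence into a code point."
  (assert (= #b110 (ldb (byte 3 5) lead)))          ; lead octet looks like 110xxxxx
  (assert (= #b10 (ldb (byte 2 6) continuation)))   ; continuation looks like 10xxxxxx
  (logior (ash (ldb (byte 5 0) lead) 6)             ; five payload bits, shifted up
          (ldb (byte 6 0) continuation)))           ; six payload bits

;; 208 is #b11010000 and 182 is #b10110110, so the payloads are 16 and 54.
(decode-two-octet-sequence 208 182)
;; => 1078, i.e. #x436
```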

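You know how many characters there are: N. You can allocate a Unicode-capable string for N characters. So you call the function N times. Then, for each result, you convert it to a character and put it into the string.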
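
The steps just described could be sketched as follows. The function name `read-utf-8-characters` and its `:decoder` keyword are made up for this illustration. By default it calls the decoding function defined earlier in this answer, which must already be loaded. Any function that takes the stream and returns a code point plus the number of octets consumed can be passed instead.

```lisp
(defun read-utf-8-characters
    (stream n &key (decoder 'utf-8-decode-unicode-character-code-from-stream))
  "Read N characters from the binary STREAM and return them as a string.
DECODER is called with STREAM once per character and must return a code
point and the number of octets it consumed. The second return value is
the total number of octets read."
  (let ((result (make-string n))   ; element type CHARACTER by default
        (octets-read 0))
    (dotimes (i n)
      (multiple-value-bind (code size) (funcall decoder stream)
        ;; CODE-CHAR may return NIL for codes the implementation has no
        ;; character for; this sketch does not handle that case.
        (setf (char result i) (code-char code))
        (incf octets-read size)))
    (values result octets-read)))
```

Returning the octet total as a second value is a design choice of this sketch: it lets the caller keep track of how far the stream has advanced without calling file-position.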
