How to read UTF-8 characters from an infinite byte stream - C#

Mik*_*low 6 c# stream

Normally, to read characters from a byte stream you use a StreamReader. In this example I am reading records delimited by '\r' from an infinite stream.

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read();
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

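The problem is that StreamReader has a small internal buffer, so code waiting for the end-of-record delimiter ('\r' in this case) has to wait until StreamReader's internal buffer is flushed (usually because more bytes have arrived).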

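This alternative implementation works for single-byte UTF-8 characters but fails on multi-byte characters, because each byte is decoded on its own.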

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

How can I modify this code so that it works with multi-byte characters?

Ric*_*ard 10

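Rather than Encoding.UTF8.GetChars, which is designed for converting complete buffers, get a Decoder instance and repeatedly call its GetChars member method. This makes use of the Decoder's internal buffer to carry partial multi-byte sequences over from the end of one call to the next.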
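To make the difference concrete, here is a minimal sketch that decodes the same bytes one at a time in both ways. The class and method names (DecoderDemo, DecodeStateless, DecodeStateful) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text;

public static class DecoderDemo
{
    // Stateless: every byte goes through its own Encoding.UTF8.GetChars call,
    // so the pieces of a multi-byte sequence are never seen together.
    public static string DecodeStateless(byte[] bytes)
    {
        var sb = new StringBuilder();
        foreach (var b in bytes)
            sb.Append(Encoding.UTF8.GetChars(new[] { b }));
        return sb.ToString();
    }

    // Stateful: one Decoder instance remembers incomplete sequences between calls.
    public static string DecodeStateful(byte[] bytes)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        var oneByte = new byte[1];
        // Sized via GetMaxCharCount so one input byte can never overflow the output.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(1)];
        foreach (var b in bytes)
        {
            oneByte[0] = b;
            int charCount = decoder.GetChars(oneByte, 0, 1, chars, 0);
            sb.Append(chars, 0, charCount); // charCount is 0 while a sequence is incomplete
        }
        return sb.ToString();
    }
}
```

Given the bytes of a Japanese string, DecodeStateful should return the original text, while DecodeStateless typically returns replacement characters (U+FFFD) in place of each multi-byte character.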


Mik*_*low 5

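Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I tested it with multi-byte Japanese text and it works fine.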

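int byteAsInt;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var oneByte = new byte[1];
// One byte can complete a 4-byte sequence, which decodes to a surrogate pair (two chars).
var chars = new char[Encoding.UTF8.GetMaxCharCount(1)];

while ((byteAsInt = stream.ReadByte()) != -1)
{
    oneByte[0] = (byte) byteAsInt;
    // The decoder holds on to incomplete multi-byte sequences between calls.
    var charCount = decoder.GetChars(oneByte, 0, 1, chars, 0);
    if (charCount == 0) continue;

    Console.Write(chars, 0, charCount);
    messageBuilder.Append(chars, 0, charCount);

    // '\r' is a single byte, so it can only be the last char decoded from this byte.
    if (chars[charCount - 1] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}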
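If calling ReadByte for every byte turns out to be too slow, the same Decoder approach can be applied to whatever chunk Stream.Read hands back. The following is only a sketch under assumptions: RecordReader and ReadRecords are hypothetical names, and it relies on the underlying stream's Read returning as soon as some bytes are available (as NetworkStream does) rather than waiting to fill the whole buffer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RecordReader
{
    // Hypothetical helper: yields each '\r'-terminated record (delimiter included)
    // as soon as the delimiter has been decoded.
    public static IEnumerable<string> ReadRecords(Stream stream, int bufferSize = 4096)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var bytes = new byte[bufferSize];
        // Worst-case number of chars that bufferSize bytes can decode to.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
        var record = new StringBuilder();

        int bytesRead;
        // Read returns 0 only at end of stream; it may return fewer bytes than requested.
        while ((bytesRead = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            // A sequence split across two reads is completed by the decoder's state.
            int charCount = decoder.GetChars(bytes, 0, bytesRead, chars, 0);
            for (int i = 0; i < charCount; i++)
            {
                record.Append(chars[i]);
                if (chars[i] == '\r')
                {
                    yield return record.ToString();
                    record.Clear();
                }
            }
        }
    }
}
```

It could then be consumed with something like foreach (var record in RecordReader.ReadRecords(stream)) ProcessBuffer(record);. Text after the last '\r' is not returned, which matches the behaviour of the loop above.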