Reading a web page: avoiding the diamond/question mark shown for non-standard characters

Mik*_*ike 2 c# asp.net

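I am trying to read a web page whose content includes the registered trademark symbol, i.e. ®. However, when I use QuickWatch to inspect sb in the example below, I see a diamond with a question mark instead of ®. The same problem occurs if I serialize sb and display it in another web page via JavaScript. Is this just how the character is rendered in my QuickWatch window, or am I reading/decoding the page incorrectly? The code is as follows: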

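    const int bufSize = 4096;           // size of the read buffer, in bytes
    const int maxBytesToGet = 5000000;  // stop reading once roughly this much has been collected
    int bytesToGet;                     // number of bytes returned by each Read call
    byte[] buf = new byte[bufSize];
    StringBuilder sb = new StringBuilder(bufSize);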

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {

        using (Stream responseStream = response.GetResponseStream())
        {
            while ((bytesToGet = responseStream.Read(buf, 0, buf.Length)) != 0)
            {
                sb.Append(Encoding.UTF8.GetString(buf, 0, bytesToGet));
                if (sb.Length > maxBytesToGet) break;
            }
        }
    }

Sam*_*eff 5

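You are assuming the response is UTF-8. You need to look at the response headers to see what the actual encoding is. It is also easier to use a StreamReader instead of Encoding.GetString: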

    string responseText;

    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (Stream responseStream = response.GetResponseStream())
        {
            // Use the charset declared in the Content-Type response header.
            var encoding = Encoding.GetEncoding(response.CharacterSet);
            using (var reader = new StreamReader(responseStream, encoding))
            {
                // StreamReader decodes the whole stream with that encoding.
                responseText = reader.ReadToEnd();
            }
        }
    }
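
To see where the diamond comes from, here is a minimal sketch, assuming the page is served as ISO-8859-1 (the question does not say which charset the server actually uses). In that charset ® is the single byte 0xAE, which is not a valid UTF-8 sequence on its own, so a UTF-8 decoder substitutes U+FFFD, the replacement character that is usually drawn as a diamond with a question mark. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes bytes as UTF-8, the way the code in the question does.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes bytes using an explicitly named charset.
    public static string DecodeWithCharset(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }

    public static void Main()
    {
        // Assumption: the server sent "®" encoded as ISO-8859-1, i.e. the single byte 0xAE.
        byte[] body = { 0xAE };

        string wrong = DecodeAsUtf8(body);
        string right = DecodeWithCharset(body, "ISO-8859-1");

        Console.WriteLine(((int)wrong[0]).ToString("X4")); // FFFD (replacement character)
        Console.WriteLine(((int)right[0]).ToString("X4")); // 00AE (registered trademark sign)
    }
}
```

So the string itself already contains the wrong character; it is not just a QuickWatch display quirk. A StreamReader also keeps its decoder state between reads, so a multi-byte character that straddles two buffer reads is not broken apart, which can happen when each chunk is passed to Encoding.GetString separately.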

  • I also ended up using response.CharacterSet rather than response.ContentEncoding (2 upvotes)
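
The distinction in that comment matters: ContentEncoding reflects the Content-Encoding header (for example gzip), while CharacterSet is the charset taken from Content-Type, which is what a text decoder needs. The following is a rough sketch of a defensive lookup, since Encoding.GetEncoding throws ArgumentException for an empty or unrecognized name. The helper name ResolveEncoding and the choice of UTF-8 as the fallback are assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Maps a charset name (e.g. the value of HttpWebResponse.CharacterSet)
    // to an Encoding. Falling back to UTF-8 is an assumption; pick whatever
    // default suits the sites you are reading.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8; // no charset declared
        }

        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8; // charset name not recognized on this platform
        }
    }
}
```

It could then be used as `new StreamReader(responseStream, CharsetHelper.ResolveEncoding(response.CharacterSet))` in place of the direct Encoding.GetEncoding call.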