gar*_*man 5 html c# response decoding
I have this code that fetches a page's HTML from a URL, but the response content appears to be encoded.
Code:
HttpWebRequest xhr = (HttpWebRequest) WebRequest.Create(new Uri("https://www.youtube.com/watch?v=_Ewh75YGIGQ"));
xhr.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
//xhr.CookieContainer = request.Account.CookieContainer;
xhr.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
xhr.Headers["Accept-Encoding"] = "gzip, deflate, br";
xhr.Headers["Accept-Language"] = "en-US,en;q=0.5";
xhr.Headers["Upgrade-Insecure-Requests"] = "1";
xhr.KeepAlive = true;
xhr.UserAgent = "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)";
xhr.Host = "www.youtube.com";
xhr.Referer = "https://www.youtube.com/watch?v=6aCpYxzRkf4";
var response = xhr.GetResponse();
string html;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd();
}
These are the response headers:
X-XSS-Protection: 1; mode=block; report=https://www.google.com/appserve/security-bugs/log/youtube
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Content-Encoding: br
Transfer-Encoding: chunked
Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
Cache-Control: no-cache
Content-Type: text/html; charset=utf-8
Date: Sat, 24 Nov 2018 11:30:38 GMT
Expires: Tue, 27 Apr 1971 19:44:06 EST
P3P: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=it for more info."
Set-Cookie: PREF=f1=50000000&al=it; path=/; domain=.youtube.com; expires=Thu, 25-Jul-2019 23:23:38 GMT
Server: YouTube Frontend Proxy
The response string returned by StreamReader.ReadToEnd() looks like this:
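Yes, the answer above is correct. The response generated by the server is br-encoded, so you need to decode it. The default System.IO.Compression package does not include support for br encoding, so you have to install the Brotli.NET NuGet package.
Add this to your code to cover the three main encoding types: gzip, br, and deflate.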
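// Requires: using System.IO; using System.IO.Compression; using System.Net; using Brotli;
// webRequest is the HttpWebRequest built earlier (named xhr in the question).
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
Stream responseStream = response.GetResponseStream();
// Read the header once; treat a missing Content-Encoding as "no compression".
string contentEncoding = (response.ContentEncoding ?? "").ToLowerInvariant();
// Wrap the raw stream in the decompressor that matches Content-Encoding.
if (contentEncoding.Contains("gzip"))
    responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
else if (contentEncoding.Contains("deflate"))
    responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
else if (contentEncoding.Contains("br"))
    responseStream = new BrotliStream(responseStream, CompressionMode.Decompress);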
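To try the Content-Encoding dispatch without touching the network, here is a minimal, self-contained sketch. It rests on a few assumptions that are not in the answer: the names `ContentDecoding`, `Wrap`, and `ReadText` are hypothetical, and it uses the `BrotliStream` that ships in `System.IO.Compression` on .NET Core 2.1 and later rather than the Brotli.NET package (on .NET Framework you would swap in the package's `BrotliStream`). It also matches the header value exactly instead of using `Contains`, which only covers the common case of a single encoding.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ContentDecoding
{
    // Hypothetical helper: wrap the body in the decompressor named by Content-Encoding.
    public static Stream Wrap(Stream body, string contentEncoding)
    {
        string encoding = (contentEncoding ?? "").Trim().ToLowerInvariant();
        switch (encoding)
        {
            case "gzip":
                return new GZipStream(body, CompressionMode.Decompress);
            case "deflate":
                return new DeflateStream(body, CompressionMode.Decompress);
            case "br":
                // Assumption: built-in BrotliStream (.NET Core 2.1+), not Brotli.NET.
                return new BrotliStream(body, CompressionMode.Decompress);
            default:
                return body; // no encoding, or one this sketch does not handle
        }
    }

    // Decode the body and read it as UTF-8 text (the charset from the question's headers).
    public static string ReadText(Stream body, string contentEncoding)
    {
        using (Stream decoded = Wrap(body, contentEncoding))
        using (var reader = new StreamReader(decoded, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Compress a sample page in memory to stand in for a br-encoded server response.
        byte[] raw = Encoding.UTF8.GetBytes("<html><body>hello</body></html>");
        var buffer = new MemoryStream();
        using (var compressor = new BrotliStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            compressor.Write(raw, 0, raw.Length);
        }
        buffer.Position = 0;

        Console.WriteLine(ReadText(buffer, "br"));
    }
}
```

With a real response you would call something like `ContentDecoding.ReadText(response.GetResponseStream(), response.ContentEncoding)`.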
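The answer is in the response headers: Content-Encoding: br, which means Brotli compression.
There is a .NET implementation (NuGet package):
Install it into your project, add "using Brotli;", and replace the "using (StreamReader....." block with the following code: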
using (BrotliStream bs = new BrotliStream(response.GetResponseStream(), System.IO.Compression.CompressionMode.Decompress)) {
    using (System.IO.MemoryStream msOutput = new System.IO.MemoryStream()) {
        bs.CopyTo(msOutput);
        msOutput.Seek(0, System.IO.SeekOrigin.Begin);
        using (StreamReader reader = new StreamReader(msOutput)) {
            html = reader.ReadToEnd();
        }
    }
}
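If the project can target .NET Core 3.0 or later, a different route (not the approach in the answers above) is to let the HTTP handler decompress Brotli itself. The sketch below assumes that runtime and uses `HttpClient` instead of `HttpWebRequest`. `BrotliAwareClient` and `CreateHandler` are hypothetical names, and the URL is the one from the question.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class BrotliAwareClient
{
    // Hypothetical factory: a handler that transparently decodes gzip, deflate, and br.
    // Assumption: DecompressionMethods.Brotli is available (.NET Core 3.0 or later).
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip
                                   | DecompressionMethods.Deflate
                                   | DecompressionMethods.Brotli
        };
    }

    public static async Task Main()
    {
        using (var client = new HttpClient(CreateHandler()))
        {
            // The handler decompresses the body before it is turned into a string.
            string html = await client.GetStringAsync("https://www.youtube.com/watch?v=_Ewh75YGIGQ");
            Console.WriteLine(html.Length);
        }
    }
}
```

With this setup there is no need to set Accept-Encoding by hand or to wrap the response stream manually.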