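I'm writing a program that downloads HTML pages from other websites. For certain sites I can't get the complete HTML; I only receive part of the content. The servers with this problem send their data with "Transfer-Encoding: chunked", and I suspect that is the cause.
These are the headers the server returns: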
Transfer-Encoding: chunked
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html; charset=UTF-8
Date: Sun, 11 Sep 2011 09:46:23 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Server: nginx/1.0.6
Here is my code:
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response;
CookieContainer cookie = new CookieContainer();
request.CookieContainer = cookie;
request.AllowAutoRedirect = true;
request.KeepAlive = true;
request.UserAgent =
    @"Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 FirePHP/0.6";
request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
string html = string.Empty;
response = request.GetResponse() as HttpWebResponse;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd();
}
I only get part of the HTML (I think it is the first chunk sent by the server). Can anyone help? Any solution?
Thanks!
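You can't use ReadToEnd to read chunked data. You need to read the bytes directly from the response stream in a loop, calling Read until it returns 0.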
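// Requires: using System.IO; using System.Text;
byte[] buf = new byte[8192];
int count; // was undeclared in the original snippet
string html;
using (Stream resStream = response.GetResponseStream())
using (MemoryStream ms = new MemoryStream())
{
    // Read returns 0 only once the end of the stream is reached
    while ((count = resStream.Read(buf, 0, buf.Length)) > 0)
    {
        ms.Write(buf, 0, count);
    }
    // Decode once at the end so a multi-byte character split across
    // two reads is not corrupted; UTF8 is just hardcoded here
    html = Encoding.UTF8.GetString(ms.ToArray());
}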
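If you would rather not hardcode UTF8, here is a rough sketch of the same read loop that takes the charset name as a parameter and decodes incrementally. ResponseText and ReadAll are made-up names for illustration, not part of any library, and the fallback to UTF-8 for a missing or unknown charset is an assumption on my part. A Decoder is used instead of Encoding.GetString per buffer because it remembers partial characters between calls.

```csharp
using System;
using System.IO;
using System.Text;

static class ResponseText
{
    // Hypothetical helper: reads the stream to the end and decodes it with
    // the named charset. Falls back to UTF-8 (an assumption) when the name
    // is missing or not recognized.
    public static string ReadAll(Stream stream, string charsetName, int bufferSize = 8192)
    {
        Encoding encoding;
        try
        {
            encoding = string.IsNullOrEmpty(charsetName)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            encoding = Encoding.UTF8; // unknown charset name
        }

        // The Decoder keeps state between calls, so a multi-byte character
        // split across two Read calls is still decoded correctly.
        Decoder decoder = encoding.GetDecoder();
        byte[] bytes = new byte[bufferSize];
        char[] chars = new char[encoding.GetMaxCharCount(bufferSize)];
        StringBuilder sb = new StringBuilder();

        int count;
        while ((count = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, count, chars, 0);
            sb.Append(chars, 0, charCount);
        }
        return sb.ToString();
    }
}
```

With the response object from the question, a call would look roughly like ResponseText.ReadAll(response.GetResponseStream(), response.CharacterSet). I have not tried this against the server in question, so treat it as a starting point.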
gss*_*der -1
If I understand what you're asking, you can read it line by line:
string htmlLine = reader.ReadLine();
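A single ReadLine call returns only one line, so to get the whole page it has to be called in a loop until it returns null. A minimal sketch of that loop follows; LineReading and ReadAllLines are hypothetical names used only for this example, and the reader is assumed to be the StreamReader from the question.

```csharp
using System.Collections.Generic;
using System.IO;

static class LineReading
{
    // Hypothetical helper: collects every line until ReadLine returns null,
    // which signals the end of the input.
    public static List<string> ReadAllLines(TextReader reader)
    {
        var lines = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }
        return lines;
    }
}
```

Joining the result with string.Join("\n", lines) gives the page text back, although the original line endings are not preserved.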