Jak*_*kal 5 c# localhost web-scraping http-status-code-403 html-agility-pack
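My code below uses C# and HtmlAgilityPack to scrape a web page, then uses WebClient to download a string from another page. This works fine on localhost, but when I publish the code as an API service on Azure, or run it on a web hosting service (e.g. HostGator), I always get a 403 Forbidden error. I have tried many things to make it work, but for the life of me I can't figure this out. Any help would be greatly appreciated.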
using System;
using System.Net;
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://antenati.cultura.gov.it/ark:/12657/an_ud18290200");
//string returnedResult = doc.DocumentNode.OuterHtml; //this shows a 403 forbidden error response when not running from localhost.

// Find the element whose text contains "manifestId:". SelectSingleNode returns null
// when nothing matches (for example, when the 403 page comes back instead).
string ress = doc.DocumentNode.SelectSingleNode("//*[text()[contains(., 'manifestId:')]]")?.InnerText;
if (!string.IsNullOrEmpty(ress))
{
    string[] strPieces = ress.Split(new string[] { "manifestId:" }, StringSplitOptions.None);
    if (strPieces.Length >= 2)
    {
        // The manifest URL is the first comma-separated value after "manifestId:".
        string manifestUrl = strPieces[1].Split(',')[0].Replace("'", "").Trim();

        WebClient wb = new WebClient();
        wb.Headers.Add("origin", "https://antenati.cultura.gov.it");
        wb.Headers.Add("referer", "https://antenati.cultura.gov.it/");
        wb.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36");
        string result = wb.DownloadString(manifestUrl);
    }
}
The same code also fails with a 403 error when run on https://dotnetfiddle.net.
raj*_*jee -1
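A 403 error has two main causes. The one that matters for scraping is that the site returns 403, the "Forbidden" HTTP status code, as a ban page. These errors are common when you scrape websites protected by Cloudflare, because Cloudflare returns a 403 status code.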
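To see whether a 403 is this kind of ban page, it can help to inspect the blocked response itself. The sketch below is a heuristic only. It assumes that responses served through Cloudflare usually carry a `Server: cloudflare` header or a `CF-RAY` header; whether the site in the question sits behind Cloudflare is not established here. `BlockDiagnostics`, `LooksLikeCloudflareBlock` and `IsBlockedAsync` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class BlockDiagnostics
{
    // Heuristic (an assumption, not a guarantee): a 403 whose headers identify
    // Cloudflare is more likely a ban page than a genuine authorization failure.
    public static bool LooksLikeCloudflareBlock(HttpResponseMessage response)
    {
        if (response.StatusCode != HttpStatusCode.Forbidden)
            return false;

        bool serverIsCloudflare =
            response.Headers.TryGetValues("Server", out var values) &&
            values.Any(v => v.IndexOf("cloudflare", StringComparison.OrdinalIgnoreCase) >= 0);

        return serverIsCloudflare || response.Headers.Contains("CF-RAY");
    }

    // Fetches a URL and reports whether the response looks like a Cloudflare block.
    public static async Task<bool> IsBlockedAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            return LooksLikeCloudflareBlock(response);
        }
    }
}
```

Running `IsBlockedAsync` once from localhost and once from the hosted environment would show whether the two environments really receive different responses.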
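There are a few simple solutions:
Send a realistic browser user agent, for example {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}. When scraping at scale, maintain a large list of user agents and pick a different one for each request.
Send a complete set of browser-like request headers rather than the User-Agent alone:
{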
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
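Since the question is in C#, the headers above could be applied with `HttpClient` roughly as follows. This is a minimal sketch, not a guaranteed fix: `BrowserLikeRequests` and its members are hypothetical names, the two-entry user-agent pool is illustrative only, and `Accept-Encoding` and `Connection` are deliberately left to the handler, which manages compression and connection reuse itself. Headers alone may not help if the site blocks by IP range, which would also be consistent with the code working on localhost but failing on Azure.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class BrowserLikeRequests
{
    // Illustrative pool; a real one would be larger and kept up to date.
    private static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    };

    private static readonly Random Rng = new Random();

    // Picks a user agent at random; Random is not thread-safe, hence the lock.
    public static string PickUserAgent()
    {
        lock (Rng)
        {
            return UserAgents[Rng.Next(UserAgents.Length)];
        }
    }

    // Builds a GET request carrying the browser-like headers listed above.
    public static HttpRequestMessage BuildRequest(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        var headers = new Dictionary<string, string>
        {
            ["User-Agent"] = PickUserAgent(),
            ["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            ["Accept-Language"] = "en-US,en;q=0.5",
            ["Upgrade-Insecure-Requests"] = "1",
            ["Sec-Fetch-Dest"] = "document",
            ["Sec-Fetch-Mode"] = "navigate",
            ["Sec-Fetch-Site"] = "none",
            ["Sec-Fetch-User"] = "?1",
            ["Cache-Control"] = "max-age=0",
        };
        foreach (var header in headers)
        {
            request.Headers.TryAddWithoutValidation(header.Key, header.Value);
        }
        return request;
    }

    // The handler advertises gzip/deflate and decompresses responses transparently.
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
        };
        return new HttpClient(handler);
    }

    // Downloads the page body; throws HttpRequestException on a 403 or other failure.
    public static async Task<string> GetHtmlAsync(HttpClient client, string url)
    {
        using (HttpRequestMessage request = BuildRequest(url))
        using (HttpResponseMessage response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The returned string can then be parsed with HtmlAgilityPack via `new HtmlDocument()` followed by `doc.LoadHtml(html)`, in place of `HtmlWeb.Load`.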
Rotate requests across a pool of proxies, for example:
[
    'http://Username:Password@IP1:20000',
    'http://Username:Password@IP2:20000',
    'http://Username:Password@IP3:20000',
    'http://Username:Password@IP4:20000',
]
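In C#, a proxy list like the one above could be rotated roughly as follows. This is a sketch under stated assumptions: the proxy URLs are the placeholders from the list (not working endpoints), `ProxyRotation` and its members are hypothetical names, and a simple round-robin order is assumed where a real setup might prefer random selection or health checks.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;

public static class ProxyRotation
{
    // Placeholder entries copied from the list above; replace with real proxies.
    private static readonly string[] ProxyUrls =
    {
        "http://Username:Password@IP1:20000",
        "http://Username:Password@IP2:20000",
        "http://Username:Password@IP3:20000",
        "http://Username:Password@IP4:20000",
    };

    private static int _next = -1;

    // Returns the proxies in round-robin order; safe to call from several threads.
    public static Uri NextProxyUri()
    {
        int index = Interlocked.Increment(ref _next);
        return new Uri(ProxyUrls[(int)((uint)index % (uint)ProxyUrls.Length)]);
    }

    // Splits "scheme://user:password@host:port" into a proxy address plus credentials.
    public static WebProxy CreateProxy(Uri proxyUri)
    {
        var address = new UriBuilder(proxyUri.Scheme, proxyUri.Host, proxyUri.Port).Uri;
        var proxy = new WebProxy(address);

        string[] parts = proxyUri.UserInfo.Split(new[] { ':' }, 2);
        if (parts.Length == 2)
        {
            proxy.Credentials = new NetworkCredential(
                Uri.UnescapeDataString(parts[0]),
                Uri.UnescapeDataString(parts[1]));
        }
        return proxy;
    }

    // Builds an HttpClient whose requests all go through the given proxy.
    public static HttpClient CreateClient(Uri proxyUri)
    {
        var handler = new HttpClientHandler
        {
            Proxy = CreateProxy(proxyUri),
            UseProxy = true,
        };
        return new HttpClient(handler);
    }
}
```

A proxy is fixed per handler, so this sketch creates one `HttpClient` per proxy; caching one client per proxy and reusing it avoids exhausting sockets when many requests are made.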