Web scraping LinkedIn doesn't give me the HTML... what am I doing wrong?

Eat*_*ode 5 html python selenium beautifulsoup web-scraping

So, I'm trying to scrape the LinkedIn "About" pages of certain companies to get their "featured" content. When I tried to scrape LinkedIn with Beautiful Soup it gave me an access-denied error, so I'm using a header to spoof my browser. However, instead of the corresponding HTML it gives me the following output:

window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If the "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
    "&originalReferer=" + document.referrer.substr(0, 200) +
    "&sessionRedirect=" + encodeURIComponent(window.location.href);
}

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; 
rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": 
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
"Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", 
"DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

response = requests.get(url, headers=headers)
print(response.content) 

What exactly am I doing wrong? I think the page is trying to check cookies. Is there a way to add that to my code?
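(For reference: cookies set by the server can be carried across requests with a requests.Session. A minimal sketch, assuming the first response sets cookies that matter; LinkedIn may still redirect anonymous sessions to its auth wall:)

import requests

session = requests.Session()
# Reuse the browser-like headers from the snippet above (abbreviated here).
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) "
                  "Gecko/20100101 Firefox/66.0",
})

# The first request gives the server a chance to set its cookies...
session.get('https://www.linkedin.com')

# ...and the Session sends them back automatically on the follow-up request.
response = session.get('https://www.linkedin.com/company/biotech/')
print(response.status_code)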

小智 -1

You need to parse ("prettify") the response with BeautifulSoup first.

from bs4 import BeautifulSoup

# "response" is the requests response from the question's snippet.
page_content = BeautifulSoup(response.content, "html.parser")
# We use the html parser to parse the page body and store it in a variable.

# Collect the text of (at most) the first 20 <p> tags.
textContent = []
for paragraph in page_content.find_all("p")[:20]:
    textContent.append(paragraph.text)
# In my use case, I want to store the speech data I mentioned earlier, so in
# this example I loop through the paragraphs and push them into a list so
# that I can manipulate and do fun stuff with the data.

Not my example, but it can be found here: https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
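Side note: since the question is also tagged selenium, it's worth pointing out that the script in the output above only runs inside a real browser, so a JavaScript-capable driver is the usual workaround. A minimal sketch, assuming geckodriver (or another Selenium-supported driver) is installed; a logged-out browser may still be redirected to LinkedIn's auth wall:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # assumes geckodriver is on PATH
driver.get('https://www.linkedin.com/company/biotech/')

# page_source is the DOM after the browser has executed the page's JavaScript.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title)

driver.quit()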