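I want to send a "User-agent" value when requesting a web page with Python Requests. I'm not sure whether it is okay to send it as part of the headers, as in the code below: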
debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers = user_agent, config=debug)
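The debug output does not show the headers sent with the request.
Is it acceptable to send this information in the headers? If not, how should I send it?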
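For illustration, here is a minimal sketch of one way to check which headers Requests would send, without touching the network. The URL is a placeholder and not from the question; `Request(...).prepare()` builds the outgoing request so its headers can be inspected before anything is sent.

```python
import requests

url = "https://example.com/"  # placeholder URL, not from the question
headers = {"User-Agent": "Mozilla/5.0"}

# Build the request without sending it, so the headers can be inspected offline.
prepared = requests.Request("GET", url, headers=headers).prepare()

# Header names are case-insensitive, so "User-agent" would work as well.
print(prepared.headers["User-Agent"])
```

After a real call such as `response = requests.get(url, headers=headers)`, the headers that were actually sent are available as `response.request.headers`.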
I'm trying to download this page with Wget. Here is the page link:
Here is my command:
wget -O ebay.html --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&rt=nc&item=250972882769&si=a8iGAIchyvEbn7KveYFZ5QbEE7o%3D&print=all&category=31387"
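The page loads fine when I open it in a browser. With Wget, a different page is downloaded instead of the original one. I think the problem is the user agent. What is the solution?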
I'm using requests to fetch a URL, for example:
import socket
import requests

# url and doSth are assumed to be defined elsewhere in the script
while True:
    try:
        rv = requests.get(url, timeout=1)
        doSth(rv)
    except socket.timeout as e:
        print(e)
    except Exception as e:
        print(e)
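After running for a while, it stops working. There is no exception or error; it just seems to hang. I then stopped the process by pressing Ctrl+C in the console. The traceback shows that the process was waiting for data: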
.............
httplib_response = conn.getresponse(buffering=True) #httplib.py
response.begin() #httplib.py
version, status, reason = self._read_status() #httplib.py
line = self.fp.readline(_MAXLINE + 1) #httplib.py
data = self._sock.recv(self._rbufsize) #socket.py
KeyboardInterrupt
Why does this happen? Is there a solution?
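The post alone does not show what caused the hang, but one documented point is relevant: in Requests, `timeout` is not a limit on the whole request. It bounds the connection attempt and each wait for data on the socket. The sketch below shows one way to set the two timeouts separately and to catch Requests' own timeout exception instead of `socket.timeout`. The helper name `fetch` and the timeout values are assumptions made for illustration.

```python
import requests

def fetch(url, connect_timeout=3.05, read_timeout=10):
    """Return the response, or None if the request timed out or failed.

    Hypothetical helper; the timeout values are arbitrary examples.
    """
    try:
        # A (connect, read) tuple sets the two timeouts independently.
        return requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout as e:
        # Covers both ConnectTimeout and ReadTimeout.
        print("timed out:", e)
    except requests.exceptions.RequestException as e:
        # Any other Requests error (bad URL, connection reset, ...).
        print("request failed:", e)
    return None
```

A loop like the one in the question could then call `rv = fetch(url)` and skip the iteration when `rv` is `None`.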
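I'm trying to extract links from Google search results. Inspect Element tells me that the part I'm interested in has class="r". The first result looks like this: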
<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
<a href="https://en.wikipedia.org/wiki/Chocolate"
ping="/url?sa=t&source=web&rct=j&url=https://en.wikipedia.org/wiki/Chocolate&ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM"
saprocessedanchor="true">
Chocolate - Wikipedia
</a>
</h3>
To extract the "href" I do:
import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements= googleSoup.select(".r a")
elements[0].get("href")
But I unexpectedly get:
'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'
whereas what I want is:
"https://en.wikipedia.org/wiki/Chocolate"
The "ping" attribute seems confusing. Any ideas?
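The href returned above is a redirect link of the form `/url?q=<target>&...`, so the target can be recovered from its `q` query parameter. This is a minimal sketch using only the standard library. The function name `extract_target` is an assumption made for illustration, and the approach relies on the redirect format shown in the question, which Google may change.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the destination of a '/url?q=...' redirect link, else href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href

href = ("/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U"
        "&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd")
print(extract_target(href))
```

Links that are already direct, such as the href seen in the browser's inspector, pass through unchanged.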
I'm just looking for a simple API where I can pass in a stock ticker and get back the full company name:
ticker('MSFT') would return "Microsoft"
I wrote a script to extract links from a website, and it works well. Here is the source code:
import requests
from bs4 import BeautifulSoup

Web = requests.get("https://www.google.com/")
soup = BeautifulSoup(Web.text, 'lxml')
for link in soup.findAll('a'):
    print(link['href'])

## Output
https://www.google.com.sa/imghp?hl=ar&tab=wi
https://maps.google.com.sa/maps?hl=ar&tab=wl
https://www.youtube.com/?gl=SA&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://calendar.google.com/calendar?tab=wc
https://www.google.com.sa/intl/ar/about/products?tab=wh
http://www.google.com.sa/history/optout?hl=ar
/preferences?hl=ar
https://accounts.google.com/ServiceLogin?hl=ar&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/search?safe=strict&ie=UTF-8&q=%D9%86%D9%88%D8%B1+%D8%A7%D9%84%D8%B4%D8%B1%D9%8A%D9%81&oi=ddle&ct=174786979&hl=ar&kgmid=/m/0562zv&sa=X&ved=0ahUKEwiq8feoiqDwAhUK8BQKHc7UD7oQPQgD
/advanced_search?hl=ar-SA&authuser=0
https://www.google.com/setprefs?sig=0_mwAqJUgnrqSouOmGk0UvVz7GgkY%3D&hl=en&source=homepage&sa=X&ved=0ahUKEwiq8feoiqDwAhUK8BQKHc7UD7oQ2ZgBCAU
/intl/ar/ads/
http://www.google.com/intl/ar/services/
/intl/ar/about.html
https://www.google.com/setprefdomain?prefdom=SA&prev=https://www.google.com.sa/&sig=K_e_0jdE_IjI-G5o1qMYziPpQwHgs%3D
/intl/ar/policies/privacy/
/intl/ar/policies/terms/
But the problem is that when I change the website to https://www.jarir.com/, it doesn't work:
import requests
from bs4 import BeautifulSoup

Web = requests.get("https://www.jarir.com/")
soup = BeautifulSoup(Web.text, 'lxml')
for link in soup.findAll('a'):
    print(link['href'])

# Output
#
The output is just #.
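One way to look into a result like this is to normalize every anchor's href before printing it. The following is a minimal sketch, not a diagnosis of this particular site. It assumes BeautifulSoup with the built-in "html.parser", and `collect_links` and the sample HTML are hypothetical, introduced only for illustration. It skips anchors without an href, drops placeholder "#" links, and resolves relative paths against the page URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_links(html, base_url):
    """Return absolute URLs for all <a href=...> tags, skipping '#' placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # href=True skips anchors with no href
        href = a["href"].strip()
        if not href or href.startswith("#"):
            continue  # empty or fragment-only placeholder link
        links.append(urljoin(base_url, href))  # resolve relative paths
    return links

# Made-up sample HTML, used only to exercise the function.
sample = '<a href="#">menu</a><a href="/intl/ar/ads/">ads</a><a>no href</a>'
print(collect_links(sample, "https://www.google.com/"))
```

If this returns an empty list for a real page, the HTML received by requests may contain only placeholder links, for example when the page builds its navigation with JavaScript or serves different content to non-browser clients. Printing `Web.status_code` and the start of `Web.text` is a simple way to see what was actually returned.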