Posts by MIT_THU

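Unable to fetch the profile name with Selenium after logging in with requests

I've written a script in Python to display only the name visible in my profile on SO. The thing is, I'd like to log in to the site using the requests module and, once logged in, fetch the profile name using Selenium. The bottom line: once I have the profile URL, I want Selenium to reuse that URL to fetch the profile name.

This working solution uses requests: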

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

req = requests.get(url)
sauce = BeautifulSoup(req.text,"lxml")
fkey = sauce.select_one("[name='fkey']")['value']
payload = {
    'fkey': fkey,
    'ssrc': 'head',
    'email': my_username,
    'password': my_password,
    'oauth_version':'', 
    'oauth_server':'' 
    }
res = requests.post(url,data=payload)
soup = BeautifulSoup(res.text,"lxml")
item = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
print(item)


What I'd like to do now is:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

driver = webdriver.Chrome()

req = requests.get(url)
sauce = BeautifulSoup(req.text,"lxml")
fkey = sauce.select_one("[name='fkey']")['value']
payload = {
    'fkey': fkey,
    'ssrc': …


python selenium web-scraping python-3.x python-requests

15 votes | 1 answer | 355 views

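Unable to fetch a certain item from a webpage using requests

I've created a script to scrape the name and email address from a webpage. When I run the script, I get the name as expected, but for the email I get aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3. The string I get in place of the email changes every time I run the script.

I've tried so far: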

import requests
from bs4 import BeautifulSoup

url = "https://www.seafoodsource.com/supplier-directory/Tri-Cor-Flexible-Packaging-Inc"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("[class$='-supplier-view-main-container'] > h1").text
email = soup.select_one("[class='__cf_email__']").get("data-cfemail")
print(f'{"Name: "}{name}\n{"Email: "}{email}')

Current output:

Name: Tri-Cor Flexible Packaging Inc
Email: aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3

Expected output:

Name: Tri-Cor Flexible Packaging Inc
Email: bryan@tri-cor.com


P.S. I'm not after any solution involving a browser simulator, such as Selenium.

How can I get that email from that page using requests?
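
The `__cf_email__` class and the `data-cfemail` attribute suggest Cloudflare's email obfuscation. Assuming that is the case here, the hex string can be decoded without a browser: the first byte acts as an XOR key for the remaining bytes. The function name `decode_cfemail` is made up for this sketch.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a Cloudflare-style obfuscated email given as a hex string."""
    data = bytes.fromhex(encoded)
    key = data[0]  # the first byte is the XOR key
    # XOR every remaining byte with the key to recover the characters
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


print(decode_cfemail("aeccdcd7cfc0eedadcc783cdc1dc80cdc1c3"))  # bryan@tri-cor.com
```

In the script above, this would be applied to the value returned by `.get("data-cfemail")`. Because the key is part of the string, the hex value can change on every request while still decoding to the same address.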

python beautifulsoup web-scraping python-3.x

10 votes | 2 answers | 360 views

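Unable to make my script auto-generate some values to be used within the payload

I've created a script that fetches HTML elements from a target page by sending two HTTPS requests in succession. My script does the job perfectly. However, I had to copy four values from Chrome dev tools to fill in four keys within the payload in order to send the final HTTP request that reaches the target page. This is the starting link, and below is the description of how to reach the target page.

Click the Find Hotel button (there is no need to change the dates if the check-out date is, by default, at least one day later than the check-in date).
Tick the box shown in the image below and press the Book Now button above it. It will now lead you to the target page automatically.
Once you reach the target page, titled Enter Guest Details, parse the HTML elements from there.

I've tried with (working one):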

import requests
from bs4 import BeautifulSoup

url = 'https://booking.discoverqatar.qa/SearchHandler.aspx?'
second_url = 'https://booking.discoverqatar.qa/PassengerDetails.aspx?'

params = {
    'Module':'H','txtCity':'','hdnCity':'2947','txtHotel':'','hdnHotel':'',
    'fromDate':'05/11/2019','toDate':'07/11/2019','selZone':'','minSelPrice':'',
    'maxSelPrice':'','roomConfiguration':'2|0|','noOfRooms':'1',
    'hotelStandardArray':'63,60,54,50,52,51','CallFrom':'','DllNationality':'-1',
    'HdnNoOfRooms':'-1','SourceXid':'MTEzNzg=','mdx':''
}

payload = {
    'CallFrom':'MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==',
    'Btype':'MToxNjozOCBQTXxBfDE6MTY6MzggUE0=',
    'PaxConfig':'MToxNjozOCBQTXwyfDB8MnwwfHwxOjE2OjM4IFBN',
    'usid':'MToxNjozOCBQTXxoZW54dmkzcWVnc3J3cXpld2lsa2ZwMm18MToxNjozOCBQTQ=='  
}

with requests.Session() as s:
    r = s.get(url,params=params,headers={"User-agent":"Mozilla/5.0"})
    res = s.get(second_url,params=payload,headers={
        "User-agent":"Mozilla/5.0",
        "Referer":r.url
        })
    soup = BeautifulSoup(res.text,'lxml')
    print(soup)


In the script above, I copied the values of CallFrom, Btype, PaxConfig and usid from dev tools and pasted them into the payload. …
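
The four hard-coded values look like Base64. Decoding them is a quick way to see what they contain; the sketch below only inspects the strings from the script and makes no claim about how the site generates or validates them. `build_token` is a hypothetical helper that re-creates the apparent `time|value|time` layout.

```python
import base64


def decode_token(token: str) -> str:
    """Return the plain text behind a Base64-encoded payload value."""
    return base64.b64decode(token).decode("utf-8")


def build_token(value: str, stamp: str) -> str:
    """Hypothetical: wrap a value in the observed time|value|time layout."""
    raw = f"{stamp}|{value}|{stamp}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


payload = {
    "CallFrom": "MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==",
    "Btype": "MToxNjozOCBQTXxBfDE6MTY6MzggUE0=",
    "PaxConfig": "MToxNjozOCBQTXwyfDB8MnwwfHwxOjE2OjM4IFBN",
    "usid": "MToxNjozOCBQTXxoZW54dmkzcWVnc3J3cXpld2lsa2ZwMm18MToxNjozOCBQTQ==",
}

for key, token in payload.items():
    print(key, "->", decode_token(token))
# The first line printed is: CallFrom -> 1:16:38 PM|B2C|1:16:38 PM
```

If the layout holds, the middle part of each value would still have to come from somewhere (the first response, a cookie, or the form state), and whether the server accepts freshly built tokens is untested here.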

python web-scraping python-3.x

10 votes | 1 answer | 279 views

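Trouble modifying the language option in the Selenium Python bindings

I've created a script in Python in combination with Selenium to scrape different app names from the Google Play Store, and they all come through when I execute the script. However, the results are being converted into my native language, which is not English.

How can I modify the language option in the Selenium Python bindings?

My attempt (tried to change the language option, but it failed):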

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://play.google.com/store'

chrome_options = Options()
chrome_options.add_argument("accept-language=en-US")

with webdriver.Chrome(options=chrome_options) as driver:
    driver.get(link)
    for item in wait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'.details a.title'))):
        print(item.text)

My output is in my native language, not English.
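
One option that does not depend on browser settings is to request the English version of the page through the URL. Google Play accepts an `hl` query parameter for the interface language; the sketch below swaps in that technique instead of a Chrome option, and the helper name `with_language` is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_language(url: str, lang: str = "en") -> str:
    """Return the URL with its hl (interface language) parameter set."""
    parts = urlsplit(url)
    # Keep any existing query parameters and override or add hl
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["hl"] = lang
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_language("https://play.google.com/store"))
# https://play.google.com/store?hl=en
```

The resulting string would replace `link` in the call to `driver.get(link)`.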

python selenium web-scraping python-3.x selenium-webdriver

9 votes | 1 answer | 868 views

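Unable to parse the username to make sure I'm logged in to a website

I've written a script in Python to log in to a website and parse the username to make sure I've really been able to log in. The approach I've tried below seems to get me there. However, to succeed I had to use hard-coded cookies taken from Chrome dev tools within the script.

I've tried: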

import requests
from bs4 import BeautifulSoup

url = 'https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fap-signin-handler&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0'
signin = 'https://secure.imdb.com/ap/signin'
mainurl = 'https://pro.imdb.com/'

with requests.Session() as s:
    res = s.get(url,headers={"User-agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'some username'
    payload['password'] = 'some password'

    s.post(signin,data=payload,headers={
        "User-agent":"Mozilla/5.0",
        "Cookie": 'adblk=adblk_yes; ubid-main=130-2884709-6520735; _msuuid_518k2z41603=95C56F3B-E3C1-40E5-A47B-C4F7BAF2FF5D; _fbp=fb.1.1574621403438.97041399; pa=BCYm5GYAag-hj1CWg3cPXjfv2X6NGPUp6kLguepMku7Yf0W9-iSTjgmVNGmQLwUfJ5XJPHqlh84f%0D%0Agrd2voq0Q7TR_rdXU4T1BJw-1a-DdvCNSVuWSm50IXJDC_H4-wM_Qli_%0D%0A; uu=BCYnANeBBdnuTg3UKEVGDiO203C7KR0AQTdyE9Y_Y70vpd04N5QZ2bD3RwWdMBNMAJtdbRbPZMpG%0D%0AbPpC6vZvoMDzucwsE7pTQiKxY24Gr4_-0ONm7hGKPfPbMwvI1NYzy5ZhTIyIUqeVAQ7geCBiS5NS%0D%0A1A%0D%0A; session-id=137-0235974-9052660; session-id-time=2205351554; session-token=jsvzgJ4JY/TCgodelKegvXcqdLyAy4NTDO5/iEvk90VA8qWWEPJpiiRYAZe3V0EYVFlKq590mXU0OU9XMbAzwyKqXIzPLzKfLf3Cc3k0g/VQNTo6roAEa5IxmOGZjWrJuhkRZ1YgeF5uPZLcatWF1y5PFHqvjaDxQrf2LZbgRXF5N7vacTZ8maK0ciJmQEjh; csm-hit=tb:8HH0DWNBDVSWP881GYKG+s-8HH0DWNBDVSWP881GYKG|1574631571950&t:1574631571952&adb:adblk_yes'
        })

    r = s.get(mainurl,headers={
        "Cookie": 'adblk=adblk_yes; ubid-main=130-2884709-6520735; _msuuid_518k2z41603=95C56F3B-E3C1-40E5-A47B-C4F7BAF2FF5D; _fbp=fb.1.1574621403438.97041399; pa=BCYm5GYAag-hj1CWg3cPXjfv2X6NGPUp6kLguepMku7Yf0W9-iSTjgmVNGmQLwUfJ5XJPHqlh84f%0D%0Agrd2voq0Q7TR_rdXU4T1BJw-1a-DdvCNSVuWSm50IXJDC_H4-wM_Qli_%0D%0A; csm-hit=tb:KV47B1QVKP4DNB3QGY95+b-NM69W1Y35R7ARV0639V5|1574631544432&t:1574631544432&adb:adblk_yes; session-id=137-0235974-9052660; session-id-time=2205351554; session-token="EsIzROiSTmFDfXd5jnBPIBOpYG9jAu7tiWXDF8R52sUw5jS6OjddfOOQB+ytCmq0K3UnXs9wKBvQtkB4aVNsXieVbRcIUrKf3iPnYeJchbOlShMjg+MR+O7IQgPKkw0BKihdYQ1YIl7KQS8VeLxZjtzJ5sj5ocnY72fCKdwq/fGOjfieFYbe9Km3a8h++1GpC738JbwcVdpTG08v1pjhQKifqPQXnqhcyVKhi8CD1qk="; x-main="C1KbtQgFFBAYfwttdRSrU5CpCe@Fn6SPHnBTY6dO2ppimt@u1P1L7G0PueQMn6X3"; at-main=Atza|IwEBICfS3UKNp2mwmbyUPY1QzjXRHMcL6fjv2ND7BDXsZ1G-qDPJKsLJXeU9gJOvRpWsofSpOJCyhnap-bIOWCutU6VMIS9bn3UkNVRP8WFVqrs-CLB5opLbrEx6YxVGQlfaxx54gzuuGO4D30z-AgBpGe64_bn0K1iLOT3P3i7S3nBzvP_0AopwKlbU7SRnE5m21cVfVK7bwbtfZO4cf7DrpGcaHK4dlY5jKHPzNx_AR4ypqsEBFbHon36N1j8foty6wLJhFP1gNCvs24mVCec24TRho5ZXFDYqhLB-dw9V3XY1eq7q1QNgtAdYkDSJ6Mq1nllFu59WqIVs1Y3lLEaxDUExLtCt-VQArpS_hZtZR8C_kevhV01jEhWg8RUQaCdYTMwZHwa778MiEOrrrdGqFnR5; sess-at-main="tWwUfkZLx+mDAPqZo+J6yJlnjqBJvYJ0oVMS6/NcIKQ="; …


python web-scraping python-3.x

9 votes | 1 answer | 748 views

Can't scrape all the company names from a webpage

I'm trying to parse all the company names from this webpage. There are around 2431 companies in there. However, the approach I've tried below fetches only 1000 results.

This is what I can see about the number of results in the response while going through dev tools:

hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431      <------------------------       
nbPages: 1
page: 0


How can I get the rest of the results using requests?

I've tried so far:

import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params …


python web-scraping python-3.x python-requests

9 votes | 2 answers | 606 views

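Proper usage of the second parameter within proxies

What is the correct way to use htt(p|ps) in the second parameter within proxies when I make a request to an https site? The proxy I've used below is just a placeholder.

When I try like this (it works):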

proxies = {
  'https': 'http://79.170.192.143:34394',
}

When I try like this (it also works):

proxies = {
  'https': 'https://79.170.192.143:34394',
}


Is the second htt(p|ps) within proxies just a placeholder? What if I make a request to an http site?
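
For context, in a requests-style `proxies` mapping the key is matched against the scheme of the URL being requested, while the scheme inside the value describes how the client connects to the proxy itself. The following is a deliberately simplified sketch of that lookup; `pick_proxy` is a hypothetical helper, and the real lookup in requests also considers host-specific keys and an `all` key.

```python
from urllib.parse import urlsplit


def pick_proxy(target_url, proxies):
    """Simplified: choose the proxy entry whose key equals the target scheme."""
    scheme = urlsplit(target_url).scheme
    return proxies.get(scheme)  # None means no proxy is configured for it


proxies = {
    "https": "http://79.170.192.143:34394",  # placeholder proxy from the question
}

print(pick_proxy("https://example.com/", proxies))  # http://79.170.192.143:34394
print(pick_proxy("http://example.com/", proxies))   # None
```

Under this reading, a request to an http site would need an `'http'` key of its own in the mapping; with only an `'https'` key present, such a request would not be routed through the proxy.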

python proxy web-scraping python-3.x

8 votes | 1 answer | 291 views

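Unable to parse the exact result from a webpage using requests

I've created a script in Python to parse two fields from a webpage: total revenue and its concerning date. The fields I'm after are generated by JavaScript. They are available in the page source within a JSON array. The following script can parse those two fields accordingly.

However, the problem is that the date visible on that page is different from the date available in the page source.

The date on that webpage looks like this

The date in the page source looks like this

There is clearly a one-day difference.

After visiting that webpage, click the Quarterly tab to see the results there:

I've tried: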

import re
import json
import requests

url = 'https://finance.yahoo.com/quote/GTX/financials?p=GTX'

res = requests.get(url)
data = re.findall(r'root.App.main[^{]+(.*);',res.text)[0]
jsoncontent = json.loads(data)
container = jsoncontent['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistoryQuarterly']['incomeStatementHistory']
total_revenue = container[0]['totalRevenue']['raw']
concerning_date = container[0]['endDate']['fmt']
print(total_revenue,concerning_date)

Result I'm getting (revenue in million):

802000000 2019-06-30

Result I wish to get:

802000000 2019-06-29

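When I try with the ticker AAPL I get the exact date, so subtracting or adding a day is not an option.

How can I get the exact date from that site?

By the way, I know how to get them using Selenium, so I'd like to stick to requests only.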
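
One possible explanation for a one-day shift, offered as a guess rather than a confirmed cause, is that the page formats an epoch timestamp in the viewer's time zone while the `fmt` string in the source is in UTC. The sketch below shows the effect with a made-up timestamp (midnight UTC on 2019-06-30) and a fixed UTC-4 offset standing in for US Eastern daylight time; neither value is taken from the page.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw value: 2019-06-30 00:00:00 UTC as seconds since the epoch
raw_end_date = 1561852800

# Fixed offset used as a stand-in for US Eastern daylight time (assumption)
eastern = timezone(timedelta(hours=-4))

utc_date = datetime.fromtimestamp(raw_end_date, tz=timezone.utc).date()
local_date = datetime.fromtimestamp(raw_end_date, tz=eastern).date()

print(utc_date, local_date)  # 2019-06-30 2019-06-29
```

If the JSON carries a raw timestamp next to `fmt`, comparing the two conversions for both GTX and AAPL would show whether this applies.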

python json web-scraping python-3.x python-requests

8 votes | 1 answer | 352 views

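Unable to log in to Instagram using requests

I'm trying to log in to Instagram using the requests library. The following script used to work, but it no longer does. The password field is now encrypted (I checked dev tools while logging in manually).

I've tried: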

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

payload = {
    'username': 'someusername',
    'password': 'somepassword',
    'enc_password': '',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"",r.text)[0]
    r = s.post(login_url,data=payload,headers={
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken":csrf
    })
    print(r.status_code)
    print(r.url)

What I found using dev tools:

username: someusername
enc_password: #PWD_INSTAGRAM_BROWSER:10:1592421027:ARpQAAm7pp/etjy2dMjVtPRdJFRPu8FAGILBRyupINxLckJ3QO0u0RLmU5NaONYK2G0jQt+78BBDBxR9nrUsufbZgR02YvR8BLcHS4uN8Gu88O2Z2mQU9AH3C0Z2NpDPpS22uqUYhxDKcYS5cA==
queryParams: {"oneTapUsers":"[\"36990119985\"]"}
optIntoOneTap: false

How can I log in to Instagram using requests?
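
To see what the browser sends, the `enc_password` value from dev tools can be split on its colons. The field names below (`tag`, `version`, `timestamp`, `body`) are my own reading of the string, not documented names, and the sketch only inspects the value; it does not reproduce the encryption.

```python
from datetime import datetime, timezone


def split_enc_password(value: str) -> dict:
    """Split an enc_password string into its four colon-separated parts."""
    tag, version, stamp, body = value.split(":", 3)
    return {
        "tag": tag,
        "version": int(version),
        "timestamp": int(stamp),  # looks like seconds since the epoch
        "body": body,             # Base64-looking encrypted payload
    }


sample = (
    "#PWD_INSTAGRAM_BROWSER:10:1592421027:"
    "ARpQAAm7pp/etjy2dMjVtPRdJFRPu8FAGILBRyupINxLckJ3QO0u0RLmU5NaONYK2G0jQt"
    "+78BBDBxR9nrUsufbZgR02YvR8BLcHS4uN8Gu88O2Z2mQU9AH3C0Z2NpDPpS22uqUYhxDKcYS5cA=="
)

parts = split_enc_password(sample)
when = datetime.fromtimestamp(parts["timestamp"], tz=timezone.utc)
print(parts["tag"], parts["version"], when.isoformat())
# #PWD_INSTAGRAM_BROWSER 10 2020-06-17T19:10:27+00:00
```

The third part matching the time of the login attempt suggests the value is rebuilt on every submission, which would explain why a payload with an empty `enc_password` no longer works.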

python beautifulsoup python-3.x instagram

8 votes | 3 answers | 3522 views

Trouble parsing tabular items from a graph located in a website

I'm trying to extract the tabular content shown on a graph in a webpage. The content of those tables is only visible when the cursor hovers over the area. One such table is this one.

Webpage address

The graph containing the tables is titled EPS consensus revisions : last 18 months.

I've tried so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.marketscreener.com/SUNCORP-GROUP-LTD-6491453/revisions/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)
for items …


python selenium web-scraping python-3.x selenium-webdriver

7 votes | 1 answer | 212 views

Tag statistics

python-3.x ×10

web-scraping ×9

python-requests ×3

beautifulsoup ×2

selenium-webdriver ×2

json ×1
