Python web scraping of JavaScript-generated content

Nic*_*ick 7 javascript python web-scraping scrape

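I'm trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so a script can work out the URL without having to interact with the web page. I have tried selenium, bs4, etc., but I can't get the text inside the text box.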

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
import urllib.request
from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not guess one and emit a warning
text = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(text)

Can anyone suggest a way to return the BibTeX citation as a string (or otherwise) in Python?

ale*_*cxe 10

You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fill in the BibTeX citation. Simulate it, for example, with requests:

import requests

bibtex_id = '10.1007/s00425-007-0544-9'

url = "http://www.doi2bib.org/#/doi/{id}".format(id=bibtex_id)
xhr_url = 'http://www.doi2bib.org/doi2bib'

with requests.Session() as session:
    session.get(url)

    response = session.get(xhr_url, params={'id': bibtex_id})
    # .text is the decoded str; .content would print a bytes literal in Python 3
    print(response.text)

Prints:

@article{Burgert_2007,
    doi = {10.1007/s00425-007-0544-9},
    url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
    year = 2007,
    month = {jun},
    publisher = {Springer Science $\mathplus$ Business Media},
    volume = {226},
    number = {4},
    pages = {981--987},
    author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
    title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
    journal = {Planta}
}
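To illustrate why the original urllib attempt could not see the citation: everything after `#` in the page URL is a fragment, which the browser keeps on the client side and never sends to the server. The sketch below uses only the standard library to show what is actually requested and how the XHR URL can be built from the DOI. The helper names (`request_target`, `doi_from_page_url`, `build_xhr_url`) are hypothetical, and the `doi2bib` endpoint with its `id` parameter is taken from the code above and may have changed since.

```python
from urllib.parse import urlsplit, urlencode

PAGE_URL = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
# Endpoint as used in the answer above; assumed, the site may have changed it.
XHR_URL = "http://www.doi2bib.org/doi2bib"


def request_target(url):
    """Return the path (plus query) an HTTP client sends; the fragment is dropped."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def doi_from_page_url(url):
    """Extract the DOI from a '#/doi/<DOI>' fragment (hypothetical helper)."""
    fragment = urlsplit(url).fragment
    prefix = "/doi/"
    if not fragment.startswith(prefix):
        raise ValueError("unexpected fragment: %r" % fragment)
    return fragment[len(prefix):]


def build_xhr_url(doi):
    """Build the URL of the XHR request, with the DOI percent-encoded."""
    return XHR_URL + "?" + urlencode({"id": doi})


doi = doi_from_page_url(PAGE_URL)
print(request_target(PAGE_URL))  # only "/" reaches the server
print(doi)
print(build_xhr_url(doi))
```

Since the server only ever receives a request for `/`, it returns the same application shell for every DOI, and the citation is filled in later by JavaScript. That is why requesting the XHR URL directly works where parsing the page HTML does not.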

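You can also solve it with selenium. The key trick here is to use an explicit wait so that the script waits for the citation to become visible: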

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9')

element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//pre[@ng-show="bib"]')))
print(element.text)

# quit() closes the window and also shuts down the driver process
driver.quit()

Prints the same as the solution above.

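  • Sure: open the browser developer tools -> "Network" tab. Go to the site and look at all the requests sent to the server while the page loads. Among others, you will see the one I mentioned. Hope that helps. (2 upvotes)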