When BeautifulSoup is used with the HTML parser, tag names are converted to lowercase. How do we get the same behaviour when using the LXML parser? In the case below I cannot print any output, but if I parse with the html parser it works fine. Can anyone help me?
html_doc = """
<html><HEAD><title>The Dormouse's story</title></HEAD>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "xml")
print(soup.head)
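For reference, a minimal sketch of the difference (assuming lxml is installed, since the "xml" feature needs it): the XML builder is case-sensitive and keeps <HEAD> exactly as written, so the lowercase lookup soup.head finds nothing, while the HTML builders normalise tag names to lowercase.

from bs4 import BeautifulSoup

html_doc = """
<html><HEAD><title>The Dormouse's story</title></HEAD>
<p class="title"><b>The Dormouse's story</b></p>
</html>
"""

# HTML tree builders lowercase tag names, so soup.head matches <HEAD>.
html_soup = BeautifulSoup(html_doc, "html.parser")
print(html_soup.head)

# The XML builder preserves case, so the tag must be looked up as written.
xml_soup = BeautifulSoup(html_doc, "xml")
print(xml_soup.head)          # None: there is no lowercase <head> in the XML tree
print(xml_soup.find("HEAD"))  # matches the uppercase tag

If lowercased tag names are what you want, parsing with "lxml" (lxml's HTML mode) instead of "xml" gives the same lowercasing as html.parser.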
We have successfully parsed a number of sites with Beautiful Soup, but a few sites give us trouble. One example is this page:
We are feeding Beautiful Soup the exact page source, but it returns a stunted HTML string, even though no error is raised...
Code:
soup = BeautifulSoup(site_html)
print(str(soup.html))
Result:
<html class="no-js" lang="en"> <!--<![endif]--> </html>
I have tried to work out what is tripping it up, but looking at the HTML source nothing stands out to me. Does anyone have any insight?
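One thing worth trying, as a sketch (assuming lxml and html5lib are installed, and that site_html holds the page source from the code above): different tree builders recover differently from broken markup, and html5lib in particular parses the way a browser does, which often rescues pages that other builders truncate.

from bs4 import BeautifulSoup

for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(site_html, parser)
    # Compare how much of the document each builder managed to keep.
    print(parser, len(str(soup.html)))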
The HTML code:
<td class="_480u">
<div class="clearfix">
<div>
Female
</div>
</div>
</td>
I want the value "Female" as the output.
I have tried bs.findAll('div',{'class':'clearfix'}) and bs.findAll('tag',{'class':'_480u'}), but those classes appear all over my HTML, so the output is a huge list. I would like to combine {td --> class = ".." and div --> class = ".."} in a single search so that it returns just "Female". How can I do that?
Thanks
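A minimal sketch of one way to chain the two class constraints, using the snippet above:

from bs4 import BeautifulSoup

html = """
<td class="_480u">
  <div class="clearfix">
    <div>
      Female
    </div>
  </div>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search step by step: the td first, then the div inside it.
cell = soup.find("td", {"class": "_480u"})
print(cell.find("div", {"class": "clearfix"}).div.get_text(strip=True))  # Female

# The same thing as a single CSS selector.
print(soup.select_one("td._480u div.clearfix div").get_text(strip=True))  # Female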
I want to scrape data from a web page. My code looks like this:
grad = s.get('https://www.njuskalo.hr/prodaja-kuca/zagreb',headers=header, proxies=proxyDict)
city_soup = BeautifulSoup(grad.text, "lxml")
kvarts = city_soup.find_all(id="locationId_level_1")
print(kvarts[0])
print("++++++++++++++++++++++=")
for kvart in kvarts[0]:
    print(kvart)
As a result I get:
<option data-url-alias="/brezovica" value="1247">Brezovica</option>
<option data-url-alias="/crnomerec" value="1248">Črnomerec</option>
<option data-url-alias="/donja-dubrava" value="1249">Donja Dubrava</option>
From there I need to extract data-url-alias and value. How can I do that?
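A sketch of pulling both attributes out of each option, reusing the kvarts result from the code above:

for option in kvarts[0].find_all("option"):
    # Tag attributes behave like a dict; .get avoids a KeyError if one is missing.
    alias = option.get("data-url-alias")
    value = option.get("value")
    print(alias, value, option.get_text(strip=True))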
I am scraping with BeautifulSoup and the html parser, and I have selected the part of the HTML I want to work with and saved it as "containers".
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl
my_url = 'https://www._________.co.uk/'
context = ssl._create_unverified_context()
uClient = uReq(my_url, context=context)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"row"})
I run into a challenge when several tags sit next to one another inside a span.
I can reach them using
company_string = container.span.find_all("b")
which returns the following:
[<b>Company</b>, <b>Name</b>, <b>Limited</b>]
How do I discard the tags and combine them into a single string, so that the output is "Company Name Limited"?
The original HTML is here:
<span class="company">
<a href="/cmp/Company-Name-Limited" onmousedown="this.href =
appendParamsOnce(this.href, 'xxxx')" rel="noopener" target="_blank">
<b>Company</b> <b>Name</b> <b>Limited</b>
</a>
</span>
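A sketch of two ways to flatten those <b> tags into one string, assuming container is one of the div.row elements from the code above:

span = container.find("span", {"class": "company"})

# get_text joins the text of every nested tag; the separator puts the spaces back.
print(span.get_text(separator=" ", strip=True))  # Company Name Limited

# Or join the individual <b> tags explicitly.
parts = [b.get_text(strip=True) for b in span.find_all("b")]
print(" ".join(parts))  # Company Name Limited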
First of all I have to say that I am very new to web scraping with Python. I am trying to scrape data with these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl ='https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page, but a different HTML page that says: content scrapers behaving badly, please use robots.txt; your IP has been rate limited.
To check the problem I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Then I get 429 (Too Many Requests).
How should I deal with this? Does it mean I cannot print the page's HTML, and that something is blocking me from scraping anything on the page? Should I rotate IP addresses?
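Rotating IPs is usually the last resort; the 429 simply means requests are arriving faster than the site allows. A sketch of a politer approach, reusing baseurl from the code above and assuming the site's robots.txt permits fetching the page at all: identify the client, slow down, and honour the Retry-After header when the server sends one.

import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper)"}  # placeholder UA string

def polite_get(url, retries=3):
    for attempt in range(retries):
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code != 429:
            return response
        # Wait as long as the server asks, or back off exponentially.
        retry_after = response.headers.get("Retry-After")
        wait = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** (attempt + 1)
        time.sleep(wait)
    return response

page = polite_get(baseurl)
print(page.status_code)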
I am trying to collect a dataset of time-series data for FIFA Ultimate Team players from futbin.com. I found a script on GitHub, https://github.com/darkyin87/futbin-scraper, which, given the player ids, can scrape a player's current prices:
import requests
import json
domain = 'https://www.futbin.com'
version = 19
page = 'playerPrices'
player_ids = {
    'Arturo Vidal': 181872,
    'Pierre-Emerick Aubameyang': 188567,
    'Robert Lewandowski': 188545,
    'Jerome Boateng': 183907,
    'Sergio Ramos': 155862,
    'Antoine Griezmann': 194765,
    'David Alaba': 197445,
    'Paulo Dybala': 211110,
    'Radja Nainggolan': 178518
}

def fetch_prices():
    ret_val = {}
    for name, id in player_ids.iteritems():
        url = "%s/%s/%s?player=%s" % (domain, version, page, id)
        response = requests.get(url)
        data = response.json()
        ret_val[name] = data[str(id)]['prices']['ps']['LCPrice']
    return ret_val
if __name__ …

I am trying to scrape various sites with BeautifulSoup in Python. Say I have the following HTML excerpt:
<div class="member_biography">
<h3>Biography</h3>
<span class="sub_heading">District:</span> AnyState - At Large<br/>
<span class="sub_heading">Political Highlights:</span> AnyTown City Council, 19XX-XX<br/>
<span class="sub_heading">Born:</span> June X, 19XX; AnyTown, Calif.<br/>
<span class="sub_heading">Residence:</span> Some Town<br/>
<span class="sub_heading">Religion:</span> Episcopalian<br/>
<span class="sub_heading">Family:</span> Wife, Some Name; two children<br/>
<span class="sub_heading">Education:</span> Some State College, A.A. 19XX; Some Other State College, B.A. 19XX<br/>
<span class="sub_heading">Elected:</span> 19XX<br/>
</div>
I need the results in the following format:
District: AnyState - At Large
Political Highlights: AnyTown City Council, 19XX-XX
Born: June X, 19XX; AnyTown, Calif.
Residence: Some Town
Religion: Episcopalian …
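For markup like the excerpt above, a minimal sketch that pairs each sub_heading label with the text node that follows it (assuming the snippet is stored in html, and that a plain text node does follow each span, as it does here):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for label in soup.select("div.member_biography span.sub_heading"):
    # The value is the plain text sitting between this span and the <br/>.
    value = label.next_sibling
    print(label.get_text(strip=True), value.strip())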
Suppose I have a DataFrame that I want to export as a PDF. In the DataFrame I have the following columns: Code, Name, Price, Net, Sales. Each row is a product.
I want to add an image, which I can obtain with BeautifulSoup, to each product in that DataFrame. Is there any way to add images to a DataFrame? Not the links, just the product images.
Specifically, I want something like this:
Code:
import pandas as pd

df = pd.DataFrame([['A231', 'Book', 5, 3, 150],
                   ['M441', 'Magic Staff', 10, 7, 200]],
                  columns=['Code', 'Name', 'Price', 'Net', 'Sales'])
# Suppose these are the links to the images I want to add to the DataFrame
images = ['Link 1', 'Link 2']
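There is no native image cell type in pandas, but a common workaround (a sketch; the URLs below are placeholders for whatever BeautifulSoup scraped) is to store <img> tags as strings and render the table to HTML with escaping turned off, then convert that HTML to PDF with a separate tool such as weasyprint:

import pandas as pd

df = pd.DataFrame([['A231', 'Book', 5, 3, 150],
                   ['M441', 'Magic Staff', 10, 7, 200]],
                  columns=['Code', 'Name', 'Price', 'Net', 'Sales'])
images = ['https://example.com/book.png', 'https://example.com/staff.png']  # placeholder links

# Keep the markup as a string column; escape=False stops to_html from mangling it.
df['Image'] = ['<img src="{}" width="60"/>'.format(url) for url in images]
with open('products.html', 'w') as f:
    f.write(df.to_html(escape=False))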
Whenever I run the program in CMD I get an indentation error. The indentation of the whole program looks perfect to me, so I have no idea why I am getting the error.
The CMD error:
scraper9.py", line 50
browser.get(url2)
^
IndentationError: unexpected unindent
I have deleted all of the indentation completely and re-indented the lines, line by line, to get to the current iteration, but I still get the error.
import os
import sys
import csv
from bs4 import BeautifulSoup
import urllib2
import xlsxwriter
from selenium import webdriver
reload(sys)
sys.setdefaultencoding("utf8")
key_stats_on_main = ["Market Cap", "PE Ratio (TTM)"]
key_stats_on_stat = ["Enterprise Value", "Trailing P/E"]
stocks_arr =[]
pfolio_file = open("tickers.csv", "r")
for line in pfolio_file:
    indv_stock_arr = line.strip().split(",")
    stocks_arr.append(indv_stock_arr)
print(stocks_arr)
browser = webdriver.PhantomJS()
stock_info_arr = []
for stock in stocks_arr:
    stock_info = []
    ticker = stock[0]
    stock_info.append(ticker)
    url = "https://finance.yahoo.com/quote/{0}?p={0}".format(ticker)
    url2 = "https://finance.yahoo.com/quote/{0}/key-statistics?p={0}".format(ticker)
    browser.get(url)
    innerHTML = browser.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, "html.parser")
    for stat in key_stats_on_main:
        page_stat1 = soup.find(text = stat)
        try: …
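For what it's worth, "unexpected unindent" on Python 2 is very often raised when a try: block has no except or finally clause lined up with it before the next dedented statement (mixing tabs and spaces is the other usual suspect). The code above is cut off right after a try:, so that would fit the error pointing at browser.get(url2). A self-contained sketch of the shape the block needs, with hypothetical stand-in data:

lookup_table = {"Market Cap": "1.2T"}       # hypothetical stand-in data
values = []
for stat in ["Market Cap", "PE Ratio (TTM)"]:
    try:
        values.append(lookup_table[stat])   # may raise KeyError
    except KeyError:
        # The except (or finally) must sit at the same indentation as its try;
        # leaving it out, or indenting it differently, makes Python report
        # an indentation error at the next dedented statement.
        values.append("N/A")
print(values)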