In BeautifulSoup you can search with soup.find_all. For example, I searched a page using:
soup.find_all("tr", "cat-list-row1")
Obviously, this returns every tr element with the class cat-list-row1. I would like to know whether it is possible to search the whole page for any element with the class "cat-list-row1", rather than restricting the search to tr elements.
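Yes: the tag-name argument to find_all is optional, so leaving it out (or passing True) matches every element. A minimal sketch:

# any tag whose class includes "cat-list-row1"
rows = soup.find_all(class_="cat-list-row1")
# equivalent positional form: soup.find_all(True, "cat-list-row1")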
I am trying to write the BeautifulSoup output to a file using the following code:
soup = BeautifulSoup(con.content)
f = open('/*/*/Desktop/littletext.rtf','w')
f.write(str(soup))
f.close()
I get this error:
Traceback (most recent call last):
  File "/*/*/Desktop/test123.py", line 10, in <module>
    f.write(soup)
TypeError: must be str, not BeautifulSoup
Any ideas how to fix this? I tried converting soup to a string, but it did not help: f.write(str(soup)).
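str(soup) is the right conversion in Python 3, and the traceback still shows f.write(soup) at line 10, which suggests the file that actually ran was the unedited version. A sketch that also pins down the encoding (the path is shortened here):

with open('littletext.rtf', 'w', encoding='utf-8') as f:
    f.write(str(soup))   # or soup.prettify() for indented output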
I have a web scraper that runs on my own machine, and I want to migrate it to PythonAnywhere, but now that I have moved it over it no longer works.
Specifically, send_keys does not seem to work: after the code below executes I never reach the next web page, so an AttributeError is raised.
My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import csv
import time
# Lists for functions
parcel_link = []
token = []
csv_output = []
# main scraping function
def getLinks(link):
    # Open web browser and get url - 3 second time delay.
    driver.get(link)
    time.sleep(3)
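    # Hedged sketch of the missing steps (the original code is truncated just
    # below; the locator and input value here are assumptions, not from the
    # original post). On a headless host like PythonAnywhere, an explicit wait
    # before the element is used, plus an explicit Keys.RETURN, is often more
    # reliable than a fixed sleep followed by send_keys alone:
    #
    # inputElement = WebDriverWait(driver, 10).until(
    #     EC.presence_of_element_located((By.ID, "search-input")))
    # inputElement.send_keys(parcel_id)
    # inputElement.send_keys(Keys.RETURN)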
    inputElement …

I have data of this form:
<preference>
<name>throttle_scan</name>
<value>yes</value>
</preference>
<preference><name>listen_address</name>
<value>0.0.0.0</value>
</preference>
These are essentially name/value pairs that I want to extract with BeautifulSoup.
I managed to extract a list for each preference:
soup = bs4.BeautifulSoup(string_with_xml, 'html.parser')
for p in soup.find_all('preference'):
    c = p.contents
    print(c)
which gives one list per preference:
[<name>throttle_scan</name>, '\n', <value>yes</value>, '\n']
[<name>listen_address</name>, '\n', <value>0.0.0.0</value>, '\n']
How do I drill further into this list? Should I build a new soup from it,
soup = bs4.BeautifulSoup(''.join(c), 'html.parser')
and then search that for name and value?
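There is no need to re-parse: each preference Tag can itself be searched. A minimal sketch of pulling the pairs out directly:

for p in soup.find_all('preference'):
    name = p.find('name').get_text()
    value = p.find('value').get_text()
    print(name, value)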
Here is a sample table that I want to parse:
<table>
  <tr>
    <td>1-1</td>
    <td>1-2</td>
  </tr>
  <tr>
    <td>2-1</td>
    <td>2-2</td>
  </tr>
  <tr>
    <td>3-1</td>
    <td>3-2</td>
  </tr>
</table>
I want to find the last tr element in this table. What is the canonical way to do this with BeautifulSoup?
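find_all returns a plain Python list, so indexing with -1 is the usual idiom. A minimal sketch:

table = soup.find('table')
last_row = table.find_all('tr')[-1]
print(last_row)   # the <tr> containing 3-1 and 3-2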
I have a project in which I have to scrape all the ratings for 50 actors/actresses, which means I have to visit and scrape 3500 web pages. This is taking longer than expected, and I am looking for a way to speed it up. I know there are frameworks like Scrapy, but I would like to work without any additional modules. Is there a quick and easy way to rewrite my code, or would that take too much time? My code is below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def getMovieRatingDf(movie_links):
    counter = -1
    movie_name = []
    movie_rating = []
    movie_year = []
    for movie in movie_links.tolist()[0]:
        counter += 1
        request = requests.get('http://www.imdb.com/' + movie_links.tolist()[0][counter])
        film_soup = BeautifulSoup(request.text, 'html.parser')
        if (film_soup.find('div', {'class': 'title_wrapper'}).find('a').text).isdigit():
            movie_year.append(int(film_soup.find('div', {'class': 'title_wrapper'}).find('a').text))
            # scrape the name and year of the current film
            movie_name.append(list(film_soup.find('h1'))[0])
            try:
                movie_rating.append(float(film_soup.find('span', {'itemprop': 'ratingValue'}).text))
            except AttributeError:
                movie_rating.append(-1)
        else:
            continue
    rating_df = pd.DataFrame(data={"movie name": movie_name, "movie rating": movie_rating, "movie year": movie_year})
    rating_df = rating_df.sort_values(['movie rating'], ascending=False)
    return rating_df
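Since the time is dominated by the 3500 HTTP round-trips, the standard library alone can parallelize them with threads; a hedged sketch (fetch and urls are illustrative names, not from the original code):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch(url):
    # the blocking network I/O releases the GIL, so threads give a
    # near-linear speedup for this kind of workload
    return BeautifulSoup(requests.get(url).text, 'html.parser')

with ThreadPoolExecutor(max_workers=20) as pool:
    film_soups = list(pool.map(fetch, urls))  # urls: the 3500 film page URLs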
I am trying to do this:
req = urllib.request.Request("http://en.wikipedia.org/wiki/Philosophy")
content = urllib.request.urlopen(req).read()
soup = bs4.BeautifulSoup(content, "html.parser")
content = strip_brackets(soup.find('div', id="bodyContent").p)
for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):
    print(link.get("href"))
If I instead write the loop like this:
for link in soup.findAll("a"):
    print(link.get("href"))
I no longer get the error, but I want to strip the brackets from the content first, and only then collect all of its links.
The error (line 36 is the line with the for loop):
Traceback (most recent call last):
  File "....py", line 36, in <module>
    for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):
  File "C:\Users\...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
What exactly am I doing wrong?
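The traceback shows bs4 treating content as a file-like object whose read attribute is None, so whatever strip_brackets returns is evidently not a plain string. Converting the tag to text first avoids that; a sketch (this strip_brackets is a hypothetical regex version, since the original is not shown):

import re

def strip_brackets(markup):
    # hypothetical helper: drop parenthesised spans from the markup text
    return re.sub(r'\([^)]*\)', '', markup)

paragraph = soup.find('div', id="bodyContent").p
content = strip_brackets(str(paragraph))   # now a plain str
for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):
    print(link.get("href"))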
I want to scrape the contents of a website using the BeautifulSoup library.
Code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_http_response = urlopen("http://www.airlinequality.com/airport-reviews/jeddah-airport/")
data = html_http_response.read()
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Output:
<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=9-57435048-0%200NNN%20RT%281512733380259%202%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&incident_id=466002040110357581-305794245507288265&edet=12&cinfo=04000000" width="100%">
   Request unsuccessful. Incapsula incident ID: 466002040110357581-305794245507288265
  </iframe>
 </body>
</html>
When I inspect the content from a browser, the body contains this iframe tag instead of the displayed content.
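That response is Incapsula's bot protection, not the page itself. Sending browser-like headers is a common first attempt, though Incapsula usually requires JavaScript execution, i.e. a real browser driven by Selenium; a hedged sketch of the header approach:

from urllib.request import Request, urlopen

# assumption: a browser-like User-Agent sometimes satisfies the first
# check, but it is often not enough against Incapsula
req = Request("http://www.airlinequality.com/airport-reviews/jeddah-airport/",
              headers={"User-Agent": "Mozilla/5.0"})
data = urlopen(req).read()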
I want to scrape the number of likes on a Facebook page. Using BeautifulSoup, this is what I have so far:
import requests
from bs4 import BeautifulSoup

user = 'LazadaMalaysia'
url = 'https://www.facebook.com/' + user
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
f = soup.find('div', attrs={'class': '_4bl9'})
The output I receive for f is as follows:
<div class="_4bl9 _3bcp"><div aria-keyshortcuts="Alt+/" aria-label="Pembantu Navigasi" class="_6a _608n" id="u_0_8" role="menubar"><div class="_6a uiPopover" id="u_0_9"><a aria-expanded="false" aria-haspopup="true" class="_42ft _4jy0 _55pi _2agf _4o_4 _63xb _p _4jy3 _517h _51sy" href="#" id="u_0_a" rel="toggle" role="button" style="max-width:200px;"><span class="_55pe">Bahagian-bahagian pada halaman ini</span><span class="_4o_3 _3-99"><i class="img sp_m7lN5cdLBIi sx_d3bfaf"></i></span></a></div><div class="_6a _3bcs"></div><div class="_6a mrm uiPopover" id="u_0_b"><a aria-expanded="false" aria-haspopup="true" class="_42ft _4jy0 _55pi _2agf _4o_4 _3_s2 _63xb _p _4jy3 _4jy1 selected _51sy" href="#" …
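That div is the page's navigation menu, not the like count, and Facebook's markup changes too often to scrape reliably. A hedged sketch using the Graph API instead (the token is a placeholder; fan_count is the documented field for a Page's likes):

import requests

token = "YOUR_ACCESS_TOKEN"  # placeholder: requires a Facebook app access token
resp = requests.get("https://graph.facebook.com/LazadaMalaysia",
                    params={"fields": "fan_count", "access_token": token})
print(resp.json().get("fan_count"))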
I have written a script in Python to fetch a couple of attributes from each container on a web page: the titles and their corresponding email addresses. When I run my script it scrapes the titles, but for the email address it only scrapes the text attached to the SEND EMAIL button. How can I get at the email addresses themselves? They must exist, because when I press the SEND EMAIL button an email gets sent. Any help with this would be highly appreciated.
Link to the website
This is what I have tried so far:
import requests
from bs4 import BeautifulSoup

URL = "use_above_link"

def Get_Leads(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for items in soup.select(".media"):
        title = items.select_one(".item-name").text.strip()
        try:
            email = items.select_one("a[alt^='Contact']").text.strip()
        except:
            email = ""
        print(title, email)

if __name__ == '__main__':
    Get_Leads(URL)
The results I get look like:
Singapore Immigration Specialist SEND EMAIL
Faithful+Gould Pte Ltd SEND EMAIL
PsyAsia International SEND EMAIL
Activpayroll SEND EMAIL
Precursor …
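If the addresses are present in the served HTML at all, they usually sit in the anchor's href as a mailto: link rather than in its visible text; a hedged sketch (the selector is an assumption — if the address is injected by JavaScript when the button is clicked, requests will never see it and a browser tool such as Selenium is needed):

for items in soup.select(".media"):
    title = items.select_one(".item-name").text.strip()
    mail_link = items.select_one("a[href^='mailto:']")  # assumed location of the address
    email = mail_link['href'].replace('mailto:', '') if mail_link else ""
    print(title, email)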