I'm using bs4 (BeautifulSoup) in Python 3.2, and here is my code:
from urllib import urlopen
from bs4 import bs4
import re
webpage = urlopen(‘http://www.azlyrics.com/lyrics/kanyewest/workoutplan.html’).read()
It gives:
webpage = urlopen(‘http://www.azlyrics.com/lyrics/kanyewest/workoutplan.html’).read()
^
SyntaxError: invalid character in identifier
How can I fix this?
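The caret points at a curly quote (' … ') pasted in from a word processor or web page; Python only accepts straight ASCII quotes around string literals. A corrected sketch, which also fixes the two import problems the code would hit next (in Python 3, urlopen lives in urllib.request, and the class bs4 exports is BeautifulSoup, not bs4):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# straight quotes around the URL, not typographic ones
webpage = urlopen('http://www.azlyrics.com/lyrics/kanyewest/workoutplan.html').read()
soup = BeautifulSoup(webpage, 'html.parser')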
I'm working on a project that parses HTML pages. It works on a site internal to my company, but I've changed the example so that you can try it.
I get the source of an HTML page and search for a certain tag. Then I want to extract a substring of this tag, but it doesn't work: Python returns None. Below is my code, with Python's output in the comments:
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://www.resto.be/restaurant/liege/4000-liege/8219-le-bar-a-gouts/")
page_source = response.read()
soup = BeautifulSoup(page_source)
name = soup.find_all("meta", attrs={"itemprop":"name"})
print(name[0])
# <meta content="LE BAR A GOUTS" itemprop="name"/>
print(name[0].find("<meta"))
# none
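find called on a Tag is BeautifulSoup's tree search, not string search: it looks for a child tag whose name is literally "<meta", finds nothing, and returns None. If the goal is the text stored in the tag, read its attributes directly; string-style .find only applies once the tag is serialized. A sketch:

tag = name[0]
print(tag["content"])          # LE BAR A GOUTS  (attribute lookup on the Tag)
print(tag.get("content"))      # same, but returns None instead of raising if absent
print(str(tag).find("<meta"))  # 0  (plain string .find on the serialized tag)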
I'm using a Raspberry Pi 1B+ with Debian Linux:
Linux rbian 3.18.0-trunk-rpi #1 PREEMPT Debian 3.18.5-1~exp1+rpi16 (2015-03-28) armv6l GNU/Linux
As part of a larger Python program I used the following code:
#!/usr/bin/env python
import time
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
_url="http://xml.buienradar.nl/"
s1 = time.time()
req = Request(_url)
print "Request = {0}".format(time.time() - s1)
s2 = time.time()
response = urlopen(req)
print "URLopen = {0}".format(time.time() - s2)
s3 = time.time()
output = response.read()
print "Read = {0}".format(time.time() - s3)
s4 = time.time()
soup = BeautifulSoup(output)
print "Soup (1) = …Run Code Online (Sandbox Code Playgroud) 我在文件中有一些网页链接article_links.txt,我想逐个打开,提取文本,然后打印出来.我的代码是:
I have some web page links in a file, article_links.txt, which I want to open one by one, extract the text from, and print out. My code is:

import requests
from inscriptis import get_text
from bs4 import BeautifulSoup
links = open(r'C:\Users\h473\Documents\Crawling\article_links.txt', "r")
for a in links:
    print(a)
    page = requests.get(a)
    soup = BeautifulSoup(page.text, 'lxml')
    html = soup.find(class_='article-wrap')
    if html == None:
        html = soup.find(class_='mag-article-wrap')
    text = get_text(html.text)
    print(text)
But I get an error saying:

---> text = get_text(html.text)
AttributeError: 'NoneType' object has no attribute 'text'
So I printed out the soup variable to see what its contents were. This is what I found for each link:
http://www3.asiainsurancereview.com//Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request …Run Code Online (Sandbox Code Playgroud) 如果我使用了提取的图像网址,则需要获取宽度和高度get('width'),但这似乎不起作用
From the image I extracted I need to get the width and height with get('width'), but this doesn't seem to work:

description = soup.find("div", id="module_product_detail")
img= description.find("img")
print(img.get('width'))
The output is None. The link looks like this:
<img alt="image" src="https://bos1.lightake.net:20011/UploadFiles/ShopSkus/1000x1000/Y2463/Y246302/sku_Y246302_1.jpg"/>
Do I need to use a regular expression here? The content I want looks like this:
<meta content="text I want to grab" name="description"/>
However, there are many tags that start with "meta content=", and I want the one that ends with name="description". I'm new to regex, but I think BS can handle this.
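No regex should be needed: BeautifulSoup can match on the name attribute directly, after which the content attribute is a plain dictionary-style lookup. A sketch, assuming soup is the parsed page:

desc = soup.find("meta", attrs={"name": "description"})
if desc is not None:
    print(desc["content"])   # -> "text I want to grab"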
I have the XML below, which I've saved in a file called movies.xml. I only need to convert certain values to JSON. For a direct conversion I could use xmltodict. I'm using etree and etree.XMLParser(). I'm trying to feed this into Elasticsearch afterwards. I've already successfully extracted a single node using the attrib method.
<?xml version="1.0" encoding="UTF-8" ?>
<collection>
<genre category="Action">
<decade years="1980s">
<movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
<format multiple="No">DVD</format>
<year>1981</year>
<rating>PG</rating>
<description>
'Archaeologist and adventurer Indiana Jones
is hired by the U.S. government to find the Ark of the
Covenant before the Nazis.'
</description>
</movie>
<movie favorite="True" title="THE KARATE KID">
<format multiple="Yes">DVD,Online</format>
<year>1984</year>
<rating>PG</rating>
<description>None provided.</description>
</movie>
<movie favorite="False" title="Back 2 the Future">
<format multiple="False">Blu-ray</format>
<year>1985</year>
<rating>PG</rating>
<description>Marty McFly</description>
</movie>
</decade>
<decade years="1990s">
<movie favorite="False" title="X-Men">
<format …Run Code Online (Sandbox Code Playgroud) 我正试图为影院网站制作一个刮刀,以收集电影名称列表.我试图使用BeautifulSoup来解析HTML文件,我看到每部电影都在一个名为的类中"movie-row".但是select在此类上使用该方法并未检索该站点的相应数据.我能够获得的HTML最接近的组件是父类.quickbook-section.
I'm trying to build a scraper for a cinema website to collect a list of movie names. I'm using BeautifulSoup to parse the HTML file, and I can see that each movie sits in a class called "movie-row". But using the select method on this class does not retrieve the corresponding data from the site. The closest HTML component I was able to get is the parent class .quickbook-section.

Why can I use some HTML tags with BS but not others?
Here is the code I wrote.
def get_movies_names():
    url = "https://www.yesplanet.co.il/#/buy-tickets-by-cinema?in-cinema=1025&at=2018-11-09&view-mode=list"
    raw_html = util.simple_get(url)
    bs = BeautifulSoup(raw_html, 'html.parser')
    return bs.select(".movie-row")
(simple_get is just a function that returns the content of the HTML response)
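A likely explanation, assuming the rows are injected by JavaScript (the #/ fragment in the URL suggests a client-rendered page): the raw HTML that a plain HTTP fetch returns contains only the static shell, so .quickbook-section is there but the .movie-row elements never are. One way around it, sketched with Selenium (assumes a chromedriver is available on the PATH):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.yesplanet.co.il/#/buy-tickets-by-cinema?in-cinema=1025&at=2018-11-09&view-mode=list")
# wait until the JavaScript has actually rendered the rows before grabbing the HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".movie-row")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print([row.get_text(strip=True) for row in soup.select(".movie-row")])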
I'm trying to scrape data from a web page with beautifulsoup and (eventually) output it to a csv. As a first step, I'm trying to get the text of the relevant table. I managed to do that, but when I re-ran it the code no longer gave me the same output: when the for loop runs, it doesn't save all 12372 records, only the last one.
An abbreviated version of my code is:
from bs4 import BeautifulSoup
BirthsSoup = BeautifulSoup(browser.page_source, features="html.parser")
print(BirthsSoup.prettify())
# this confirms that the soup has captured the page as I want it to
birthsTable = BirthsSoup.select('#t2 td')
# selects all the elements in the table I want
birthsLen = len(birthsTable)
# birthsLen: 12372
for i in range(birthsLen):
    print(birthsTable[i].prettify())
# this confirms that the beautifulsoup tag object correctly captured all of the table

for i in range(birthsLen):
    birthsText = birthsTable[i].getText()
# this was supposed to compile the text …
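The second loop rebinds birthsText to a fresh string on every iteration, so by the time it finishes only the last of the 12372 cells is left. A sketch that accumulates the text into a list instead:

birthsText = []
for cell in birthsTable:
    birthsText.append(cell.getText())   # keep every cell's text, not just the last
# birthsText now holds one string per <td>, ready to be written out to csv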
I'm scraping some web pages with selenium and beautifulsoup. I'm iterating over a bunch of links, getting the info, and then dumping it to JSON:

for event in events:
    case = {'Artist': item['Artist'], 'Date': item['Date'], 'Time': item['Time'], 'Venue': item['Venue'],
            'Address': item['Address'], 'Coordinates': item['Coordinates']}
    item[event] = case

with open("testScrape.json", "w") as writeJSON:
    json.dump(item, writeJSON, ensure_ascii=False)
When I get to this link: https://www.bandsintown.com/e/100778334-jean-deaux-music-at-rickshaw-stop?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event
the code breaks with the following error:
Traceback (most recent call last):
File "/Users/s/PycharmProjects/hi/BandsintownWebScraper.py", line 126, in <module>
json.dump(item, writeJSON, ensure_ascii=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 7: ordinal not in range(128)
I tried using:
json.dump(item, writeJSON, ensure_ascii=False).decode('utf-8')
and:
json.dump(item, writeJSON, ensure_ascii=False).encode('utf-8')
without success. I believe it's the ï character in the link that's causing this to fail. Can anyone give a brief rundown of what's going on, what encoding/decoding means here, and how to fix this problem? Thanks in advance.
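This is Python 2 behavior: with ensure_ascii=False, json.dump emits non-ASCII characters as unicode, but the file object from plain open() accepts only byte strings, so Python attempts an implicit ASCII encode and fails on u'\xe6' (æ). Chaining .encode or .decode onto json.dump can't work either, because json.dump writes to the file and returns None. A sketch that writes through an explicit UTF-8 codec instead:

import io
import json

# Python 2: json.dumps(..., ensure_ascii=False) yields a unicode string once
# non-ASCII characters are involved, and io.open's file object encodes
# unicode to UTF-8 bytes on write
with io.open("testScrape.json", "w", encoding="utf-8") as writeJSON:
    writeJSON.write(unicode(json.dumps(item, ensure_ascii=False)))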