When I run the following code:

if substr in movie.lowercase:

I get the following error:

AttributeError: 'NavigableString' object has no attribute 'lowercase'

movie comes from here:

movie = row.td.div.h4.string

I tried changing it (without success) to:

movie = row.td.div.h4.string.string

or

movie = unicode(row.td.div.h4.string)

Do you know how to convert a NavigableString into a plain unicode string so that I can lowercase it?
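In bs4 a NavigableString is a str subclass, so converting it with str() (or calling .lower() on it directly) appears to be enough. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h4>The Matrix</h4>", "html.parser")
movie = soup.h4.string     # a NavigableString
text = str(movie)          # now a plain string
print("matrix" in text.lower())  # True
```

The string method is lower(), not lowercase — the AttributeError would occur on a plain string as well.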
I am trying to parse some text so that I can urlize (wrap in tags) links that are not formatted. Here is some sample text:
text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'
Here is what I have so far:
from django.utils.html import urlize
from bs4 import BeautifulSoup
...
def urlize_html(text):
    soup = BeautifulSoup(text, "html.parser")
    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)
    return str(soup)
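One way to keep links that are already wrapped from being processed again is to skip text nodes that have an <a> ancestor. A sketch, using a simple regex stand-in for django's urlize (names here are illustrative, not django's API):

```python
import re
from bs4 import BeautifulSoup

def simple_urlize(text):
    # Stand-in for django.utils.html.urlize: wrap bare http(s) URLs.
    return re.sub(r'(https?://\S+)', r'<a href="\1">\1</a>', text)

def urlize_html_safe(html):
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        if node.find_parent("a") is not None:
            continue  # already link text; leave it alone
        new_html = simple_urlize(str(node))
        if new_html != str(node):
            # Re-parse so the <a> markup is inserted as tags, not escaped text.
            node.replace_with(BeautifulSoup(new_html, "html.parser"))
    return str(soup)

text = ('<p>This is a <a href="https://google.com">link</a>, '
        'bare: https://google.com</p>')
print(urlize_html_safe(text))
```

The find_parent("a") check is what prevents the double wrapping described below.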
But this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:

<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a …

I am trying to scrape a page that has a section like this:
<a name="id_631"></a>
<hr>
<div class="store-class">
<div>
<span><strong>Store City</strong></span>
</div>
<div class="store-class-content">
<p>Event listing</p>
<p>Event listing2</p>
<p>Event listing3</p>
</div>
<div>
Stuff about contact info
</div>
</div>
The page is a list of sections like this one, and the only way to tell them apart is the name attribute in the <a> tag.
So I figured I would target that, then go to the next_sibling <hr>, and then to the next sibling again to get the <div class="store-class"> section. All I want is the information in that div.
I'm not sure how to target that <a> tag and then move down two siblings. When I try print(soup.find_all('a', {"name":"id_631"})), it just gives me the contents of the tag, which is nothing.
Here is my script:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.tandyleather.com/en/leathercraft-classes")
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
Run Code Online (Sandbox Code Playgroud)
But I get this error:
Traceback (most recent call last):
File "tandy.py", line 8, in <module>
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
AttributeError: 'NoneType' object has no …
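The find() call returns None because the anchor's attribute is name, not id, and name cannot be passed as a keyword since find() reserves it for the tag name; matching it through attrs appears to work. A sketch against the sample HTML:

```python
from bs4 import BeautifulSoup

html = """
<a name="id_631"></a>
<hr>
<div class="store-class">
  <div><span><strong>Store City</strong></span></div>
  <div class="store-class-content"><p>Event listing</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# `name` is reserved for the tag name, so match the attribute via attrs.
anchor = soup.find("a", attrs={"name": "id_631"})
# find_next_sibling skips the <hr> because it doesn't match the filter.
section = anchor.find_next_sibling("div", class_="store-class")
print(section.find("p").get_text())  # Event listing
```

Because find_next_sibling takes a filter, there is no need to step over the <hr> manually.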
city = soup.select('a[href="/city/london d12"]')
The code above produces this error message:
ValueError: Unsupported or invalid CSS selector: "a[href=/city/london"
I'm wondering if there is a workaround, or an alternative to Beautiful Soup? The element looks like this:
<a title="London" href="/city/london d12">london</a>
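One workaround that avoids the selector parser entirely is to match the href attribute directly with find_all; newer bs4 versions (backed by soupsieve) also accept the quoted selector as written. A sketch:

```python
from bs4 import BeautifulSoup

html = '<a title="London" href="/city/london d12">london</a>'
soup = BeautifulSoup(html, "html.parser")

# Match the attribute value directly instead of going through a CSS
# selector, so the space in the href needs no selector quoting.
city = soup.find_all("a", href="/city/london d12")
print(city[0]["title"])  # London
```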
I am using beautifulsoup4 to extract the price tag from a website. The code I am using is this:
# price
try:
    price = soup.find('span', {'id': 'actualprice'})
    price_result = str(price.get_text())
    print "Price: ", price_result
except StandardError as e:
    price_result = "Error was {0}".format(e)
    print price_result
The output I'm getting is a string formatted with commas. For example:
82,000,00
What I want is:

to change the price from a comma-formatted string into a plain integer, so that I can use the values in Excel.
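Assuming the scraped string looks like the sample above, stripping the commas before converting appears to be enough:

```python
price_result = "82,000,00"  # sample value from the question

# Remove the commas, then convert the remaining digits to an int.
price_int = int(price_result.replace(",", ""))
print(price_int)  # 8200000
```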
I want to get a list of all the distinct tag names in an HTML document (a list of tag-name strings, with no repeats). I tried passing an empty argument to soup.findAll(), but that gave me the entire document.
Is there a way to do this?
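Passing True to find_all matches every tag, so collecting the names into a set gives the distinct tag names. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>one</p><p>two</p><a href='#'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) yields every tag; a set comprehension removes duplicates.
names = sorted({tag.name for tag in soup.find_all(True)})
print(names)  # ['a', 'body', 'html', 'p']
```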
I'm trying to get the transcripts of some YouTube videos for NLP work. I think I'm doing reasonably well, but there are a couple of problems. For example:
from xml.etree import cElementTree as ET
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen

URL = 'http://video.google.com/timedtext?lang=en&v=KDHuWxy53uM'

def make_soup(url):
    html = urlopen(url).read()
    return bs(html, "lxml")

soup = make_soup(URL)
takeaways = soup.findAll('text')
All_text = []
for i in takeaways:
    root = ET.fromstring(str(i))
    reslist = list(root.iter())
    try:
        result = ' '.join([element.text for element in reslist])
    except:
        pass
    All_text.append(result)
An example result from one of the lines:
'Let's learn a little bit\nabout the dot product.'
This seems to get the transcript, but I'm also getting \n, the XML newline character, and I'm getting a strange character in place of apostrophes, which I assume is an encoding issue.
Does anyone know how I can clean up both of these?
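Both issues look like ordinary post-processing: the \n characters can be replaced with spaces, and the odd characters in place of apostrophes are likely HTML entities such as &#39; that html.unescape can decode. A sketch in Python 3 (the question's code is Python 2, so this is a port of the cleanup step only):

```python
from html import unescape

raw = "Let&#39;s learn a little bit\nabout the dot product."

# Decode HTML entities, then collapse the XML newlines into spaces.
clean = unescape(raw).replace("\n", " ")
print(clean)  # Let's learn a little bit about the dot product.
```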
import urllib.request
import urllib
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
print(soup.title)
I'm trying to reach the site above, but the code keeps spitting out a 403 Forbidden error.
Any ideas?
C:\Users\jerem\AppData\Local\Programs\Python\Python35-32\python.exe "C:/Users/jerem/PycharmProjects/webscraper/url scraper.py"
Traceback (most recent call last):
  File "C:/Users/jerem/PycharmProjects/webscraper/url scraper.py", line 7, in <module>
    page = urllib.request.urlopen(url)
  File "C:\Users\jerem\AppData\Local\Programs\Python\Python35-32\lib\urllib\…
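Sites often return 403 to the default Python User-Agent; sending a browser-like header sometimes helps, though the site may still block automated requests. A sketch (the header value is an arbitrary example, and the network call is left commented out):

```python
import urllib.request

url = "https://www.brightscope.com/ratings"
# Send a browser-like User-Agent; the default urllib one is often blocked.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
# page = urllib.request.urlopen(req)
```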
I'm trying to create a function that takes several arguments and returns a callable lambda function. I pass these lambda functions to BeautifulSoup's find_all method to parse HTML.
Here is the function I wrote to generate the lambda functions:
def tag_filter_function(self, name="", search_terms={}, attrs=[], **kwargs):
    # filter attrs that are in the search_terms keys out of attrs
    attrs = [attr for attr in attrs if attr not in search_terms.keys()]
    # array of strings to compile into a lambda function
    exec_strings = []
    # add name search into exec_strings
    if len(name) > 0:
        tag_search_name = "tag.name == \"{}\"".format(name)
        exec_strings.append(tag_search_name)
    # add generic search terms into exec_strings
    if len(search_terms) > 0:
        tag_search_terms = ' and '.join(["tag.has_attr(\"{}\") and …
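Whatever the generator ultimately builds, find_all accepts any callable that takes a tag and returns a boolean, so a hand-written lambda shows the target shape:

```python
from bs4 import BeautifulSoup

html = '<div id="x">one</div><span id="y">two</span><div>three</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all accepts a callable; it keeps every tag for which it returns True.
matches = soup.find_all(lambda tag: tag.name == "div" and tag.has_attr("id"))
print([t.get_text() for t in matches])  # ['one']
```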
因此,在出现soupHTML 5问题的情况下,我使用正确的语法隔离了标签中的出现:
tags = soup.find_all(attrs={"data-topic":"recUpgrade"})
Taking only tags[1]:
date = tags[1].find(attrs={"data-datenews":True})
date is:
<span class="invisible" data-datenews="2018-05-25 06:02:19" data-idnews="2736625" id="horaCompleta"></span>
But now I want to extract the datetime "2018-05-25 06:02:19", and I can't get the syntax right.
Please provide insight/help.
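Tag attributes are read with dictionary-style indexing, so the value can be taken straight off the matched tag. A sketch against the span above:

```python
from bs4 import BeautifulSoup

html = ('<span class="invisible" data-datenews="2018-05-25 06:02:19" '
        'data-idnews="2736625" id="horaCompleta"></span>')
soup = BeautifulSoup(html, "html.parser")

# Dictionary-style access returns the attribute's string value.
date = soup.find(attrs={"data-datenews": True})
print(date["data-datenews"])  # 2018-05-25 06:02:19
```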