On my HTML page I have a dropdown list:
<select name="somelist">
<option value="234234234239393">Some Text</option>
</select>
So to get hold of this list I'm doing:
ddl = soup.findAll('select', attrs={'name': 'somelist'})  # 'name' must go in attrs; findAll's own name argument is the tag name
if ddl:
    ???
Now I need help with this collection/dictionary: I want to be able to look things up both by 'Some Text' and by 234234234239393.
Is this possible?
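A minimal sketch of one way this could work, assuming BeautifulSoup 3 as above and that html holds the page source: build a dict from each option's value to its text, plus the reverse dict for lookups in the other direction.

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)  # html holds the page source
select = soup.find('select', attrs={'name': 'somelist'})
if select:
    # Map value -> text, e.g. {u'234234234239393': u'Some Text'}
    by_value = dict((opt['value'], opt.string)
                    for opt in select.findAll('option'))
    # Reverse map text -> value for the other direction
    by_text = dict((text, value) for value, text in by_value.items())
    print by_value['234234234239393']  # Some Text
    print by_text['Some Text']         # 234234234239393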
In this code:
soup=BeautifulSoup(program.Description.encode('utf-8'))
name=soup.find('div',{'class':'head'})
print name.string.decode('utf-8')
An error occurs when I try to print it or save it to the database. It doesn't matter what I do:
print name.string.encode('utf-8')
or just:
print name.string
Traceback (most recent call last):
File "./manage.py", line 16, in <module>
execute_manager(settings)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
utility.execute()
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
output = self.handle(*args, **options)
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
self.FirstTimeLoad()
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
print name.string.decode('utf-8')
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, …
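For what it's worth, a minimal sketch of the usual cause of this traceback: name.string is already a unicode object, and calling .decode('utf-8') on unicode in Python 2 makes the interpreter implicitly encode it to ASCII first, which raises on non-ASCII text. Skipping the redundant decode avoids that step:

text = name.string          # already unicode; do not decode it again
print text.encode('utf-8')  # encode explicitly for a UTF-8 terminal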
Can someone tell me how to get a list of the absolute paths of all the images on a web page using BeautifulSoup?

Getting all the images is simple. I do this:
page_images = [image["src"] for image in soup.findAll("img")]
...but I'm having trouble getting the absolute paths. Any help?
Thanks.
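A minimal sketch of one approach, assuming the page's own URL is known: urlparse.urljoin() resolves each src, relative or absolute, against the page URL. The page URL below is a hypothetical placeholder.

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

url = "http://example.com/some/page.html"  # hypothetical page URL
soup = BeautifulSoup(urllib2.urlopen(url).read())
page_images = [urlparse.urljoin(url, image["src"])
               for image in soup.findAll("img")]

urljoin leaves already-absolute URLs untouched, so the list comes out absolute either way.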
I have watched some webcasts and need help doing this: I have been using lxml.html. Yahoo recently changed its page structure.
Target page:
http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true
In Chrome, using the inspector, I see the data at:
//*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table
Then, some code: how do I get this data out into a list? I want to change to other stocks, from "LLY" to "MSFT". And how do I switch between dates... and get all the months?
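A minimal sketch under the assumption that the XPath from the inspector still matches the HTML the server actually returns (Yahoo may build the table with JavaScript, in which case the static page won't contain it and a browser-driving tool would be needed):

import lxml.html

symbol = "IBM"       # swap in "LLY", "MSFT", etc.
date = 1469750400    # Unix timestamp selecting the expiry month
url = ("http://finance.yahoo.com/quote/%s/options?date=%d&straddle=true"
       % (symbol, date))
tree = lxml.html.parse(url)
rows = tree.xpath('//*[@id="main-0-Quote-Proxy"]/section/section'
                  '/div[2]/section/section/table//tr')
data = [[cell.text_content().strip() for cell in row.xpath('./th|./td')]
        for row in rows]

Iterating over the date parameter, one timestamp per expiry, would cover all the months.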
I'm writing a script that scrapes games from the Team Liquid database of international StarCraft 2 games (http://www.teamliquid.net/tlpd/sc2-international/games).

But I've run into a problem. My script loops over all the pages, but the Team Liquid site uses what I believe is some kind of AJAX to update the table. Now when I use BeautifulSoup I can't get the right data.

So I loop over these pages:
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC etc...
When you open them yourself you see different pages, but my script keeps getting the same first page every time. I think that's because when you open one of the other pages you briefly see a loading indicator before the games table updates to the right page. So I'm guessing BeautifulSoup is too fast and needs to wait for the table to load and update.

So my question is: how can I make sure it gets the updated table?

I currently use this code to get the contents of the table, which I then write to a .csv:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen(url).read().lower()
bs = BeautifulSoup(html)
table = bs.find(lambda tag: tag.name == 'table' and tag.has_key('id')
                and tag['id'] == "tblt_table")
rows = table.findAll(lambda tag: tag.name == 'tr')
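One thing worth knowing: everything after # in a URL is a fragment, and fragments are never sent to the server, so urlopen() fetches the identical page every time; the table is filled in afterwards by JavaScript. BeautifulSoup being "too fast" isn't the issue; the HTML it sees simply never contains the other pages. A sketch of the usual workaround, assuming you locate the real AJAX request in the browser's network tab (the endpoint parameter below is a hypothetical placeholder):

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

for page in range(1, 5):
    # Hypothetical query parameter: copy the real XHR URL from the
    # browser's network tab and substitute the page number into it.
    url = ("http://www.teamliquid.net/tlpd/sc2-international/games"
           "?tblt-page=%d" % page)
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find('table', {'id': 'tblt_table'})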
I'm trying to put a list of URLs into a CSV file; I'm scraping them from a web page using urllib2 and BeautifulSoup. I've tried writing the links to the CSV file as unicode, and converted to utf-8. In both cases, every single letter gets inserted into a new field.
Here's my code (I've tried at least these two ways):
import csv

f = open('filename', 'wb')
w = csv.writer(f, delimiter=',')
for link in links:
    w.writerow(link['href'])
and:
f = open('filename', 'wb')
w = csv.writer(f, delimiter=',')
for link in links:
    w.writerow(link['href'].encode('utf-8'))
links is a list that looks like this:
[<a href="#Flyout1" accesskey="2" class="quicklinks" tabindex="1" title="Skip to content">Quick Links: Skip to main page content</a>, <a href="#search" class="quicklinks" tabindex="1" title="Skip to search">Skip to Search</a>, <a href="#News" class="quicklinks" tabindex="1" title="Skip to Section table of contents">Skip to Section Content Menu</a>, <a href="#footer" class="quicklinks" tabindex="1" title="Skip to site options">Skip to Common Links</a>, <a href="http://www.hhs.gov"><img src="/ucm/groups/fdagov-public/@system/documents/system/img_fdagov_hhs_gov.png" alt="www.hhs.gov link" …Run Code Online (Sandbox Code Playgroud) 我不熟悉使用python进行网络抓取,所以我不知道我是否做得对.
I'm not familiar with web scraping in Python, so I don't know whether I'm doing this right.

I'm using a script that calls BeautifulSoup to parse the URLs out of the first 10 pages of a Google search. Tested on stackoverflow.com, it worked out of the box. I then tested it a few times on another site, trying to see whether the script really worked with higher Google page requests, and it 503'd on me. I switched to another URL to test; it worked for a few low-page requests, then also 503'd. Now every URL I pass to it 503s. Any suggestions?
import sys      # Used to add the BeautifulSoup folder to the import path
import urllib2  # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder at the level of this Python
    ### script, so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup …
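For what it's worth, an HTTP 503 from Google usually means it has flagged the traffic as automated and is throttling it; no header change fixes that reliably, and scraping result pages is against Google's terms of service (their official search APIs are the sanctioned route). A sketch of the gentlest mitigation, spacing out the requests (the delay value is an arbitrary assumption):

import time
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
for start in range(0, 100, 10):  # 10 result pages, 10 hits per page
    url = "http://www.google.com/search?q=example&start=%d" % start
    html = opener.open(url).read()
    time.sleep(15)               # pause between requests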
Is there a way to extract an HTML tag's attributes using urllib, urllib2, or BeautifulSoup?

For example:
<a href="xyz" title="xyz">xyz</a>
and get href=xyz, title=xyz.
There's another thread that discusses using regular expressions.

Thanks.
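A minimal sketch with BeautifulSoup 3, which exposes a tag's attributes like dictionary entries (no regex needed):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<a href="xyz" title="xyz">xyz</a>')
a = soup.find('a')
print a['href']   # xyz
print a['title']  # xyz
print a.attrs     # [(u'href', u'xyz'), (u'title', u'xyz')] in BeautifulSoup 3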
I'm using BeautifulSoup to replace all the commas in an HTML file with ‚. Here's my code:
import re
import sys
from BeautifulSoup import BeautifulSoup

f = open(sys.argv[1], "r")
data = f.read()
soup = BeautifulSoup(data)
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '‚'))
This code works, except when the HTML file has some JavaScript in it. In that case it replaces the commas in the JavaScript code as well, which is not wanted. I only want to replace the commas in the HTML file's text content.
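A minimal sketch of one way to skip scripts, building on the snippet above: check each matched text node's parent before replacing, so text inside <script> (and <style>) blocks is left alone.

comma = re.compile(',')
for t in soup.findAll(text=comma):
    if t.parent.name not in ('script', 'style'):  # only touch real page text
        t.replaceWith(t.replace(',', '‚'))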
I have a URL and I want to parse part of it out, specifically the widgetid:
<a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>
I've written this Python (I'm a bit new to Python; the version is 2.7):
import re
from bs4 import BeautifulSoup
doc = open(r'c:\Python27\some_xml_file.txt')  # raw string so the backslashes are not treated as escapes
soup = BeautifulSoup(doc)
links = soup.findAll('a')
# debugging statements
print type(links[7])
# output: <class 'bs4.element.Tag'>
print links[7]
# output: <a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>
theURL = links[7].attrs['href']
print theURL
# output: http://www.somesite.com/process.asp?widgetid=4530
print type(theURL)
# output: <type 'unicode'>
is_widget_url = re.compile('[0-9]')
print is_widget_url.match(theURL)
# output: None (I know this isn't the correct regex but I'd think it
# would match if there's any number in there!) …
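A minimal sketch of the actual problem: re.match() anchors at the start of the string, so '[0-9]' can never match a URL that begins with 'http'. re.search() scans the whole string, and a capture group pulls out the id:

import re

the_url = u'http://www.somesite.com/process.asp?widgetid=4530'
m = re.search(r'widgetid=([0-9]+)', the_url)
if m:
    print m.group(1)  # 4530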
python ×10
csv ×1
dictionary ×1
encoding ×1
extract ×1
html-parsing ×1
lxml ×1
regex ×1
unicode ×1
url ×1
utf-8 ×1
web-scraping ×1
yahoo ×1