I'm trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
At the moment I'm using BeautifulSoup, and my code looks like this:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table")
rows = table.findAll('tr')[3]
cols = rows.findAll('td')
roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string
entry = (roadtype, start, end, condition, reason, update)
print entry
The problem is with the start and end columns; they just print as "None".
Output:
(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', …
I've tried soup.find('!--'), but it doesn't seem to work. Thanks in advance.
Edit: Thanks for the tips on how to find all the comments. I have a follow-up question: how do I search within the comments specifically?
For example, I have the following comment tag:
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
All I really want is the <i>Wednesday 110518</i> part. "110518" is a date in YYMMDD format, and I'm inclined to use it as my search target. However, I don't know how to find something inside a specific comment tag.
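For the record, a minimal sketch of how that comment search could look, building on the snippet above (the regex and the re-parsing step are my own assumptions, written against the bs4 API):
import re
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html)
# Comment nodes are NavigableString subclasses, so filter text nodes by type
comments = soup.find_all(text=lambda t: isinstance(t, Comment))
# Keep only comments containing an <i> element with a "Weekday YYMMDD" date
date_pattern = re.compile(r'<i>\w+ (\d{6})</i>')
for comment in comments:
    if date_pattern.search(comment):
        # Re-parse the comment's contents as HTML to pull out the <i> text
        print(BeautifulSoup(comment).i.string)  # e.g. u'Wednesday 110518'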
I'm trying to write some strings to a file (the strings were handed to me by the HTML parser BeautifulSoup).
I can display them with "print", but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How do I get around this?
Here's the code:
print '"' + title.decode('utf-8', errors='ignore') + '",' \
' "' + title.decode('utf-8', errors='ignore') + '", ' \
'"' + desc.decode('utf-8', errors='ignore') + '")'
title and desc are returned by Beautiful Soup 3 (p[0].text and p[0].prettify), and as far as I can tell from the BeautifulSoup 3 documentation they come back UTF-8 encoded.
If I run
python.exe script.py > out.txt
I get the following error:
Traceback (most recent call last):
File "script.py", line 70, in <module>
'"' + desc.decode('utf-8', errors='ignore') + '")'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 264: ordinal not in range(128)
But if I run
python.exe script.py
there is no error. It only happens when an output file is specified.
How do I get well-formed UTF-8 data into the output file?
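One way out, sketched under the assumption that title and desc are already unicode objects (Python 2): never let the implicit ascii codec run, either by calling .encode('utf-8') explicitly before writing, or by opening the file through the codecs module so it encodes on write:
import codecs

# codecs.open returns a wrapper that UTF-8-encodes unicode strings on write
out = codecs.open('out.txt', 'w', encoding='utf-8')
out.write(u'"' + title + u'", "' + desc + u'"\n')
out.close()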
I ran sudo pip install BeautifulSoup4 and got a very optimistic-looking response:
Downloading/unpacking beautifulsoup4
Running setup.py egg_info for package beautifulsoup4
Installing collected packages: beautifulsoup4
Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4
Cleaning up..
But when I try import BeautifulSoup4 or from BeautifulSoup4 import BeautifulSoup4 in a script, Python says there is no module by that name.
> import BeautifulSoup
ImportError: No module named BeautifulSoup
Update: pip tells me beautifulsoup4 is in /usr/local/lib/python2.6/dist-packages, but I'm running 2.7.2+ (and print sys.path shows the 2.7 paths)... so now I need to figure out why pip put things in the wrong place.
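For what it's worth, the package and module names differ: the beautifulsoup4 package installs a module named bs4, so once the 2.6/2.7 mismatch is sorted out the import should look like this:
from bs4 import BeautifulSoup  # beautifulsoup4 installs the bs4 module

soup = BeautifulSoup('<p>Hello</p>')
print(soup.p.string)  # Hello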
I'm downloading an HTML page that defines data in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I want to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I've been looking at Beautiful Soup, but can't seem to find a method that returns the exact object without parsing.)
Thanks
Edit: Would it be feasible, and more correct, to do this with a Python headless browser (e.g., Ghost.py)?
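A full browser shouldn't be necessary for this. A minimal sketch, assuming the page source is in html and the assignment always ends with a semicolon as in the snippet above:
import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
# Find the script tag whose text assigns window.blog.data
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
# Strip the JavaScript assignment, keeping only the JSON literal
json_text = re.search(r'window\.blog\.data\s*=\s*(\{.*?\});',
                      script.string, re.DOTALL).group(1)
data = json.loads(json_text)
print(data['activity']['type'])  # prints: read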
I have the script below, which modifies href attributes in an HTML file (in the future it will be a list of HTML files in a directory). Using BeautifulSoup I've managed to access the tag values and modify them as I want, but I don't know how to save those changes back to the file.
import os
import re
from bs4 import BeautifulSoup
htmlDoc = open('adding_computer_c.html',"r+")
soup = BeautifulSoup(htmlDoc)
replacements= [ ('_', '-'), ('../tasks/', prefixUrl), ('../concepts/', prefixUrl) ]
for link in soup.findAll('a', attrs={'href': re.compile("../")}):
    newlink = str(link)
    for k, v in replacements:
        newlink = newlink.replace(k, v)
    extrachars = newlink[newlink.find("."):newlink.find(">")]
    newlink = newlink.replace(extrachars, '')
    link = newlink
    print(link)
    ##How do I save the link I have modified back to the HTML file?
print(soup)  ##prints the original html tree
htmlDoc.close()
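For reference, one possible fix, sketched under the assumption that only the href attributes need rewriting: mutate the tag's attribute in place (reassigning the loop variable to a string never touches the tree), then serialize the soup back out to the file:
for link in soup.findAll('a', attrs={'href': re.compile("../")}):
    href = link['href']
    for k, v in replacements:
        href = href.replace(k, v)
    link['href'] = href  # mutating the attribute updates the tree itself

with open('adding_computer_c.html', 'w') as out:
    out.write(str(soup))  # soup now serializes with the modified hrefs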
I'm trying to build a table scraper with BeautifulSoup. I wrote this Python code:
import urllib2
from bs4 import BeautifulSoup
url = "http://dofollow.netsons.org/table1.htm" # change to whatever your url is
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
for i in soup.find_all('form'):
    print i.attrs['class']
I need to scrape Nome, Cognome, Email.
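A sketch of what the extraction might look like, assuming (hypothetically) that each record sits in a <tr> whose first three <td> cells are Nome, Cognome and Email:
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) >= 3:  # skip header or malformed rows
        nome, cognome, email = (c.get_text(strip=True) for c in cells[:3])
        print('%s %s %s' % (nome, cognome, email))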
The problem: BeautifulSoup offers very limited support for CSS selectors. For example, the only supported pseudo-class is nth-of-type, and it only accepts numeric values; arguments like even or odd are not allowed.
Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?
Let's look at an example problem/use case: finding only the even rows in the following HTML:
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
</table>
With lxml.html and lxml.cssselect, this is easy to do with :nth-of-type(even):
from lxml.html import fromstring
from lxml.cssselect import CSSSelector
tree = fromstring(data)
sel = CSSSelector('tr:nth-of-type(even)')
print [e.text_content().strip() for e in sel(tree)]
But in BeautifulSoup:
print(soup.select("tr:nth-of-type(even)"))
it throws an error:
NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.
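As far as I know the selector engine isn't pluggable, but one pragmatic bridge (a workaround sketch, not a BeautifulSoup feature) is to serialize the soup and run the full selector through lxml.cssselect:
from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(str(soup))  # re-parse the soup's serialized HTML with lxml
sel = CSSSelector('tr:nth-of-type(even)')
print([e.text_content().strip() for e in sel(tree)])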
Note that we can also work around this with .find_all():
print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
I'm trying to get data from the public site asx.com.au.
The page http://www.asx.com.au/asx/research/company.do#!/ACB/details contains a div with class "view-content" that holds the information I need.
But when I fetch this page from Python with urllib2.urlopen, the div is empty:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.asx.com.au/asx/research/company.do#!/ACB/details'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
contentDiv = soup.find("div", {"class": "view-content"})
print(contentDiv)
# the result is an empty div:
# <div class="view-content" ui-view=""></div>
Is it possible to access the contents of that div programmatically?
Edit: per the comments, the content is rendered via Angular.js. Is it possible to trigger the rendering of that content from Python?
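A sketch of that route using Selenium (assumptions: the selenium package and a driver such as PhantomJS are installed, and the wait selector is a guess at the rendered markup):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://www.asx.com.au/asx/research/company.do#!/ACB/details')
# Wait until Angular has populated the view-content div with child elements
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.view-content *')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('div', {'class': 'view-content'}))
driver.quit()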