I'm trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
At the moment I'm using BeautifulSoup, and my code looks like this:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table")
rows = table.findAll('tr')[3]
cols = rows.findAll('td')
roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string
entry = (roadtype, start, end, condition, reason, update)
print entry
The problem is with the start and end columns; they just print as "None".
Output:
(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', …
I've tried soup.find('!--'), but it doesn't seem to work. Thanks in advance.
Edit: Thanks for the tips on how to find all the comments. I have a follow-up question: how do I search within the comments specifically?
For example, I have the following comment tag:
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
All I really want is the <i>Wednesday 110518</i> part. "110518" is a date in YYMMDD format, and I'm inclined to use it as my search target. However, I don't know how to find something inside a specific comment tag.
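For the record, a minimal sketch of how that comment search could look, building on the snippet above (the regex and the re-parsing step are my own assumptions, written against the bs4 API):
import re
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html)
# Comment nodes are NavigableString subclasses, so filter text nodes by type
comments = soup.find_all(text=lambda t: isinstance(t, Comment))
# Keep only comments containing an <i> element with a "Weekday YYMMDD" date
date_pattern = re.compile(r'<i>\w+ (\d{6})</i>')
for comment in comments:
    if date_pattern.search(comment):
        # Re-parse the comment's contents as HTML to pull out the <i> text
        print(BeautifulSoup(comment).i.string)  # e.g. u'Wednesday 110518'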
I'm trying to write some strings to a file (the strings were handed to me by the HTML parser BeautifulSoup).
I can display them with "print", but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How do I get around this?
Here's the code:
print '"' + title.decode('utf-8', errors='ignore') + '",' \
' "' + title.decode('utf-8', errors='ignore') + '", ' \
'"' + desc.decode('utf-8', errors='ignore') + '")'
title and desc are returned by Beautiful Soup 3 (p[0].text and p[0].prettify), and as far as I can tell from the BeautifulSoup 3 documentation they come back UTF-8 encoded.
If I run
python.exe script.py > out.txt
I get the following error:
Traceback (most recent call last):
File "script.py", line 70, in <module>
'"' + desc.decode('utf-8', errors='ignore') + '")'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 264: ordinal not in range(128)
But if I run
python.exe script.py
there is no error. It only happens when an output file is specified.
How do I get well-formed UTF-8 data into the output file?
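One way out, sketched under the assumption that title and desc are already unicode objects (Python 2): never let the implicit ascii codec run, either by calling .encode('utf-8') explicitly before writing, or by opening the file through the codecs module so it encodes on write:
import codecs

# codecs.open returns a wrapper that UTF-8-encodes unicode strings on write
out = codecs.open('out.txt', 'w', encoding='utf-8')
out.write(u'"' + title + u'", "' + desc + u'"\n')
out.close()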
I ran sudo pip install BeautifulSoup4 and got a very optimistic-looking response:
Downloading/unpacking beautifulsoup4
Running setup.py egg_info for package beautifulsoup4
Installing collected packages: beautifulsoup4
Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4
Cleaning up..
But when I try import BeautifulSoup4 or from BeautifulSoup4 import BeautifulSoup4 in a script, Python says there is no module by that name.
> import BeautifulSoup
ImportError: No module named BeautifulSoup
Update: pip tells me beautifulsoup4 is in /usr/local/lib/python2.6/dist-packages, but I'm running 2.7.2+ (and print sys.path shows the 2.7 paths)... so now I need to figure out why pip put things in the wrong place.
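For what it's worth, the package and module names differ: the beautifulsoup4 package installs a module named bs4, so once the 2.6/2.7 mismatch is sorted out the import should look like this:
from bs4 import BeautifulSoup  # beautifulsoup4 installs the bs4 module

soup = BeautifulSoup('<p>Hello</p>')
print(soup.p.string)  # Hello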
I'm downloading an HTML page that defines data in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I want to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I've been looking at Beautiful Soup, but can't seem to find a method that returns the exact object without parsing.)
Thanks
Edit: Would it be feasible, and more correct, to do this with a Python headless browser (e.g., Ghost.py)?
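A full browser shouldn't be necessary for this. A minimal sketch, assuming the page source is in html and the assignment always ends with a semicolon as in the snippet above:
import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
# Find the script tag whose text assigns window.blog.data
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
# Strip the JavaScript assignment, keeping only the JSON literal
json_text = re.search(r'window\.blog\.data\s*=\s*(\{.*?\});',
                      script.string, re.DOTALL).group(1)
data = json.loads(json_text)
print(data['activity']['type'])  # prints: read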
I have the script below, which modifies href attributes in an HTML file (in the future it will be a list of HTML files in a directory). Using BeautifulSoup I've managed to access the tag values and modify them as I want, but I don't know how to save those changes back to the file.
import os
import re
from bs4 import BeautifulSoup
htmlDoc = open('adding_computer_c.html',"r+")
soup = BeautifulSoup(htmlDoc)
replacements= [ ('_', '-'), ('../tasks/', prefixUrl), ('../concepts/', prefixUrl) ]
for link in soup.findAll('a', attrs={'href': re.compile("../")}):
    newlink = str(link)
    for k, v in replacements:
        newlink = newlink.replace(k, v)
    extrachars = newlink[newlink.find("."):newlink.find(">")]
    newlink = newlink.replace(extrachars, '')
    link = newlink
    print(link)
    ##How do I save the link I have modified back to the HTML file?
print(soup)  ##prints the original html tree
htmlDoc.close()
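For reference, one possible fix, sketched under the assumption that only the href attributes need rewriting: mutate the tag's attribute in place (reassigning the loop variable to a string never touches the tree), then serialize the soup back out to the file:
for link in soup.findAll('a', attrs={'href': re.compile("../")}):
    href = link['href']
    for k, v in replacements:
        href = href.replace(k, v)
    link['href'] = href  # mutating the attribute updates the tree itself

with open('adding_computer_c.html', 'w') as out:
    out.write(str(soup))  # soup now serializes with the modified hrefs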
I'm trying to build a table scraper with BeautifulSoup. I wrote this Python code:
import urllib2
from bs4 import BeautifulSoup
url = "http://dofollow.netsons.org/table1.htm" # change to whatever your url is
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
for i in soup.find_all('form'):
    print i.attrs['class']
I need to scrape Nome, Cognome, Email.
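A sketch of what the extraction might look like, assuming (hypothetically) that each record sits in a <tr> whose first three <td> cells are Nome, Cognome and Email:
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) >= 3:  # skip header or malformed rows
        nome, cognome, email = (c.get_text(strip=True) for c in cells[:3])
        print('%s %s %s' % (nome, cognome, email))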
The problem: BeautifulSoup offers very limited support for CSS selectors. For example, the only supported pseudo-class is nth-of-type, and it only accepts numeric values; arguments like even or odd are not allowed.
Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?
Let's look at an example problem/use case: finding only the even rows in the following HTML:
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
</table>
With lxml.html and lxml.cssselect, this is easy to do with :nth-of-type(even):
from lxml.html import fromstring
from lxml.cssselect import CSSSelector
tree = fromstring(data)
sel = CSSSelector('tr:nth-of-type(even)')
print [e.text_content().strip() for e in sel(tree)]
But in BeautifulSoup:
print(soup.select("tr:nth-of-type(even)"))
it throws an error:
NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.
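As far as I know the selector engine isn't pluggable, but one pragmatic bridge (a workaround sketch, not a BeautifulSoup feature) is to serialize the soup and run the full selector through lxml.cssselect:
from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(str(soup))  # re-parse the soup's serialized HTML with lxml
sel = CSSSelector('tr:nth-of-type(even)')
print([e.text_content().strip() for e in sel(tree)])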
Note that we can also work around this with .find_all():
print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
I'm trying to get data from the public site asx.com.au.
The page http://www.asx.com.au/asx/research/company.do#!/ACB/details contains a div with class "view-content" that holds the information I need.
But when I fetch this page from Python with urllib2.urlopen, the div is empty:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.asx.com.au/asx/research/company.do#!/ACB/details'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
contentDiv = soup.find("div", {"class": "view-content"})
print(contentDiv)
# the result is an empty div:
# <div class="view-content" ui-view=""></div>
Is it possible to access the contents of that div programmatically?
Edit: per the comments, the content is rendered via Angular.js. Is it possible to trigger the rendering of that content from Python?
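A sketch of that route using Selenium (assumptions: the selenium package and a driver such as PhantomJS are installed, and the wait selector is a guess at the rendered markup):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://www.asx.com.au/asx/research/company.do#!/ACB/details')
# Wait until Angular has populated the view-content div with child elements
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.view-content *')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('div', {'class': 'view-content'}))
driver.quit()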