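Tag: beautifulsoup

Scraping a table that has no class with BS4 in Python

I have the following code, which tries to scrape data from a table that has no class attribute, on a web page that contains many other unneeded tables.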

from bs4 import BeautifulSoup
import urllib2
import re
wiki = "http://www.maditssia.com/members/list.php?p=1&id=Engineering%20Industries"
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
title = ""
address = ""
contact = ""
phone = ""
description=""
email=""
table = soup.find("table")
#print table.text
#print re.sub(r'\s+',' ',''.join(table.text).encode('utf-8'))
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) >= 7:
        title = cells[0].find(text=True)
        address = cells[1].find(text=True)
        contact = cells[2].find(text=True)
        phone = cells[3].find(text=True)
        email = cells[4].find(text=True)
        description = cells[5].find(text=True)
        data = title + "," + …
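
Since soup.find("table") only returns the first table on the page, one way to reach a table that has no class is to pick it by its shape instead. The following is a minimal Python 3 sketch, assuming the wanted table is the first one that has a row with at least 7 cells; the HTML sample and the helper name find_table_by_width are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one layout table and one data table.
HTML = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><td>Acme</td><td>1 Main St</td><td>Mr. A</td><td>123</td><td>a@x.com</td><td>Pumps</td><td>-</td></tr>
  <tr><td>Bolt</td><td>2 Side St</td><td>Ms. B</td><td>456</td><td>b@x.com</td><td>Valves</td><td>-</td></tr>
</table>
"""


def find_table_by_width(soup, min_cells):
    """Return the first table containing a row with at least min_cells <td> cells."""
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            if len(row.find_all("td")) >= min_cells:
                return table
    return None


soup = BeautifulSoup(HTML, "html.parser")
table = find_table_by_width(soup, 7)

rows = []
for row in table.find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 7:
        rows.append(",".join(cells[:6]))

print(rows)
```

If the target table has a recognisable header cell instead, the same loop could test for that text rather than the cell count.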

python beautifulsoup web-scraping python-2.7

Ven*_*raj

1
score

1
answer

2590
views

Beautiful Soup: accessing the second <div> with the same class

I am scraping an HTML document that contains two "hooks" with the same class, as shown below:

<div class="multiRow">
    <!--ModuleId 372329FileName @swMultiRowsContainer-->
    <some more content>
</div>
<div class="multiRow">
    <!--ModuleId 372330FileName @multiRowsContainer-->
    <some more content>
</div>

When I do:

mr = ct[0].find_all('div', {'class': 'multiRow'})

I only get the content of the first one. Is there a way to access the content of the second?

Thanks!
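
For reference, find_all returns a list-like ResultSet, so when both divs are matched the second one is simply index 1. Here is a minimal Python 3 sketch; the wrapping div with id "ct" is a hypothetical stand-in for ct[0], and the inner content is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the outer div plays the role of ct[0] in the question.
HTML = """
<div id="ct">
  <div class="multiRow"><p>first</p></div>
  <div class="multiRow"><p>second</p></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
container = soup.find("div", id="ct")

# find_all returns every match in document order
mr = container.find_all("div", {"class": "multiRow"})
second_text = mr[1].get_text(strip=True)

print(len(mr), second_text)
```

If len(mr) is 1 on the real page, the second div is probably not a descendant of ct[0], or the parser built a different tree than expected; searching from the soup object itself is a quick way to check.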

html css python beautifulsoup web-scraping

Anu*_*nuj

1
score

1
answer

10k
views

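BeautifulSoup: how do I find a string when the tag or element is unknown?

Just as the title says. Is there any way to search the whole DOM for specific text, for example the word CAPTCHA?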
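
One way to do this is to search on text nodes rather than on tags. The sketch below (Python 3, with an invented HTML sample) passes a compiled regex as the string argument of find_all; older bs4 releases call the same argument text.

```python
import re

from bs4 import BeautifulSoup

# Invented sample page
HTML = """
<html><body>
  <div><span>Please solve the CAPTCHA below</span></div>
  <p>Nothing to see here</p>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# With no tag name given, find_all scans every text node in the document
matches = soup.find_all(string=re.compile("CAPTCHA"))

# Each match is a NavigableString; .parent is the tag that contains it
parent_names = [m.parent.name for m in matches]

print(matches, parent_names)
```

A plain string instead of a regex would only match text nodes that equal it exactly, which is why the regex form is used here.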

html python beautifulsoup html-parsing web-scraping

Vol*_*il3

2014 05-06

1
score

1
answer

1141
views

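Parsing XML from a large file with Python

I have a 50 MB XML file that I need to read some data from. My approach was to use BeautifulSoup 4, since I have been using that package for a while. This code shows how I am doing it: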

from bs4 import BeautifulSoup

# since the file is big, this line takes minutes to execute
soup = BeautifulSoup(open('myfile.xml'), 'xml')

items = soup.find_all('item')

for item in items:
    name = item['name']
    status = item.find('status').text
    description = item.find('desc').text
    refs = item.findAll('ref')
    data = []
    for ref in refs:
        if 'url' in ref.attrs:
            data.append('%s:%s' % (ref['source'], ref['url']))
        else:
            data.append('%s:%s' % (ref['source'], ref.text))

    do_something(data)

The file is not complex XML; I just need to read the data of every <item> entry:

<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc> …
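
For a file this size, a streaming parser is a common alternative to building the whole tree first. The following is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree.iterparse, not BeautifulSoup. The part of the XML after <desc> is elided in the question, so the <refs>/<ref> layout and the <items> root used here are assumptions made only so the example runs.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the <items> root and the <refs>/<ref> layout are assumed.
XML = b"""<items>
<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc>
  <refs>
    <ref source="SRC1" url="http://example.com/a">ignored</ref>
    <ref source="SRC2">plain-ref</ref>
  </refs>
</item>
</items>"""


def iter_items(source):
    """Yield (name, status, desc, refs) for each <item>, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        refs = []
        for ref in elem.iter("ref"):
            # Prefer the url attribute, fall back to the element text
            value = ref.get("url", ref.text)
            refs.append("%s:%s" % (ref.get("source"), value))
        yield elem.get("name"), elem.findtext("status"), elem.findtext("desc"), refs
        elem.clear()  # drop the finished subtree to keep memory use low


results = list(iter_items(io.BytesIO(XML)))
print(results)
```

With a real file, iter_items('myfile.xml') would take the path directly, since iterparse accepts either a filename or a file object.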

python xml beautifulsoup

Pep*_*zza

2014 05-20

1
score

1
answer

1228
views

BeautifulSoup behaves differently on an Amazon EC2 machine

I am running the following script:

from bs4 import BeautifulSoup
import urllib2
import sys

print sys.version

url = 'https://www.google.com/finance'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

trends_tag = soup.find('div', {'id': 'topmovers'})

tags = trends_tag.find_all('td', 'change chg')
print len(tags)

tag = tags[0]
print 'Tag: ' + tag.text

On my computer, the output is:

2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%

On the EC2 machine, the output is:

2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
12.18B


CLX

The Clorox Co
7.35%
11.67B


THOR …
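
One possible explanation, which the question itself does not confirm, is that the two machines have different parsers installed. When no parser is named, BeautifulSoup picks the best one available (lxml, then html5lib, then Python's built-in html.parser), and these can build different trees from markup with unclosed tags. A minimal Python 3 sketch that pins the parser so every machine uses the same one; the HTML sample is invented and well formed.

```python
from bs4 import BeautifulSoup

# Invented, well-formed sample of one table row
HTML = """
<div id="topmovers"><table><tr>
  <td class="change chg">33.24%</td><td>12.18B</td>
</tr></table></div>
"""

# Naming the parser explicitly removes the dependency on what is installed
soup = BeautifulSoup(HTML, "html.parser")

trends_tag = soup.find("div", {"id": "topmovers"})
tags = trends_tag.find_all("td", "change chg")
first_text = tags[0].get_text(strip=True)

print(len(tags), first_text)
```

Whichever parser is chosen, using the same name on both machines (and installing it on both) should make the results comparable.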

python beautifulsoup amazon-ec2 web-scraping python-2.7

Nit*_*tay

1
score

1
answer

979
views

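Can someone help me understand this code from the BeautifulSoup 3 documentation? In particular, I don't understand the part in square brackets. The code comes from this URL: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html. I don't understand the square brackets because I thought square brackets were used to make lists; does their content create a list? Also, it doesn't seem to assign the list to anything. What is the purpose of using square brackets without assigning them to anything? I also don't understand this component: text=lambda text:isinstance(text, Comment), but I think I may be able to figure that part out myself.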

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>

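OK, so this is a list comprehension, meaning a list is being built? But it isn't used? Why would they do that? Also, when and why would you put anything before the word "for", as they do there? Usually I see "for" at the start, with nothing before it. Also, thanks for the great explanation of the lambda function. I knew it made some kind of mini function, but I'm not very familiar with it yet, and it helps to see how you would rewrite it as a normal function.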
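
To make the two points concrete, here is a small Python 3 sketch written against bs4 rather than the BeautifulSoup 3 module in the question. It replaces the lambda with a named function and the throwaway list comprehension with an ordinary loop; the name is_comment is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

HTML = "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>"


def is_comment(text):
    """Named-function equivalent of: lambda text: isinstance(text, Comment)."""
    return isinstance(text, Comment)


soup = BeautifulSoup(HTML, "html.parser")
comments = soup.find_all(string=is_comment)

# Loop equivalent of: [comment.extract() for comment in comments]
# The comprehension in the documentation is used only for its side effect
# (removing each comment); the list it builds is thrown away.
for comment in comments:
    comment.extract()

result = str(soup)

# The expression before "for" is what goes into the new list:
squares = [n * n for n in (1, 2, 3)]

print(result, squares)
```

The loop form is usually preferred when only the side effect matters, since it does not build a list that nobody reads.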

html python beautifulsoup html-parsing python-3.x

jel*_*ngo

2014 12-11

1
score

1
answer

1186
views

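Getting onclick values with Python

I am using Python and BeautifulSoup to scrape a web page for a small project of mine. The page has multiple entries, each separated by a table row in the HTML. My code partly works, but much of the output is blank: it does not get all the results from the page, and it does not even collect them on the same line.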

<html>
<head>
<title>Sample Website</title>
</head>
<body>

<table>
<td class=channel>Artist</td><td class=channel>Title</td><td class=channel>Date</td><td class=channel>Time</td></tr>
<tr><td>35</td><td>Lorem Ipsum</td><td><a href="#" onClick="searchDB('LoremIpsum','FooWorld')">FooWorld</a></td><td>12/10/2014</td><td>2:53:17 PM</td></tr>
</table>
</body>
</html>

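I only want to extract the values from the onclick action "searchDB"; for example, "LoremIpsum" and "FooWorld" are the only two results I want.

Here is the code I wrote. So far it correctly extracts some of the values, but sometimes the values are empty.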

response = urllib2.urlopen(url)

html = response.read()

soup = bs4.BeautifulSoup(html)

properties = soup.findAll('a', onclick=True)

for eachproperty in properties:
    print re.findall("'([a-zA-Z0-9]*)'", eachproperty['onclick'])

What am I doing wrong?
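
One possible cause, assuming some real entries contain spaces or punctuation, is the character class [a-zA-Z0-9]*: it cannot match such values, so findall returns an empty list for those links. A minimal Python 3 sketch that captures anything between single quotes instead; the second table row is invented to show the difference.

```python
import re

from bs4 import BeautifulSoup

# First row is from the question; the second row is an invented harder case.
HTML = """
<table>
<tr><td>35</td><td>Lorem Ipsum</td>
<td><a href="#" onClick="searchDB('LoremIpsum','FooWorld')">FooWorld</a></td>
<td>12/10/2014</td></tr>
<tr><td>36</td><td>Dolor Sit</td>
<td><a href="#" onClick="searchDB('Dolor Sit','Foo-World 2')">Foo-World 2</a></td>
<td>12/11/2014</td></tr>
</table>
"""

# Capture everything between a pair of single quotes
ARG_PATTERN = re.compile(r"'([^']*)'")

soup = BeautifulSoup(HTML, "html.parser")

# html.parser lowercases attribute names, so onClick is stored as onclick
results = [ARG_PATTERN.findall(a["onclick"])
           for a in soup.find_all("a", onclick=True)]

# The pattern from the question, applied to the harder case
old_result = re.findall("'([a-zA-Z0-9]*)'", "searchDB('Dolor Sit','Foo-World 2')")

print(results, old_result)
```

This pattern still assumes the arguments never contain an escaped single quote; values like that would need a stricter parser.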

python beautifulsoup web-scraping

Fai*_*ony

2015 09-18

1
score

1
answer

10k
views

Python urllib3 login + search

import urllib3
import io
from bs4 import BeautifulSoup
import re
import cookielib

http = urllib3.PoolManager()
url = 'http://www.example.com'
headers = urllib3.util.make_headers(keep_alive=True,user_agent='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')
r = http.urlopen('GET', url, preload_content=False)

# Params that are then passed along with the POST request
params = {
    'login': '/shop//index.php',
    'user': 'username',
    'pw': 'password'
  }
suche = {
    'id' : 'searchfield',
    'name' : 'suche',
    }

# POST request including params (login); the answer is in response.data
response = http.request('POST', url, params, headers)
suche …

python search login beautifulsoup urllib3

ove*_*iew

1
score

1
answer

3695
views

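Getting text that has no tag with BeautifulSoup?

I have been using BeautifulSoup to parse an HTML document, but I seem to have run into a problem. I found some text that I need to extract, but it is plain text: there are no tags or anything around it. I'm not sure whether I need a regex for this, because I don't know if BeautifulSoup can grab the text when it isn't inside any tag.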

<strike style="color: #777777">975</strike> 487 RP<div class="gs-container default-2-col">

I am trying to extract "487".

Thanks!
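
Loose text like this is still a node in the tree, so it can be reached from a neighbouring tag. Here is a minimal Python 3 sketch, assuming the text always directly follows the <strike> element; the closing </div> is added only so the sample is complete.

```python
import re

from bs4 import BeautifulSoup

# Snippet from the question, with a closing </div> added for the example
HTML = ('<strike style="color: #777777">975</strike> 487 RP'
        '<div class="gs-container default-2-col"></div>')

soup = BeautifulSoup(HTML, "html.parser")

# next_sibling of the <strike> tag is the bare text node " 487 RP"
loose_text = soup.find("strike").next_sibling

# A small regex then pulls the number out of that text
match = re.search(r"\d+", loose_text)
price = match.group() if match else None

print(repr(loose_text), price)
```

So both tools are used together: BeautifulSoup to locate the text node, and a regex only to trim it down to the digits.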

html python regex parsing beautifulsoup

cod*_*ane

1
score

1
answer

1809
views

AttributeError: 'ResultSet' object has no attribute 'find_all' (BeautifulSoup)

I don't understand why I am getting this error:

I have a fairly simple function:

def scrape_a(url):
  r = requests.get(url)
  soup = BeautifulSoup(r.content)
  news =  soup.find_all("div", attrs={"class": "news"})
  for links in news:
    link = news.find_all("href")
    return link

This is the structure of the web page I am trying to scrape:

<div class="news">
<a href="www.link.com">
<h2 class="heading">
heading
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>
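
The error message points at the loop body: news is the ResultSet (a list of tags), while find_all only exists on the individual tags, here the loop variable. In addition, href is an attribute rather than a tag name, and the return inside the loop ends the function after the first div. A minimal Python 3 sketch of one way to write it, parsing an HTML string instead of calling requests so it runs offline; the second news block and the name extract_links are invented.

```python
from bs4 import BeautifulSoup

# First block follows the question's structure; the second is invented.
HTML = """
<div class="news">
<a href="www.link.com"><h2 class="heading">heading</h2><p> text </p></a>
</div>
<div class="news">
<a href="www.other.com"><h2 class="heading">other</h2></a>
</div>
"""


def extract_links(html):
    """Collect the href of every <a> inside each div.news block."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for block in soup.find_all("div", attrs={"class": "news"}):
        # Call find_all on the single tag, not on the ResultSet
        for anchor in block.find_all("a", href=True):
            links.append(anchor["href"])
    return links  # return after the loop so every block is visited


links = extract_links(HTML)
print(links)
```

In the original function, the same body would work with r.content from requests.get(url) passed in as the html argument.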

python beautifulsoup web-scraping

Imo*_*Imo

1
score

1
answer

8169
views