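Tag: beautifulsoup

Beautiful Soup: get the data from the first table in a div

I am trying to get the data from the first table inside a div that contains many tables. How can I do that?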

<div class="mainCont port_hold ">
    <table class="tblpor">
        <tr><th>Company</th><th>Code</th></tr>
        <tr><td>ABC    </td><td>1234</td></tr>
        <tr><td>XYZ    </td><td>6789</td></tr>
    </table>

    <table class="tblpor MT25">
        <tr><th>Company</th><th>Industry</th></tr>
        <tr><td>ABCDEF </td><td>aaaaa   </td></tr>
        <tr><td>STUVWX </td><td>bbbbb   </td></tr>
    </table>
</div>


I need the data from the table with class="tblpor". Below is the code I wrote, but it gives me the data from every table in the div.

for x in soup2.find('table', class_='tblpor'):
    for y in soup2.findAll('tr'):
        for z in soup2.findAll('td'):
            print(z.text)


Please help.

Regards, babsdoc
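A minimal sketch of one way to restrict the search to the first table, assuming the markup shown in the question. The inner loops in the attempt above call `soup2.findAll(...)`, which searches the whole document again; searching from the table returned by `find` keeps the results inside that table. The variable names here are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="mainCont port_hold ">
    <table class="tblpor">
        <tr><th>Company</th><th>Code</th></tr>
        <tr><td>ABC    </td><td>1234</td></tr>
        <tr><td>XYZ    </td><td>6789</td></tr>
    </table>
    <table class="tblpor MT25">
        <tr><th>Company</th><th>Industry</th></tr>
        <tr><td>ABCDEF </td><td>aaaaa   </td></tr>
    </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, so this is the first table in the div.
first_table = soup.find("div", class_="mainCont").find("table", class_="tblpor")

rows = []
for tr in first_table.find_all("tr"):      # search inside the table, not the whole soup
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:                              # the header row has only <th> cells
        rows.append(cells)

print(rows)
```

Note that `class_='tblpor'` also matches the second table (its class list contains `tblpor`), which is why `find` rather than `find_all` matters here.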

python beautifulsoup

bab*_*doc

lucky-day

Score: 1 | Answers: 1 | Views: 1983

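Extracting multiple classes into a Pandas DataFrame with Beautiful Soup

I would like to obtain the following pandas DataFrame:

Here is everything I have tried. I did try selecting by class, but that returned everything rather than the individual pieces I am looking for. I am new to bs4.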

html_doc = """
<div class="schoolinfo" data-attr-lat="33.7527" data-attr-lon="-84.3867" id="1396">
      <div class="schoolheader">
       <h3 class="schoolname">
        Georgia State University
       </h3>
      </div>
      <div class="schooldetails">
       <div class="schoollocation">
        <div class="citystate">
         Atlanta, Georgia
        </div>
       </div>
       <div class="programs">
        <div class="schoolprogram">
         <h4>
          <a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/" target="_blank">
           Ph.D. in Computer Science - Bioinformatics Concentration
          </a>
         </h4>
         <div class="cost-curric">
          <a class="btn btn-sm btn-default detailbutton" href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-admission-requirements/" target="_blank">
           HOW TO APPLY
          </a>
          <a class="btn btn-sm btn-default detailbutton" href="https://catalog.gsu.edu/graduate20152016/computer-science/" target="_blank">
           CURRICULUM
          </a>
          <a class="btn btn-sm btn-default detailbutton" href="http://sfs.gsu.edu/tuition-fees/what-it-costs/tuition-and-fees/" target="_blank">
           COST
          </a>
         </div> …
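The desired DataFrame is not shown above, so the following is only a sketch under assumptions: one row per `schoolprogram`, with hypothetical column names `school`, `location`, `program`, and `url`. It uses a shortened copy of the HTML from the question.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="schoolinfo" id="1396">
  <div class="schoolheader"><h3 class="schoolname"> Georgia State University </h3></div>
  <div class="schooldetails">
    <div class="schoollocation"><div class="citystate"> Atlanta, Georgia </div></div>
    <div class="programs">
      <div class="schoolprogram">
        <h4><a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/">
          Ph.D. in Computer Science - Bioinformatics Concentration
        </a></h4>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for school in soup.select("div.schoolinfo"):
    # Look each field up relative to the current school, not the whole document.
    name = school.select_one("h3.schoolname").get_text(strip=True)
    location = school.select_one("div.citystate").get_text(strip=True)
    for program in school.select("div.schoolprogram"):
        link = program.select_one("h4 a")
        rows.append({
            "school": name,
            "location": location,
            "program": link.get_text(strip=True),
            "url": link.get("href"),
        })

print(rows)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` to get one column per key.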


python beautifulsoup python-3.x

Author

2017 08-14

Score: 1 | Answers: 1 | Views: 915

Scraping from PythonAnywhere

I have a free account on PythonAnywhere, from which I am trying to run the following script, which works fine locally.

I am wondering whether the error I get has a technical cause, or whether PythonAnywhere simply forbids scraping certain websites from its platform.

Do you know of other free hosting sites where I would be allowed to scrape anything?

import requests
from bs4 import BeautifulSoup as bs

def scrapMarketwatch(address):
    #creating formatting data from scrapdata
    r …


beautifulsoup web-scraping pythonanywhere

use*_*529

lucky-day

Score: 1 | Answers: 1 | Views: 1369

Getting the next item in a Python for loop

I have an HTML object that contains a list. If 'X' is found, I want to print X and the next item in the list:

for string in tr[30].strings:
    if string == 'X':
        print(string)
        print(string.next())

I get this error:

TypeError: 'NavigableString' object is not callable
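The error most likely arises because `.next` on a Beautiful Soup string is an attribute that returns another element, not a method, so `string.next()` ends up calling a `NavigableString`. One way around it, sketched below, is to pair each item with its successor using `zip`. The helper name `items_after` is hypothetical, and a plain list stands in for `tr[30].strings`.

```python
def items_after(strings, marker):
    """Return (marker, next_item) pairs for every occurrence of marker."""
    strings = list(strings)  # materialise generators such as tag.strings
    return [
        (current, following)
        for current, following in zip(strings, strings[1:])
        if current == marker
    ]


# Stand-in data; in the question this would be tr[30].strings
# (or tr[30].stripped_strings to ignore surrounding whitespace).
sample = ["A", "X", "B", "X", "C"]

pairs = items_after(sample, "X")
for current, following in pairs:
    print(current)
    print(following)
```

A marker in the last position has no successor and is skipped, which avoids an IndexError.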

python beautifulsoup python-3.x

Lev*_*Lev

lucky-day

Score: 1 | Answers: 1 | Views: 9267

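Pagination with BeautifulSoup

I am trying to get some data from the following website: https://www.drugbank.ca/drugs

For each drug in the table, I need to drill down to get its name and some other specific features, such as categories and structured indications (click a drug name to see the features I will use).

I wrote the following code, but the problem is that I cannot get it to handle pagination (as you can see, there are more than 2000 pages!).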

import requests
from bs4 import BeautifulSoup


def drug_data():
    # Only fetches the first listing page; pagination is not handled here.
    url = 'https://www.drugbank.ca/drugs/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        pages_data(href)


def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")
    g_data = soup.select('div.content-container')

    for item in g_data:
        print(item.contents[1].text)
        print(item.contents[3].find_all('td')[1].text)
        try:
            print(item.contents[5].find_all('td', {'class': 'col-md-2 col-sm-4'})[0].text)
        except Exception:
            # Skip items that lack this cell.
            pass
        print(item_url)


drug_data()

How can I scrape all the data and handle pagination correctly?
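A common pattern for this kind of listing is to request page 1, 2, 3, and so on until a page comes back with no items. The sketch below shows only that loop. The fetcher is injected so the example runs offline: `fetch_page`, `FAKE_SITE`, and the link values are hypothetical stand-ins, and how the real site numbers its pages (for example a `page` query parameter) is an assumption that would need to be checked in the browser.

```python
from typing import Callable, Iterator, List


def iter_all_items(
    fetch_page: Callable[[int], List[str]],
    first_page: int = 1,
    max_pages: int = 5000,
) -> Iterator[str]:
    """Yield items page by page until a page returns nothing."""
    for page in range(first_page, first_page + max_pages):
        items = fetch_page(page)
        if not items:          # an empty page means we ran past the last one
            break
        yield from items


# Offline stand-in for the real site: page number -> links found on that page.
FAKE_SITE = {1: ["/drugs/A", "/drugs/B"], 2: ["/drugs/C"]}


def fake_fetch(page: int) -> List[str]:
    return FAKE_SITE.get(page, [])


all_links = list(iter_all_items(fake_fetch))
print(all_links)
```

A real `fetch_page` would call `requests.get(url, params={...})` for the given page number, parse the response with BeautifulSoup, and return the links found. The `max_pages` cap guards against looping forever if the site never returns an empty page.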

python pagination beautifulsoup

Liz*_*zou

2021 04-03

Score: 1 | Answers: 1 | Views: 5495

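Getting the href with Beautiful Soup using different methods

I am trying to scrape a website. I learned scraping from two resources: one uses tag.get('href') to get the href from an a tag, and the other uses tag['href'] to get the same thing. As far as I understand, they both do the same thing. But when I try this code: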

link_list = [l.get('href') for l in soup.find_all('a')]

It works with the .get method, but not with dictionary-style access.

link_list = [l['href'] for l in soup.find_all('a')]

This throws a KeyError. I am very new to scraping, so please forgive me if this is a silly question.

Edit: both methods work with find rather than find_all.
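The two forms differ when the attribute is missing: `tag.get('href')` returns `None`, while `tag['href']` raises `KeyError`. A page with at least one `<a>` that has no `href` would therefore break the second list comprehension. With `find`, only the first `<a>` is inspected, which presumably has an `href`. A small illustration on made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<a href="/one">one</a><a name="anchor">no href</a><a href="/two">two</a>'
soup = BeautifulSoup(html, "html.parser")

# .get() tolerates the missing attribute and yields None for that tag.
with_get = [a.get("href") for a in soup.find_all("a")]

# Dictionary-style access raises KeyError on the second tag.
try:
    [a["href"] for a in soup.find_all("a")]
    raised = False
except KeyError:
    raised = True

# Filtering with href=True keeps only tags that actually have the attribute.
with_filter = [a["href"] for a in soup.find_all("a", href=True)]

print(with_get, raised, with_filter)
```

Passing `href=True` to `find_all` makes dictionary-style access safe, because every returned tag is guaranteed to carry the attribute.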

python beautifulsoup keyerror

Author

2017 12-16

Score: 1 | Answers: 1 | Views: 8474

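Can I submit a form using requests.post?

I am trying to get a list of stores from this site: http://www.health.state.mn.us/divs/cfh/wic/wicstores/

I want the list of stores that is generated when you click the "View All Stores" button. I know I could do this with Selenium or MechanicalSoup or ..., but I would prefer to use requests.

It looks like clicking the button submits a form: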

 <form name="setAllStores" id="setAllStores" action="/divs/cfh/wic/wicstores/index.cfm" method="post" onsubmit="return _CF_checksetAllStores(this)">
<input name="submitAllStores" id="submitAllStores"  type="submit" value="View All Stores" />

But I don't know how to write the request (or whether it is even possible).

What I have tried so far is variations on the following:

SITE = 'http://www.health.state.mn.us/divs/cfh/wic/wicstores/'
data = {'name': 'setAllStores', 'form': 'submitAllStores', 'input': 'submitAllStores'}
r = requests.post(SITE, data)

But this doesn't work. Any help or suggestions are welcome.
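When a browser submits a form, it sends each input's `name` as the key and its `value` as the value, to the URL given in the form's `action`. The sketch below builds that payload from the form shown above. Whether the server also expects cookies or hidden fields is not known from the question, so the session-based `submit` function is an assumption. It is defined but not called, and it imports `requests` locally so the rest of the example runs without network access.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.health.state.mn.us/divs/cfh/wic/wicstores/'
FORM_ACTION = '/divs/cfh/wic/wicstores/index.cfm'   # from the form's action attribute


def build_form_request(page_url, action, fields):
    """Resolve the form action against the page URL and copy the fields."""
    return urljoin(page_url, action), dict(fields)


# Key = the input's name, value = the input's value, as a browser would send.
url, data = build_form_request(
    PAGE_URL, FORM_ACTION, {'submitAllStores': 'View All Stores'}
)
print(url, data)


def submit(url, data):
    """Not called here: performs the real request (needs network access)."""
    import requests  # third-party; imported here so the sketch runs offline

    with requests.Session() as session:
        session.get(PAGE_URL)                 # pick up any cookies first (assumption)
        return session.post(url, data=data)
```

The `onsubmit` JavaScript check in the form runs only in a browser, so requests skips it entirely.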

python beautifulsoup web-scraping python-requests

Tim*_*tty

2018 02-15

Score: 1 | Answers: 1 | Views: 4202

How do I extract the href from an <a> element using lxml CSSSelector?

import lxml.html
from lxml.cssselect import CSSSelector


def extract_page_data(html):
    tree = lxml.html.fromstring(html)
    item_sel = CSSSelector('.my-item')
    text_sel = CSSSelector('.my-text-content')
    time_sel = CSSSelector('.time')
    author_sel = CSSSelector('.author-text')
    a_tag = CSSSelector('.a')

    for item in item_sel(tree):
        yield {'href': a_tag(item)[0].text_content(),
               'my pagetext': text_sel(item)[0].text_content(),
               'time': time_sel(item)[0].text_content().strip(),
               'author': author_sel(item)[0].text_content()}

I want to extract the href, but I cannot extract it with this code.

lxml beautifulsoup python-3.x lxml.html

elr*_*man

2018 02-28

Score: 1 | Answers: 1 | Views: 983

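Requests returns Response 447

I am trying to scrape a website using requests and BeautifulSoup. When I run the code to get the page's tags, the soup object is blank. I printed the response object to see whether the request succeeded, and it did not: the printed result shows Response 447. I cannot find what 447 means as an HTTP status code. Does anyone know how I can successfully connect to the website and scrape it?

Code: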

r = requests.get('https://foobar')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.get_text())

Output:
''

When I print the response object:

print(r)

Output:
<Response [447]>
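447 is not among the status codes defined in Python's `http.HTTPStatus`, so its meaning is specific to the server. The sketch below checks that, and outlines a fetch that fails loudly on any error status rather than parsing an empty body. Sending a browser-like `User-Agent` is a common thing to try when a site rejects scripted clients, but whether it helps here is an assumption. `fetch` is a hypothetical helper, and it is defined but not called.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase, or flag the code as non-standard."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        return "non-standard (server-specific)"


print(describe_status(404))
print(describe_status(447))


def fetch(url):
    """Not called here: needs network access and the requests package."""
    import requests  # third-party; imported here so the sketch runs offline

    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx, including 447
    return response.text
```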


http beautifulsoup request web-scraping python-3.x

Ele*_*rse

lucky-day

Score: 1 | Answers: 1 | Views: 748

How can I use requests or another module to get data from a page whose URL does not change?

I am currently using selenium to go to a page:

https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm

Then I select the relevant options and click the Get Data button.

Then I retrieve the generated table using BeautifulSoup.

Is there a way to use requests in this case? If so, can anyone point me to a tutorial?

python beautifulsoup python-requests

Sid*_*Sid

lucky-day

Score: 1 | Answers: 1 | Views: 560

Tag statistics

beautifulsoup ×10

python ×7

python-3.x ×4

web-scraping ×3

python-requests ×2

http ×1

keyerror ×1

lxml ×1

lxml.html ×1

pagination ×1

pythonanywhere ×1

request ×1
