This is probably a very basic Python question, although I ran into it with Beautiful Soup.
What I basically want to do is extract only the output text from an HTML file.
For example, from the HTML file included below, I want to extract only 0123, abc, def and ghi, but not the tags and attributes.
As far as I understand BS, I should be able to recurse through the descendants of the html tag and include only the content of NavigableStrings.
The problem is that I don't know how to write the if statement that tests the type. See the comment in the Python code below.
Any solutions?
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>0123</title>
</head>
<body>
<div>
<p>abc</p>def
<a href="wxy.z">ghi</a>
</div>
</body>
</html>
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
with open('simple.html', 'r') as inf:
    soup = BeautifulSoup(inf.read(), 'lxml')

for e in soup('html'):
    for d in e.descendants:
        print d  # HERE I WANT TO SKIP EXCEPT …

Trying to solve a problem very similar to this one:
I have the following code:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])
The code runs without errors, but it doesn't print any values to the terminal.
I want to extract the "cell_value1" data mentioned above for the whole page, so that I get something like this:
2468.58
3061.58
376.64
and so on...
My XML file has the same format as the example in the solution to that question. I identified the appropriate attribute tag specific to the attribute I want to scrape. Why aren't the values printed to the terminal?
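For reference, the symptom matches a parser quirk rather than missing data (an assumption, since I can't see the live feed): HTML parsers, lxml's HTML mode included, lower-case every tag name while parsing, and find_all with a plain string is case-sensitive, so find_all("Cell") silently matches nothing. A minimal sketch with a stand-in snippet:

```python
from bs4 import BeautifulSoup

# Stand-in for the WASDE feed; the real file's structure is assumed here.
data = '<Cells><Cell cell_value1="2468.58"/><Cell cell_value1="3061.58"/></Cells>'

soup = BeautifulSoup(data, "html.parser")
print(len(soup.find_all("Cell")))   # 0 -- tag names were lower-cased while parsing

# Either search with the lower-cased name...
print([c["cell_value1"] for c in soup.find_all("cell")])   # ['2468.58', '3061.58']

# ...or parse with BeautifulSoup(data, "xml") (requires lxml), which preserves
# the original case so find_all("Cell") works as written.
```

The same fix applies to the full feed: keep `"lxml"` but search for `"cell"`, or switch the features argument to `"xml"` and keep `"Cell"`.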
I am trying to install beautifulsoup4 on my Mac with the following command:
pip3 install beautifulsoup4
But I get the following error:
Could not find a version that satisfies the requirement beautifulsoup4 (from versions: )
No matching distribution found for beautifulsoup4
How can I fix this?
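A sketch of one common cause (an assumption: an empty "(from versions: )" list on a Mac frequently means pip could not talk to PyPI at all, for example because the Python build's OpenSSL is too old for the TLS 1.2 that PyPI requires):

```python
# Diagnostic sketch: check which Python and OpenSSL pip3 would use.
# PyPI requires TLS 1.2+, which needs OpenSSL 1.0.1 or newer; older
# system Pythons on macOS fail with an empty "(from versions: )" list.
import ssl
import sys

print(sys.version)           # the Python that pip3 runs under
print(ssl.OPENSSL_VERSION)   # should be OpenSSL 1.0.1 or newer for PyPI
```

If the OpenSSL shown is older than 1.0.1, installing a current Python (from python.org or Homebrew) and re-running `pip3 install beautifulsoup4` is a reasonable first step; upgrading pip itself first (`pip3 install --upgrade pip`) is also worth trying.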
The XML data (file.xml) looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<Activity_Logs xsi:schemaLocation="http://www.cisco.com/PowerKEYDVB/Auditing
DailyActivityLog.xsd" To="2018-04-01" From="2018-04-01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://www.cisco.com/PowerKEYDVB/Auditing">
<ActivityRecord>
    <time>2015-09-16T04:13:20Z</time>
    <oper>Create_Product</oper>
    <pkgEid>10</pkgEid>
    <pkgName>BBCWRL</pkgName>
</ActivityRecord>
<ActivityRecord>
    <time>2015-09-16T04:13:20Z</time>
    <oper>Create_Product</oper>
    <pkgEid>18</pkgEid>
    <pkgName>CNNINT</pkgName>
</ActivityRecord>
The parsing of the above XML file and conversion to CSV is done by the following Python code.
import csv
import xml.etree.cElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
data_to_csv = open('output.csv', 'w')
list_head = []
Csv_writer = csv.writer(data_to_csv)
count = 0
for elements in root.findall('ActivityRecord'):
    List_node = []
    if count == 0:
        time = elements.find('time').tag
        list_head.append(time)
        oper = elements.find('oper').tag
        list_head.append(oper)
        pkgEid = elements.find('pkgEid').tag
        list_head.append(pkgEid)
        pkgName = elements.find('pkgName').tag
        list_head.append(pkgName)
        Csv_writer.writerow(list_head) …

I am scraping some data from a dashboard and am stuck trying to get some of the data from several div classes into a Pandas dataframe. How should I go about converting something like this:
[<div class="map-item" data-companyname="Apical Group" data-country="INDONESIA" data-district="Jakarta Utara" data-latitude="-6.099396000" data-longitude="106.951478000" data-millname="AAJ Marunda" data-province="Jakarta" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/AAJ_Marunda.pdf" id="map_item_4645">AAJ Marunda</div>,
<div class="map-item" data-companyname="Apical Group" data-country="INDONESIA" data-district="Lubuk Gaung" data-latitude="1.754005000" data-longitude="101.363532000" data-millname="Sari Dumai Sejati" data-province="Riau" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/Sari_Dumai_Sejati.pdf" id="map_item_4646">Sari Dumai Sejati</div>,
<div class="map-item" data-companyname="Kutai Refinery Nusantara " data-country="INDONESIA" data-district="Balikpapan" data-latitude="-1.179099000" data-longitude="116.788274000" data-millname="Kutai Refinery Nusantara " data-province="Penajam Paser Utara" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/Kutai_Refinery_Nusantara_.pdf" id="map_item_4647">Kutai Refinery Nusantara </div>]
into a dataframe like this:
no companyname country district latitude longitude millname province report
1 Apical Group INDONESIA Jakarta Utara -6.099396 106.951478 AAJ Marunda …

I don't understand why this isn't working.
soup_main = BeautifulSoup('<html><head></head><body><a>FooBar</a></body></html>')
soup_append = BeautifulSoup('<html><head></head><body><a>Meh</a></body></html>')
soup_main.body.append(soup_append.a)
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\bs4\element.py", line 378, in append
    self.insert(len(self.contents), tag)
  File "C:\Python34\lib\site-packages\bs4\element.py", line 312, in insert
    raise ValueError("Cannot insert None into a tag.")
ValueError: Cannot insert None into a tag.
I would be glad if I could understand what is going on.
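My reading of the traceback (an assumption, since I can't see the full session): append moves an element rather than copying it, so after the first successful append the <a> tag no longer exists in soup_append. Running the same line a second time then passes None to insert, which raises exactly this ValueError. A sketch:

```python
from bs4 import BeautifulSoup

soup_main = BeautifulSoup('<html><head></head><body><a>FooBar</a></body></html>', 'html.parser')
soup_append = BeautifulSoup('<html><head></head><body><a>Meh</a></body></html>', 'html.parser')

soup_main.body.append(soup_append.a)   # first call works: the tag is *moved*
print(soup_append.a)                   # None -- it is gone from soup_append
# Running soup_main.body.append(soup_append.a) again would now raise:
#   ValueError: Cannot insert None into a tag.
```

If the source soup needs to stay intact, append a copy instead of the tag itself (for example by re-parsing the fragment, or with copy.copy(tag) in newer bs4 versions), so the move never empties the original tree.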
I am trying to scrape all the HTML elements of a page using requests and beautifulsoup. I am using the ASIN (Amazon Standard Identification Number) to get the product details of the page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output does not show the entire HTML of the page, so I can't go further with the product details. Any help with this?
Edit 1:
From the given answers, it shows that it is the markup of a bot-detection page. I researched it a bit and found two ways to get around it:
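The post trails off here; for reference, one widely used workaround (my assumption, not necessarily one of the two methods referred to above, and no guarantee against Amazon's detection) is to stop announcing yourself as a script: requests sends a python-requests User-Agent by default, which anti-bot layers key on. A sketch that stays offline:

```python
import requests

# requests identifies itself like "python-requests/2.x.y" unless told otherwise.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A session with browser-like headers; the UA string here is only illustrative.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})
# response = session.get("http://www.amazon.com/dp/B004CNH98C", timeout=10)
print(session.headers["User-Agent"])
```

The commented-out session.get shows where the original request would go; for heavily protected pages a headless browser (e.g. Selenium) is the usual fallback.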
I extracted a chunk of HTML with bs4, like this:
<div class="a-section a-spacing-small" id="productDescription">
<!-- show up to 2 reviews by default -->
<p>Satin Smooth Universal Protective Wax Pot Collars by Satin Smooth</p>
</div>
To extract the text, I used:
output.text()
It gave me the output "TypeError: 'str' object is not callable".
When I use output.get_text() and output.getText(), I get the text I want.
What is the difference between these 3? Why do get_text() and getText() give the same output?
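A short demonstration of how the three relate (using a trimmed version of the snippet above, with the comment removed): text is a read-only property that simply calls get_text(), which is why output.text() blows up, since you are calling the returned str, while getText is nothing more than an alias for get_text.

```python
from bs4 import BeautifulSoup

html = ('<div class="a-section a-spacing-small" id="productDescription">'
        '<p>Satin Smooth Universal Protective Wax Pot Collars by Satin Smooth</p>'
        '</div>')
output = BeautifulSoup(html, "html.parser").div

text = output.text                 # property access: returns a str
# output.text()                    # TypeError: 'str' object is not callable
print(output.get_text() == text)   # True -- .text just calls get_text()
print(output.getText() == text)    # True -- getText is an alias of get_text
print(text.strip())
```

So any of `.text`, `.get_text()` or `.getText()` yields the same string; only the parentheses on the property are wrong.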
I created a script in Python 3 using the BeautifulSoup library. What it does is go to the duckduckgo search engine using a URL of the form https://duckduckgo.com/?q=searchterm and then show me all the websites on the first page.
Here is the code, and it runs fine:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://duckduckgo.com/html/?q=test')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs={'class':'result__a'})
i = 0
while i < len(results):
    link = results[i]
    url = link['href']
    print(url)
    i = i + 1
The problem is that I'm not getting the URLs in the proper format (for example: https://www.google.com). Instead, I get all the URLs in the format of a search query.
Here is what I mean when I search for test on duckduckgo:
/l/?kh=-1&uddg=https%3A%2F%2Fduckduckgo.com%2Fy.js%3Fu3%3Dhttps%253A%252F%252Fr.search.yahoo.com%252Fcbclk%252FdWU9MEQwQzVENEZDNDU0NDlEMyZ1dD0xNTM4MzE4MTI3MzE5JnVvPTc3NTg0MzM1OTYxMTUyJmx0PTImZXM9ZVBGTU9iWUdQUy42cVdRVQ%252D%252D%252FRV%253D2%252FRE%253D1538346927%252FRO%253D10%252FRU%253Dhttps%25253a%25252f%25252fwww.bing.com%25252faclick%25253fld%25253dd3peyDLOVSWraifG78tpZ1GjVUCUzCMDkx%252DfJrFXeY2IfiXIwUmngX%252DYKvZWQ6q7hPHC_3kc%252DzBWS1SE015Or2c3CncFMVc9OjVV5OyB2kJqXdRsOzRnaCGy8gYCPuival0gLe7WCkfk_%252DAVKTWmYxranfh02ficTC7i6oC38n2q9U9KPe%252526u%25253dhttps%2525253a%2525252f%2525252fwww.dotdrugconsortium.com%2525252f%2525253futm_source%2525253dbing%25252526utm_medium%2525253dcpc%25252526utm_campaign%2525253dadcenter%25252526utm_term%2525253ddottest%252526rlid%25253d590f68ae34ff126ed0e3331eebd0c4fb%252FRK%253D2%252FRS%253DeKe3rY19jdg9vb_ayBSboMzPU1g%252D%26ad_provider%3Dyhs%26vqd%3D3%2D12729109948094676568590283448597440227%2D122882305188756590950269013545136161936
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fdictionary%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speedtest.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.dictionary.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thefreedictionary.com%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speakeasy.net%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fwww.humanmetrics.com%2Fcgi%2Dwin%2Fjtypes2.asp
/l/?kh=-1&uddg=https%3A%2F%2Fwww.typingtest.com%2F%3Fab
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest_cricket
/l/?kh=-1&uddg=https%3A%2F%2Fged.com%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.xfinity.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2Ffree%2Dpersonality%2Dtest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fthesaurus%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Ftest%2Dipv6.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thesaurus.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.att.com%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.googlefiber.net%2F
/l/?kh=-1&uddg=http%3A%2F%2Ftest.salesforce.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fmy.uscis.gov%2Fprep%2Ftest%2Fcivics
/l/?kh=-1&uddg=https%3A%2F%2Fwww.tests.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wiktionary.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Ftestmy.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.google.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.queendom.com%2Ftests%2Findex.htm
/l/?kh=-1&uddg=http%3A%2F%2Fwww.yourdictionary.com%2Ftest …

I want to use BeautifulSoup4 in PyCharm 2018.3.2. The problem is that "bs4" and "BeautifulSoup"/"BeautifulSoup4" get a red underline in PyCharm when I write:
from bs4 import BeautifulSoup (or BeautifulSoup4)
Nothing else fails to import, only this module. The red underline tells me the same thing for "bs4" and "BeautifulSoup(4)":
"Unresolved reference 'bs4' less... (Ctrl+F1)
Inspection info: This inspection detects names that should resolve but don't. Due to dynamic dispatch and duck typing, this is possible in a limited but useful number of cases. Top-level and class-level items are supported better than instance items"
When I run it in PyCharm, the error says: "ModuleNotFoundError: No module named 'bs4'"
In cmd, pip3 has already installed everything correctly. I had to uninstall and reinstall a few times to be sure. Atm, after installing, it looks like this:
And also this:
pip check bs4
pip check beautifulsoup4
says: "No broken requirements found."
I'm fairly new to Python, but I searched online for answers and found nothing on the "unresolved reference". I thought the path might be the culprit, but the recently installed requests module sits in the same folder, so PyCharm should be able to find bs4 if it can find requests.
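A quick diagnostic worth running (my suggestion, not a confirmed fix): the usual cause of exactly this pair of symptoms, where pip3 reports success in cmd yet PyCharm raises ModuleNotFoundError, is that the PyCharm project uses a different interpreter (often a project virtualenv) than the Python your command-line pip3 installs into. Running this from inside PyCharm shows which interpreter and search path it actually uses:

```python
import sys

print(sys.executable)   # the interpreter PyCharm runs -- compare with `pip3 -V` in cmd
print(sys.path)         # where that interpreter looks for modules such as bs4
```

If the two paths differ, either point File > Settings > Project Interpreter at the interpreter pip3 used, or install beautifulsoup4 into the project interpreter (for example through PyCharm's own package list).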