This is probably a very basic Python question, although I ran into it with Beautiful Soup.
What I basically want to do is extract only the output text from an HTML file.
For example, from the HTML file included below, I want to extract only 0123, abc, def and ghi, but not the tags and attributes.
As far as I understand BS, I should be able to recurse through the descendants of the html tag and include only the content of NavigableStrings.
The problem is that I don't know how to write the if statement that tests the type. See the comment in the Python code below.
Any solutions?
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>0123</title>
</head>
<body>
<div>
<p>abc</p>def
<a href="wxy.z">ghi</a>
</div>
</body>
</html>
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
with open('simple.html', 'r') as inf:
    soup = BeautifulSoup(inf.read(), 'lxml')

for e in soup('html'):
    for d in e.descendants:
        print d  # HERE I WANT TO SKIP EXCEPT …

Trying to solve a problem very similar to this one:
I have the following code:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])
The code runs without errors, but it doesn't print any values to the terminal.
I want to extract the "cell_value1" data mentioned above for the whole page, so that I get something like this:
2468.58
3061.58
376.64
and so on...
My XML file has the same format as the example in the solution to that question. I identified the appropriate attribute tag specific to the attribute I want to scrape. Why aren't the values printed to the terminal?
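For reference, the symptom matches a parser quirk rather than missing data (an assumption, since I can't see the live feed): HTML parsers, lxml's HTML mode included, lower-case every tag name while parsing, and find_all with a plain string is case-sensitive, so find_all("Cell") silently matches nothing. A minimal sketch with a stand-in snippet:

```python
from bs4 import BeautifulSoup

# Stand-in for the WASDE feed; the real file's structure is assumed here.
data = '<Cells><Cell cell_value1="2468.58"/><Cell cell_value1="3061.58"/></Cells>'

soup = BeautifulSoup(data, "html.parser")
print(len(soup.find_all("Cell")))   # 0 -- tag names were lower-cased while parsing

# Either search with the lower-cased name...
print([c["cell_value1"] for c in soup.find_all("cell")])   # ['2468.58', '3061.58']

# ...or parse with BeautifulSoup(data, "xml") (requires lxml), which preserves
# the original case so find_all("Cell") works as written.
```

The same fix applies to the full feed: keep `"lxml"` but search for `"cell"`, or switch the features argument to `"xml"` and keep `"Cell"`.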
I am trying to install beautifulsoup4 on my Mac with the following command:
pip3 install beautifulsoup4
But I get the following error:
Could not find a version that satisfies the requirement beautifulsoup4 (from versions: )
No matching distribution found for beautifulsoup4
How can I fix this?
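A sketch of one common cause (an assumption: an empty "(from versions: )" list on a Mac frequently means pip could not talk to PyPI at all, for example because the Python build's OpenSSL is too old for the TLS 1.2 that PyPI requires):

```python
# Diagnostic sketch: check which Python and OpenSSL pip3 would use.
# PyPI requires TLS 1.2+, which needs OpenSSL 1.0.1 or newer; older
# system Pythons on macOS fail with an empty "(from versions: )" list.
import ssl
import sys

print(sys.version)           # the Python that pip3 runs under
print(ssl.OPENSSL_VERSION)   # should be OpenSSL 1.0.1 or newer for PyPI
```

If the OpenSSL shown is older than 1.0.1, installing a current Python (from python.org or Homebrew) and re-running `pip3 install beautifulsoup4` is a reasonable first step; upgrading pip itself first (`pip3 install --upgrade pip`) is also worth trying.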
The XML data (file.xml) looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<Activity_Logs xsi:schemaLocation="http://www.cisco.com/PowerKEYDVB/Auditing
DailyActivityLog.xsd" To="2018-04-01" From="2018-04-01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://www.cisco.com/PowerKEYDVB/Auditing">
<ActivityRecord>
    <time>2015-09-16T04:13:20Z</time>
    <oper>Create_Product</oper>
    <pkgEid>10</pkgEid>
    <pkgName>BBCWRL</pkgName>
</ActivityRecord>
<ActivityRecord>
    <time>2015-09-16T04:13:20Z</time>
    <oper>Create_Product</oper>
    <pkgEid>18</pkgEid>
    <pkgName>CNNINT</pkgName>
</ActivityRecord>
The parsing of the above XML file and conversion to CSV is done by the following Python code.
import csv
import xml.etree.cElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
data_to_csv = open('output.csv', 'w')
list_head = []
Csv_writer = csv.writer(data_to_csv)
count = 0
for elements in root.findall('ActivityRecord'):
    List_node = []
    if count == 0:
        time = elements.find('time').tag
        list_head.append(time)
        oper = elements.find('oper').tag
        list_head.append(oper)
        pkgEid = elements.find('pkgEid').tag
        list_head.append(pkgEid)
        pkgName = elements.find('pkgName').tag
        list_head.append(pkgName)
        Csv_writer.writerow(list_head) …

I am scraping some data from a dashboard and am stuck trying to get some of the data from several div classes into a Pandas dataframe. How should I go about converting something like this:
[<div class="map-item" data-companyname="Apical Group" data-country="INDONESIA" data-district="Jakarta Utara" data-latitude="-6.099396000" data-longitude="106.951478000" data-millname="AAJ Marunda" data-province="Jakarta" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/AAJ_Marunda.pdf" id="map_item_4645">AAJ Marunda</div>,
<div class="map-item" data-companyname="Apical Group" data-country="INDONESIA" data-district="Lubuk Gaung" data-latitude="1.754005000" data-longitude="101.363532000" data-millname="Sari Dumai Sejati" data-province="Riau" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/Sari_Dumai_Sejati.pdf" id="map_item_4646">Sari Dumai Sejati</div>,
<div class="map-item" data-companyname="Kutai Refinery Nusantara " data-country="INDONESIA" data-district="Balikpapan" data-latitude="-1.179099000" data-longitude="116.788274000" data-millname="Kutai Refinery Nusantara " data-province="Penajam Paser Utara" data-report="http://naturalhealthytreat.com/sites/neste-daemeter.com/files/Kutai_Refinery_Nusantara_.pdf" id="map_item_4647">Kutai Refinery Nusantara </div>]
into a dataframe like this:
no companyname country district latitude longitude millname province report
1 Apical Group INDONESIA Jakarta Utara -6.099396 106.951478 AAJ Marunda …

I don't understand why this isn't working.
soup_main = BeautifulSoup('<html><head></head><body><a>FooBar</a></body></html>')
soup_append = BeautifulSoup('<html><head></head><body><a>Meh</a></body></html>')
soup_main.body.append(soup_append.a)
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\bs4\element.py", line 378, in append
    self.insert(len(self.contents), tag)
  File "C:\Python34\lib\site-packages\bs4\element.py", line 312, in insert
    raise ValueError("Cannot insert None into a tag.")
ValueError: Cannot insert None into a tag.
I would be glad if I could understand what is going on.
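My reading of the traceback (an assumption, since I can't see the full session): append moves an element rather than copying it, so after the first successful append the <a> tag no longer exists in soup_append. Running the same line a second time then passes None to insert, which raises exactly this ValueError. A sketch:

```python
from bs4 import BeautifulSoup

soup_main = BeautifulSoup('<html><head></head><body><a>FooBar</a></body></html>', 'html.parser')
soup_append = BeautifulSoup('<html><head></head><body><a>Meh</a></body></html>', 'html.parser')

soup_main.body.append(soup_append.a)   # first call works: the tag is *moved*
print(soup_append.a)                   # None -- it is gone from soup_append
# Running soup_main.body.append(soup_append.a) again would now raise:
#   ValueError: Cannot insert None into a tag.
```

If the source soup needs to stay intact, append a copy instead of the tag itself (for example by re-parsing the fragment, or with copy.copy(tag) in newer bs4 versions), so the move never empties the original tree.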
I am trying to scrape all the HTML elements of a page using requests and beautifulsoup. I am using the ASIN (Amazon Standard Identification Number) to get the product details of the page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output does not show the entire HTML of the page, so I can't go further with the product details. Any help with this?
Edit 1:
From the given answers, it shows that it is the markup of a bot-detection page. I researched it a bit and found two ways to get around it:
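The post trails off here; for reference, one widely used workaround (my assumption, not necessarily one of the two methods referred to above, and no guarantee against Amazon's detection) is to stop announcing yourself as a script: requests sends a python-requests User-Agent by default, which anti-bot layers key on. A sketch that stays offline:

```python
import requests

# requests identifies itself like "python-requests/2.x.y" unless told otherwise.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A session with browser-like headers; the UA string here is only illustrative.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})
# response = session.get("http://www.amazon.com/dp/B004CNH98C", timeout=10)
print(session.headers["User-Agent"])
```

The commented-out session.get shows where the original request would go; for heavily protected pages a headless browser (e.g. Selenium) is the usual fallback.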
I extracted a chunk of HTML with bs4, like this:
<div class="a-section a-spacing-small" id="productDescription">
<!-- show up to 2 reviews by default -->
<p>Satin Smooth Universal Protective Wax Pot Collars by Satin Smooth</p>
</div>
To extract the text, I used:
output.text()
It gave me the output "TypeError: 'str' object is not callable".
When I use output.get_text() and output.getText(), I get the text I want.
What is the difference between these 3? Why do get_text() and getText() give the same output?
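A short demonstration of how the three relate (using a trimmed version of the snippet above, with the comment removed): text is a read-only property that simply calls get_text(), which is why output.text() blows up, since you are calling the returned str, while getText is nothing more than an alias for get_text.

```python
from bs4 import BeautifulSoup

html = ('<div class="a-section a-spacing-small" id="productDescription">'
        '<p>Satin Smooth Universal Protective Wax Pot Collars by Satin Smooth</p>'
        '</div>')
output = BeautifulSoup(html, "html.parser").div

text = output.text                 # property access: returns a str
# output.text()                    # TypeError: 'str' object is not callable
print(output.get_text() == text)   # True -- .text just calls get_text()
print(output.getText() == text)    # True -- getText is an alias of get_text
print(text.strip())
```

So any of `.text`, `.get_text()` or `.getText()` yields the same string; only the parentheses on the property are wrong.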
I created a script in Python 3 using the BeautifulSoup library. What it does is go to the duckduckgo search engine using a URL of the form https://duckduckgo.com/?q=searchterm and then show me all the websites on the first page.
Here is the code, and it runs fine:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://duckduckgo.com/html/?q=test')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs={'class':'result__a'})
i = 0
while i < len(results):
    link = results[i]
    url = link['href']
    print(url)
    i = i + 1
The problem is that I'm not getting the URLs in the proper format (for example: https://www.google.com). Instead, I get all the URLs in the format of a search query.
Here is what I mean when I search for test on duckduckgo:
/l/?kh=-1&uddg=https%3A%2F%2Fduckduckgo.com%2Fy.js%3Fu3%3Dhttps%253A%252F%252Fr.search.yahoo.com%252Fcbclk%252FdWU9MEQwQzVENEZDNDU0NDlEMyZ1dD0xNTM4MzE4MTI3MzE5JnVvPTc3NTg0MzM1OTYxMTUyJmx0PTImZXM9ZVBGTU9iWUdQUy42cVdRVQ%252D%252D%252FRV%253D2%252FRE%253D1538346927%252FRO%253D10%252FRU%253Dhttps%25253a%25252f%25252fwww.bing.com%25252faclick%25253fld%25253dd3peyDLOVSWraifG78tpZ1GjVUCUzCMDkx%252DfJrFXeY2IfiXIwUmngX%252DYKvZWQ6q7hPHC_3kc%252DzBWS1SE015Or2c3CncFMVc9OjVV5OyB2kJqXdRsOzRnaCGy8gYCPuival0gLe7WCkfk_%252DAVKTWmYxranfh02ficTC7i6oC38n2q9U9KPe%252526u%25253dhttps%2525253a%2525252f%2525252fwww.dotdrugconsortium.com%2525252f%2525253futm_source%2525253dbing%25252526utm_medium%2525253dcpc%25252526utm_campaign%2525253dadcenter%25252526utm_term%2525253ddottest%252526rlid%25253d590f68ae34ff126ed0e3331eebd0c4fb%252FRK%253D2%252FRS%253DeKe3rY19jdg9vb_ayBSboMzPU1g%252D%26ad_provider%3Dyhs%26vqd%3D3%2D12729109948094676568590283448597440227%2D122882305188756590950269013545136161936
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fdictionary%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speedtest.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.dictionary.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thefreedictionary.com%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speakeasy.net%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fwww.humanmetrics.com%2Fcgi%2Dwin%2Fjtypes2.asp
/l/?kh=-1&uddg=https%3A%2F%2Fwww.typingtest.com%2F%3Fab
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest_cricket
/l/?kh=-1&uddg=https%3A%2F%2Fged.com%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.xfinity.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2Ffree%2Dpersonality%2Dtest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fthesaurus%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Ftest%2Dipv6.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thesaurus.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.att.com%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.googlefiber.net%2F
/l/?kh=-1&uddg=http%3A%2F%2Ftest.salesforce.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fmy.uscis.gov%2Fprep%2Ftest%2Fcivics
/l/?kh=-1&uddg=https%3A%2F%2Fwww.tests.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wiktionary.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Ftestmy.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.google.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.queendom.com%2Ftests%2Findex.htm
/l/?kh=-1&uddg=http%3A%2F%2Fwww.yourdictionary.com%2Ftest …

I want to use BeautifulSoup4 in PyCharm 2018.3.2. The problem is that "bs4" and "BeautifulSoup"/"BeautifulSoup4" get a red underline in PyCharm when I write:
from bs4 import BeautifulSoup (or BeautifulSoup4)
Nothing else fails to import, only this module. The red underline tells me the same thing for "bs4" and "BeautifulSoup(4)":
"Unresolved reference 'bs4' less... (Ctrl+F1)
Inspection info: This inspection detects names that should resolve but don't. Due to dynamic dispatch and duck typing, this is possible in a limited but useful number of cases. Top-level and class-level items are supported better than instance items"
When I run it in PyCharm, the error says: "ModuleNotFoundError: No module named 'bs4'"
In cmd, pip3 has already installed everything correctly. I had to uninstall and reinstall a few times to be sure. Atm, after installing, it looks like this:
And also this:
pip check bs4
pip check beautifulsoup4
says: "No broken requirements found."
I'm fairly new to Python, but I searched online for answers and found nothing on the "unresolved reference". I thought the path might be the culprit, but the recently installed requests module sits in the same folder, so PyCharm should be able to find bs4 if it can find requests.
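A quick diagnostic worth running (my suggestion, not a confirmed fix): the usual cause of exactly this pair of symptoms, where pip3 reports success in cmd yet PyCharm raises ModuleNotFoundError, is that the PyCharm project uses a different interpreter (often a project virtualenv) than the Python your command-line pip3 installs into. Running this from inside PyCharm shows which interpreter and search path it actually uses:

```python
import sys

print(sys.executable)   # the interpreter PyCharm runs -- compare with `pip3 -V` in cmd
print(sys.path)         # where that interpreter looks for modules such as bs4
```

If the two paths differ, either point File > Settings > Project Interpreter at the interpreter pip3 used, or install beautifulsoup4 into the project interpreter (for example through PyCharm's own package list).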