I have a complex HTML DOM tree with the following structure:
<table>
  ...
  <tr>
    <td>
      ...
    </td>
    <td>
      <table>
        <tr>
          <td>
            <!-- inner most table -->
            <table>
              ...
            </table>
            <h2>This is hell!</h2>
          <td>
        </tr>
      </table>
    </td>
  </tr>
</table>
I have some logic to find the innermost table. But after finding it, I need to get the next sibling element (the h2). Is there any way to do this?
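A minimal sketch of one way to do this with bs4, where `html` stands in for the markup above and the innermost table is taken to be any table with no table descendants:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# innermost table = a <table> that contains no other <table>
innermost = next(t for t in soup.find_all("table") if t.find("table") is None)
h2 = innermost.find_next_sibling("h2")   # the <h2>This is hell!</h2> element
print(h2.get_text())
In BeautifulSoup 3 the same call is spelled findNextSibling("h2").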
I want to find all the tables in some HTML using BeautifulSoup. Inner tables should also appear inside their outer tables.
I wrote some working code that produces the expected output. However, I don't like this solution, because it destroys the 'soup' object.
Do you know how to do this in a more elegant way?
from BeautifulSoup import BeautifulSoup as bs
input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''
soup = bs(input)
while True:
    t = soup.find("table")
    if t is None:
        break
    print str(t)
    t.decompose()
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
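For comparison, a minimal non-destructive sketch of the same idea, reusing the bs import above: skip the decompose loop and keep only the tables that are not nested inside another table, leaving the soup intact.
soup = bs(input)
for t in soup.findAll("table"):
    if t.findParent("table") is None:   # keep only the outermost tables
        print str(t)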
YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items from HackerNews. I'm trying to write a Python script to access the RSS feed document and then parse out certain information using BeautifulSoup. However, I'm getting some strange behavior when BeautifulSoup tries to retrieve the contents of each item.
Here are some sample lines from the RSS feed:
<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
<comments>http://news.ycombinator.com/item?id=4944322</comments>
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
<comments>http://news.ycombinator.com/item?id=4943361</comments>
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>
Here is the code I wrote (in Python) to access this feed and print out the title, link, and comments of each item:
import sys
import requests
from bs4 import BeautifulSoup
request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text) …
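A common cause of odd behavior with this feed is that HTML parsers treat <link> as a self-closing tag, so its text ends up outside the element and item.link comes back empty. A minimal sketch that parses the document as XML instead (this assumes lxml is installed, which BeautifulSoup's "xml" parser requires):
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text, 'xml')      # XML parser keeps <link> intact
for item in soup.find_all('item'):
    print(item.title.text)
    print(item.link.text)
    print(item.comments.text)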
I have a very large XML file (20GB to be exact, and yes, I need all of it). When I try to load the file, I get this error:
Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "file.py", line 5, in <module>
code = xml.read()
MemoryError
Here is my current code for reading the XML file:
from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)
Now, how can I get rid of this error and keep working on the script? I could try splitting the file into separate files, but since I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do that.
(The XML data is a database dump of a wiki I volunteer on; I'm using it to import data from different time periods, working directly with information from many pages.)
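BeautifulSoup builds the entire tree in memory, so a 20GB file will exhaust RAM this way no matter what. A minimal sketch of a streaming alternative with the standard library's iterparse, assuming the dump is a MediaWiki export whose records are <page> elements:
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse('pages_full.xml', events=('end',)):
    if elem.tag.endswith('page'):        # match <page> regardless of namespace
        # ... pull whatever fields you need out of `elem` here ...
        elem.clear()                     # free the memory held by this element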
I'm having a hard time installing BeautifulSoup on Windows. So far I have:
Downloaded BeautifulSoup to My Downloads.
Unzipped/extracted it in the Downloads folder.
At the command prompt, I ran:
C:<path to python33> "C:path to beautiful soup\setup.py" install
The process produced this message:
running install
running build
running build_py
**error: package directory 'bs4' does not exist.**
However, the BeautifulSoup path referenced above really does contain a bs4 folder. What am I missing?
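One likely explanation: distutils resolves the bs4 package directory relative to the current working directory, not relative to setup.py, so running setup.py from another folder fails with this error. A sketch of the usual fix, assuming the unzipped folder contains setup.py and bs4 side by side (both paths are placeholders):
cd "C:\path to beautiful soup"
"C:\path to python33\python.exe" setup.py install

:: or, if pip is available for this interpreter:
"C:\path to python33\python.exe" -m pip install beautifulsoup4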
I have some simple code like this:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I'm trying to take the list I get from the XML and convert its elements to strings. I want to keep the original tags so I can reuse some of the text, which is why I'm appending them this way. But the list contains more than 6000 observations, so str triggers a recursion error:
"RuntimeError:调用Python对象时超出了最大递归深度"
I've read that you can change the maximum recursion depth, but doing so isn't wise. My next idea was to split the conversion into batches of 500, but I believe there must be a better way to do this. Does anyone have a suggestion?
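For reference, a minimal sketch of the recursion-limit workaround kept as narrow as possible: raise the limit only around the conversion and restore it afterwards (the value 10000 is an assumption; very deeply nested documents may need more):
import sys

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(10000)            # assumed depth; adjust for your document
try:
    paragraphs = [str(x) for x in p]
finally:
    sys.setrecursionlimit(old_limit)    # always restore the original limit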
1/ I'm trying to extract part of a script using Beautiful Soup, but it prints nothing. What's going wrong?
URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)
for script in soup("script"):
script.extract()
list_of_scripts = soup.findAll("script")
print list_of_scripts
2/ The goal is to extract the value of the "transcript" attribute:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "VideoObject",
  "video": {
    "@type": "VideoObject",
    "headline": "Woman who drank restaurant's tainted tea hopes for industry...",
    "caption": "Woman who drank restaurant's tainted tea hopes for industry...",
    "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life. SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING: \"Immediately my …
I'm building a PDF web scraper in Python. Basically, I'm trying to grab all of the lecture notes from one of my courses, which are in PDF format. I want to enter a URL, then fetch the PDFs and save them in a directory on my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about this. None of the questions on StackOverflow seem to help me either.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil
bs = BeautifulSoup
url = input("Enter the URL you want to scrape from: ")
print("")
suffix = ".pdf"
link_list = []
def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)
    #for link in soup.find_all('a'): # Finds all links
    #    if suffix in str(link): # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)
    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del …
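A minimal sketch of one way to finish the commented-out idea above: collect every link whose href ends in .pdf, resolve it against the page URL, and stream each file to disk (the function name and save location are illustrative assumptions, not part of the original code):
import os
import shutil
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin        # Python 3, matching the input()/print() calls above

def download_pdfs(page_url, save_dir="."):
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a", href=True):
        href = urljoin(page_url, link["href"])          # handle relative links
        if href.lower().endswith(".pdf"):
            pdf = requests.get(href, stream=True)
            filename = os.path.join(save_dir, href.split("/")[-1])
            with open(filename, "wb") as out_file:
                shutil.copyfileobj(pdf.raw, out_file)   # stream the PDF to disk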
I'm using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All I get back is:
[]
How can I get a list of all the href links on that page?
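The reason for the empty result is that find_all('href') looks for <href> tags, which don't exist; the hrefs are attributes on <a> tags. A minimal sketch:
for a in soup.find_all('a', href=True):
    print(a['href'])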
I'm new to web scraping, and there seem to be two ways to collect all the HTML data I'm looking for.
option_1 = soup.find_all('div', class_='p')
option_2 = soup.select('div.p')
I see that option_1 returns the class 'bs4.element.ResultSet' while option_2 returns the class 'list'.
I can still iterate over option_1 with a for loop, so what is the difference?
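In practice the two are interchangeable for a loop like this: ResultSet is a thin list subclass, and both containers hold the same Tag objects. A small sketch, assuming soup is an already-parsed document:
# Both calls yield the same kind of Tag objects; only the container type differs.
for tag in soup.find_all('div', class_='p'):
    print(tag.get_text())

for tag in soup.select('div.p'):
    print(tag.get_text())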