I need to extract "/html/path" from strings like this one:
generic/html/path/generic/generic/generic
I only need the "path" part, which always comes right after "html/". Is there a way to search for "html/" and take the string up to the next "/"?
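One possible approach, as a minimal sketch in Python 3: split on the marker with `str.partition` and keep everything up to the next slash. The helper name `segment_after` is made up for illustration, and the sketch assumes the first occurrence of "html/" is the one you want.

```python
def segment_after(text, marker="html/"):
    """Return the path segment that directly follows `marker`, or None."""
    # partition() splits on the first occurrence of the marker only.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # Keep everything up to (not including) the next slash.
    return rest.split("/", 1)[0]


print(segment_after("generic/html/path/generic/generic/generic"))
```

Note that a plain substring search would also match inside a longer segment such as "xhtml/", so a stricter check may be needed if that can occur in your data.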
I'm trying to import BeautifulSoup but I get an error. Can you tell me why this happens, or guide me toward fixing it?
Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\Arup Rakshit>python
'python' is not recognized as an internal or external command,
operable program or batch file.
C:\Users\Arup Rakshit>ipython
'ipython' is not recognized as an internal or external command,
operable program or batch file.
C:\Users\Arup Rakshit>cd..
C:\Users>cd..
C:\>cd Python27
C:\Python27>cd C:\Python27\selenv\Scripts
C:\Python27\selenv\Scripts>my_selenium_script.py
hello
C:\Python27\selenv\Scripts>python
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" …

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("youtube.htm"))
for link in soup.find_all('img'):
    print link.get('src')
    file = open("parseddata.txt", "wb")
    file.write(link.get('src')+"\n")
    file.flush()
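Hi, I wanted to try BeautifulSoup and parse some YouTube pages. The loop prints 25 lines, one per link. But when I look at the file, only the last one has been written (a small part of the output). I tried different open modes and calling file.close(), but nothing worked. Does anyone have an idea?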
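A likely cause is that the file is reopened in write mode on every pass, which truncates whatever was written before. A minimal Python 3 sketch of opening it once and writing inside the loop follows; the `srcs` list and the temporary directory are placeholders standing in for the `link.get('src')` values and the real output path.

```python
import os
import tempfile


def write_lines(path, items):
    """Write each item on its own line, opening the file only once."""
    with open(path, "w") as out:
        for item in items:
            out.write(item + "\n")


# Placeholder values standing in for the link.get('src') results.
srcs = ["a.png", "b.png", "c.png"]
target = os.path.join(tempfile.mkdtemp(), "parseddata.txt")
write_lines(target, srcs)

with open(target) as check:
    print(check.read())
```

The `with` block also closes the file on exit, so no explicit flush() or close() is needed.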
After some changes, I ended up with:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("http://example.com"))
soup.find("div", {"id": "botloc"})
elem = soup.find('div')
print elem['id'], 'is the id'
print elem.text, 'is the value'
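So in the end I wrote working code (with help from the forum), but the value it returns is wrong, because it gets it from Google Chrome! Any idea how to get the div value as Firefox sees it? (I'm viewing the server in a Firefox browser.) I'll upvote every hint.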
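I noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).
Here is a reproducible example of the issue: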
import requests
import bs4
import BeautifulSoup
r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)
print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))
Output:
With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701
As you can see, the difference is not small.
Here are the exact module versions, in case anyone is wondering:
In [20]: bs4.__version__
Out[20]: '4.2.1'
In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
I know how to find all the links, but I want to get the text that comes immediately after each link.
For example, in the given HTML:
<p><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a> [TX-26]
- 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a> [VA-11]
- 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a> [TX-22]
- 11/9/1999
... (this repeats many times)
I want to extract the "[CA-28] - 11/9/1999" part associated with <a href=... >Rep Dreier, David</a>
and do this for every link in the list.
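One way this might be done, sketched in Python 3 with bs4: each `<a>` tag's `next_sibling` is the text node that follows it. The sample below is shortened and its href values are placeholders, not the real query strings.

```python
from bs4 import BeautifulSoup

# Shortened sample; the href values here are placeholders.
html = """<p><a href="/a">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="/b">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
</p>"""

soup = BeautifulSoup(html, "html.parser")
pairs = []
for link in soup.find_all("a"):
    trailing = link.next_sibling  # node right after </a>, if any
    # Only text nodes are strings; a following tag or None yields "".
    if isinstance(trailing, str):
        text = " ".join(trailing.split())  # collapse newlines and spaces
    else:
        text = ""
    pairs.append((link.get_text(), text))

for name, text in pairs:
    print(name, "->", text)
```

If the trailing text can be split across several nodes, walking `link.next_siblings` until the next tag would be a more robust variant.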
As a beginner exercise, I'm trying to find the meta tag in an HTML file and extract the generator, so I did this:
Version = soup.find("meta", {"name":"generator"})['content']
But I get this error:
TypeError: 'NoneType' object has no attribute '__getitem__'
I thought using an exception would fix it, so I wrote:
try:
    Version = soup.find("meta", {"name":"generator"})['content']
except NameError,TypeError:
    print "Not found"
And I still get the same error.
So what should I do?
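In Python 2, `except NameError,TypeError:` catches only NameError and binds the exception to a variable named TypeError; catching both needs a tuple, `except (NameError, TypeError):`. An alternative that avoids exceptions altogether is to check for None before indexing. A minimal Python 3 sketch follows; the helper name `generator_of` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def generator_of(html):
    """Return the generator meta content, or None if the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", {"name": "generator"})
    if meta is None:  # find() returns None when nothing matches
        return None
    return meta.get("content")


# Illustrative documents, one with the tag and one without.
with_tag = '<html><head><meta name="generator" content="WordPress 3.5"></head></html>'
without_tag = "<html><head><title>x</title></head></html>"

print(generator_of(with_tag))
print(generator_of(without_tag) or "Not found")
```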
How can I prepend "http://test.url/" to the link.get('href') results below, but only when the value does not already contain "http"?
import urllib2
from bs4 import BeautifulSoup
url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
    print link.get('href')
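A possible sketch in Python 3 using the standard library's `urljoin` (in Python 2.7 it lives in the `urlparse` module instead). The helper name `absolutize` is hypothetical, and the check is on the prefix rather than on "http" appearing anywhere in the value, which is an assumption about what is wanted.

```python
from urllib.parse import urljoin

BASE = "http://test.url/"


def absolutize(href, base=BASE):
    """Prefix relative hrefs with `base`; leave absolute http(s) URLs alone."""
    if href is None:  # link.get('href') returns None for <a> without href
        return None
    if href.startswith(("http://", "https://")):
        return href
    return urljoin(base, href)


for href in ["/c/Sydney+168", "about.html", "http://example.com/x", None]:
    print(absolutize(href))
```

`urljoin` takes care of the slash between base and path, so "/c/Sydney+168" does not end up with a double slash.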
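I'm currently analyzing someone else's code, and I'm trying to figure out what the BeautifulSoup.hyperlinks variable is supposed to contain. Does anyone know of documentation for it? I couldn't find anything on the official site. The problem is that when I print soup.hyperlinks, the code below gives 'None':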
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="intermezzo">this is a link: http://www.link.nl/
<a href="http://www.link.nl" title="link title" target="link target" class="link class">link label</a>
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc)
print soup.hyperlinks
I hope someone can help me.
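For what it's worth, attribute access on a soup object is shorthand for `find()` on a tag of that name, so `soup.hyperlinks` looks for a `<hyperlinks>` tag and returns None when there is none. A minimal Python 3 sketch with a shortened version of the document above; whether the other person's code meant something else by `hyperlinks` is not something this can confirm.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access is shorthand for find(): this looks for a <hyperlinks> tag.
print(soup.hyperlinks)

# Collecting the actual links goes through find_all('a').
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)
```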
import urllib, urllib2
from bs4 import BeautifulSoup, Comment
url='http://www.amazon.in/product-reviews/B00EJBA7HC/ref=cm_cr_pr_top_link_1?ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
fooId = soup.find('input',name='ASIN',type='hidden') #Find the proper tag
value = fooId['value']
print value
I need this code to print the product's ASIN ID from the given URL.
Instead, I get the following error:
TypeError: find() got multiple values for keyword argument 'name'
Please help.
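The error comes from `name` being the first parameter of `find()` (the tag name), so passing `name='ASIN'` as a keyword collides with the positional 'input'. Attributes that are literally called "name" can go through the `attrs` dict instead. A minimal Python 3 sketch; the markup is hypothetical, with the ASIN value taken from the URL in the question rather than from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ASIN value is copied from the URL in the question.
html = '<form><input type="hidden" name="ASIN" value="B00EJBA7HC"/></form>'
soup = BeautifulSoup(html, "html.parser")

# `name` is find()'s first positional parameter (the tag name), so the HTML
# attribute called "name" has to be passed through the attrs dict.
field = soup.find("input", attrs={"name": "ASIN", "type": "hidden"})
value = field["value"] if field is not None else None
print(value)
```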