Tag: beautifulsoup

Finding a string in Python with BeautifulSoup

I need to extract "/html/path" from strings like this one:

generic/html/path/generic/generic/generic

I only need the "path" part, which always comes right after "html/". Is there a way to search for "html/" and take the string up to the next "/"?
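
A minimal sketch of one way to do this, assuming the input is a plain string rather than parsed HTML, in which case BeautifulSoup is not needed at all. The function name `segment_after_html` is hypothetical and used only for illustration.

```python
import re

def segment_after_html(path):
    """Return the path segment that follows 'html/', or None if there is none."""
    # (?:^|/) ties 'html' to a whole path component, so 'xhtml/' does not match.
    match = re.search(r"(?:^|/)html/([^/]+)", path)
    return match.group(1) if match else None

print(segment_after_html("generic/html/path/generic/generic/generic"))  # path
```

The capture group `[^/]+` stops at the next slash, which matches the "until the next /" requirement in the question.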

python beautifulsoup

Mic*_*ael

2012-11-13

0
votes

1
answer

143
views

Cannot import `beautifulSoup` in Python 2.7 with selenium

I am trying to import beautifulSoup but I get an error. Can you tell me why this happens, or guide me toward a fix?

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\Arup Rakshit>python
'python' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\Arup Rakshit>ipython
'ipython' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\Arup Rakshit>cd..

C:\Users>cd..

C:\>cd Python27

C:\Python27>cd C:\Python27\selenv\Scripts

C:\Python27\selenv\Scripts>my_selenium_script.py
hello

C:\Python27\selenv\Scripts>python
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" …

python selenium beautifulsoup

Cod*_*ver

2012-12-27

0
votes

1
answer

7514
views

BeautifulSoup: parsing and writing data to a text file

from bs4 import BeautifulSoup


soup = BeautifulSoup(open("youtube.htm"))

for link in soup.find_all('img'):
    print  link.get('src')



file = open("parseddata.txt", "wb")
file.write(link.get('src')+"\n")
file.flush()

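Hello, I want to try BeautifulSoup and parse some YouTube pages. The loop above prints 25 lines, but when I look at the file, only the last one has been written (just a small part of the output). I tried different open modes and calling file.close(), but nothing worked. Does anyone have an idea?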
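
In the snippet as posted, `file.write(...)` sits after the loop, so it runs once with whatever `link` held last. A minimal sketch of moving the write inside the loop follows. `SAMPLE_HTML` is a hypothetical stand-in for youtube.htm, the helper name `write_image_sources` is made up for illustration, and the explicit `"html.parser"` argument is an assumption about which parser to use.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of youtube.htm.
SAMPLE_HTML = """
<html><body>
<img src="a.jpg"><img src="b.jpg"><img alt="no source">
</body></html>
"""

def write_image_sources(html, out_path):
    """Write every img src found in html to out_path, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    # Open the file once; the with block flushes and closes it on exit.
    with open(out_path, "w") as out:
        for img in soup.find_all("img"):
            src = img.get("src")
            if src is None:  # skip <img> tags without a src attribute
                continue
            out.write(src + "\n")  # the write happens inside the loop
            sources.append(src)
    return sources

# Demo in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parseddata.txt")
    written = write_image_sources(SAMPLE_HTML, path)
    with open(path) as f:
        lines = f.read().splitlines()
    print(lines)
```

Skipping tags where `get('src')` returns None also avoids a TypeError from concatenating None with a string.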

python io file beautifulsoup

Jon*_*hon

2014-09-28

0
votes

1
answer

10k
views

BeautifulSoup and a div value from a website using Firefox

After some changes, I got this:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://example.com"))

soup.find("div", {"id": "botloc"})

elem = soup.find('div')
print elem['id'], 'is the id'
print elem.text, 'is the value'

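So I finally have working code (with help from the forum), but the value it returns is wrong, because it is the value Google Chrome would get! Any idea how to get the div value as Firefox sees it? (I am on a server with the Firefox browser.) I will upvote every hint.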
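
One thing visible in the snippet itself: the result of `soup.find("div", {"id": "botloc"})` is discarded, and the later `soup.find('div')` returns the first div on the page instead. Whether that explains the unexpected value is only an assumption, since the page is not shown. A minimal sketch that keeps the matched element, using hypothetical markup in `SAMPLE_HTML`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page behind the question is not shown.
SAMPLE_HTML = '<div id="header">menu</div><div id="botloc">42</div>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep the element that matched the id instead of searching again.
elem = soup.find("div", {"id": "botloc"})
if elem is not None:
    print(elem["id"], "is the id")
    print(elem.get_text(), "is the value")
```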

python beautifulsoup python-2.7

Mez*_*ith

2013-05-16

0
votes

1
answer

118
views

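Beautiful Soup 4 find_all does not find links that Beautiful Soup 3 finds

I noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here is a reproducible example of the problem: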

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))


Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not small.

Here are the exact versions of the modules, in case anyone wonders:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
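
Beautiful Soup 4 hands the markup to an underlying parser and picks one automatically when none is named, so a plausible (but unconfirmed) cause of the gap is how that parser recovers from malformed HTML. A minimal sketch for comparing parsers on the same input follows. `BROKEN_HTML` is hypothetical, `count_links` is a made-up helper, and `lxml` and `html5lib` are only tried if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately malformed markup.
BROKEN_HTML = '<p><a href="/one">one<p><a href="/two">two</a></table><a href="/three">three'

def count_links(html):
    """Count <a> tags per parser, skipping parsers that are not installed."""
    counts = {}
    for parser in ("html.parser", "lxml", "html5lib"):
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:  # raised when the requested parser is missing
            continue
        counts[parser] = len(soup.find_all("a"))
    return counts

counts = count_links(BROKEN_HTML)
for parser, n in counts.items():
    print(parser, n)
```

If the counts differ between parsers on the real page, naming the parser explicitly in the `BeautifulSoup(...)` call makes the result reproducible across machines.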

python beautifulsoup web-scraping web

hal*_*ngs

lucky-day

0
votes

1
answer

2886
views

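How do I get the text after a link in Python with BeautifulSoup?

I know how to find all the links, but I want to get the text that comes immediately after each link.

For example, in the given HTML: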

<p><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a> [TX-22]
 - 11/9/1999

... (this repeats many times)

I want to extract the [CA-28] - 11/9/1999 that goes with <a href=... >Rep Dreier, David</a>

and do this for every link in the list.
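
A minimal sketch using `next_sibling`, which returns the node that directly follows a tag. Here that node is the text between `</a>` and the next `<br/>`. The shortened `href` values in `SAMPLE_HTML` and the helper name `text_after_links` are hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the markup in the question; the hrefs are placeholders.
SAMPLE_HTML = """<p><a href="#armey">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="#davis">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
</p>"""

def text_after_links(html):
    """Return (link text, text right after the link) for every <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.find_all("a"):
        sibling = a.next_sibling
        if isinstance(sibling, NavigableString):
            # Collapse the line break and extra spaces into single spaces.
            trailing = " ".join(sibling.split())
        else:
            trailing = ""  # the link is followed by a tag or by nothing
        pairs.append((a.get_text(), trailing))
    return pairs

pairs = text_after_links(SAMPLE_HTML)
for name, trailing in pairs:
    print(name, "->", trailing)
```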

python beautifulsoup

and*_*voy

lucky-day

0
votes

1
answer

806
views

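Python: another 'NoneType' object has no attribute error

As a beginner exercise, I am trying to find the meta tag in an HTML file and extract the generator, so I did this: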

Version = soup.find("meta", {"name":"generator"})['content']


But I get this error:

TypeError: 'NoneType' object has no attribute '__getitem__'


I thought handling the exception would fix it, so I wrote:

try: Version = soup.find("meta", {"name":"generator"})['content']

except NameError,TypeError:

     print "Not found"


And I still get the same error.

So what should I do?
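
Two things stand out in the snippets above. First, `find()` returns None when nothing matches, and subscripting None raises the TypeError. Second, in Python 2 `except NameError, TypeError:` catches only NameError and binds the exception to the name `TypeError`. Catching both types needs a tuple: `except (NameError, TypeError):`. A minimal sketch that avoids the exception by checking the result first follows. The sample documents and the helper name `generator_of` are hypothetical.

```python
from bs4 import BeautifulSoup

def generator_of(html):
    """Return the content of <meta name="generator">, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "generator"})
    if tag is None:  # find() returns None when nothing matches
        return None
    return tag.get("content")  # .get() returns None if the attribute is absent

# Hypothetical sample documents.
with_meta = '<html><head><meta name="generator" content="WordPress 3.7"></head></html>'
without_meta = "<html><head><title>No generator</title></head></html>"

print(generator_of(with_meta))
print(generator_of(without_meta))
```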

python error-handling beautifulsoup

4m1*_*4j1

2013-11-07

0
votes

1
answer

7210
views

Adding http://xxx.xx/ to a string with Python

How can I prepend "http://test.url/" to the result of link.get('href') below, but only if it does not already contain "http"?

import urllib2
from bs4 import BeautifulSoup

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
  print link.get('href')
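
A minimal sketch using the standard library's `urljoin`, which leaves absolute URLs untouched and resolves relative ones against a base. The import below is the Python 3 location; in Python 2 the same function lives in the `urlparse` module. The sample hrefs and the helper name `absolutize` are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://test.url/"  # placeholder base taken from the question

def absolutize(href, base=BASE_URL):
    """Return href unchanged if it is already absolute, else resolve it against base."""
    return urljoin(base, href)

# Hypothetical href values, as link.get('href') might return them.
for href in ["/c/Sydney+168", "page.html", "http://example.com/x"]:
    print(absolutize(href))
```

Compared with a manual `startswith("http")` check, this also handles leading slashes without producing a double "//" in the result.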

python string beautifulsoup

Oss*_*ama

2013-12-07

0
votes

1
answer

1206
views

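How does BeautifulSoup .hyperlinks work?

I am currently analyzing someone else's code, and I am trying to figure out what the BeautifulSoup.hyperlinks variable is supposed to contain. Does anyone know of documentation for it? I could not find anything on the official site. The problem is that when I print soup.hyperlinks, the code below gives 'None':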

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="intermezzo">this is a link: http://www.link.nl/
<a href="http://www.link.nl" title="link title" target="link target" class="link class">link label</a>
</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc)

print soup.hyperlinks

I hope someone can help me.
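
In Beautiful Soup, attribute access such as `soup.title` is shorthand for `soup.find("title")`, so `soup.hyperlinks` looks for a `<hyperlinks>` tag and returns None when the document has none. What the other person's code intended is not known. The sketch below assumes the goal was to collect link targets and uses a shortened, hypothetical version of the document.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the document in the question.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="anchor-only">no href</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# No <hyperlinks> tag exists, so this attribute lookup yields None.
print(soup.hyperlinks)

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```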

python beautifulsoup

Jel*_*ema

lucky-day

0
votes

1
answer

74
views

Python: how to resolve a TypeError

 import urllib, urllib2
 from bs4 import BeautifulSoup, Comment
 url='http://www.amazon.in/product-reviews/B00EJBA7HC/ref=cm_cr_pr_top_link_1?ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending'
 content = urllib2.urlopen(url).read()
 soup = BeautifulSoup(content, "html.parser")
 fooId = soup.find('input',name='ASIN',type='hidden') #Find the proper tag
 value = fooId['value']
 print value

I need this code to print the product's ASIN ID from the given URL.

Instead, I get the following error:

TypeError: find() got multiple values for keyword argument 'name'

Please help.
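
The first parameter of `find()` is itself called `name` (the tag name), so passing `name='ASIN'` as a keyword collides with the positional `'input'`. A minimal sketch that filters on the HTML `name` attribute through the `attrs` dictionary instead follows. `SAMPLE_HTML` is hypothetical, since the real page structure is not shown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not shown in the question.
SAMPLE_HTML = '<form><input type="hidden" name="ASIN" value="B00EJBA7HC"></form>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs= carries attribute filters whose names clash with find()'s own parameters.
asin_input = soup.find("input", attrs={"name": "ASIN", "type": "hidden"})
asin = asin_input["value"] if asin_input is not None else None
print(asin)
```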

python screen-scraping beautifulsoup python-2.7

kes*_*106

2014 01-29

0
votes

1
answer

660
views

Tag statistics

beautifulsoup ×10

python ×10

python-2.7 ×2

error-handling ×1

file ×1

io ×1

screen-scraping ×1

selenium ×1

string ×1

web ×1

web-scraping ×1