I have an HTML page with multiple select tags, each containing several dropdown options. I want to parse all the options under each select and store them.
This is what the HTML looks like:
<select name="primary_select">
<option></option>
<option></option>
</select>
<select name="secondary_select">
<option></option>
<option></option>
</select>
This is what my code looks like. I'm using BeautifulSoup and mechanize in Python:
soup = BeautifulSoup(response.get_data())
subject_options = soup.findAll('select', attrs = {'name': 'primary_select'} ).findAll("option")
print subject_options
I get the following error:
AttributeError: 'ResultSet' object has no attribute 'findAll'
Thanks for your help :)
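A likely fix, sketched below with sample option values (the originals were empty): find() returns a single Tag, while findAll() returns a ResultSet of tags, and the ResultSet itself has no findAll() method. Either call find() for one select, or iterate over the ResultSet:

```python
from bs4 import BeautifulSoup

# The markup from the question, with hypothetical option values filled in.
html = """
<select name="primary_select">
<option>one</option>
<option>two</option>
</select>
<select name="secondary_select">
<option>three</option>
<option>four</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None); find_all() returns a ResultSet.
# Calling find_all() on the ResultSet itself raises the AttributeError above,
# so iterate over the selects and call find_all() on each Tag.
options = {}
for select in soup.find_all("select"):
    options[select["name"]] = [opt.get_text(strip=True)
                               for opt in select.find_all("option")]
print(options)
```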
next_page = 'https://research.stlouisfed.org/fred2/tags/series?et=&pageID=1&t='
opened_url = urllib2.urlopen(next_page).read()

soup = BeautifulSoup(opened_url)

hrefs = soup.find_all("div", {"class": "col-xs-12 col-sm-10"})

hrefs now looks like this:
[<div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/GDPC1" style="font-size:1.2em">Real Gross Domestic Product</a>
</div>, <div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/CPIAUCSL" style="font-size:1.2em">Consumer Price Index for All Urban Consumers: All Items</a>
</div>, ...
I tried to get the href out of there with something like hrefs[1]['href'], but I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 'href'

I just want to pull all 18 links on this page. I suppose I could convert each element of hrefs to a string and then use find on it, but that defeats the purpose of bs4.
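A likely cause, sketched here with stand-in markup: the divs themselves carry no href attribute, which is exactly what the KeyError says; the <a> nested inside each div does, so descend to it before indexing:

```python
from bs4 import BeautifulSoup

# A hypothetical snippet standing in for the FRED page markup above.
html = """
<div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/GDPC1">Real Gross Domestic Product</a>
</div>
<div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/CPIAUCSL">Consumer Price Index</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The KeyError occurs because href lives on the <a> inside each div,
# not on the div itself; div.a reaches the first <a> descendant.
links = [div.a["href"]
         for div in soup.find_all("div", {"class": "col-xs-12 col-sm-10"})]
print(links)
```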
I have a sample HTML page here: http://cyberrule.netii.net/1.html. I want to get the first-generation children of the nav. I have tried this:
nav = soup.find( 'nav' )
child_li = nav.findAll("li", { "class" : "dropdown" })
But this only gives me the list items with the dropdown class; the items at the bottom of the list are missing. I want to put all of them into one array so I can process them one by one.
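One possible approach, sketched against a made-up nav (the real structure is at the URL above): filtering on class="dropdown" drops every top-level item without that class, while a plain find_all("li") also descends into nested submenus. Passing recursive=False restricts the search to direct children:

```python
from bs4 import BeautifulSoup

# A hypothetical nav standing in for http://cyberrule.netii.net/1.html.
html = """
<nav>
  <ul>
    <li class="dropdown">Item A<ul><li>Nested</li></ul></li>
    <li>Item B</li>
  </ul>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")
nav = soup.find("nav")

# recursive=False limits find_all to direct children of the top-level <ul>,
# so nested <li> elements are skipped and no class filter is needed.
top_level = nav.find("ul").find_all("li", recursive=False)
names = [li.contents[0].strip() for li in top_level]
print(names)
```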
For this part of the HTML code:
html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""
I want to use BeautifulSoup to find the h2 whose text equals "Content Logical Definition", and then its next siblings. But BeautifulSoup cannot find the h2. Here is my code:
soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings
This is the error:
AttributeError: 'NoneType' object has no attribute 'nextsibilings'
There are several h2 elements in the document, but the only thing that makes this one unique is "Content Logical Definition". Once I have found this h2, I will extract the data from the table and the list below it.
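Two things likely go wrong here, sketched below against a trimmed copy of the snippet: text="..." matches against a tag's .string, which is None when the h2 has multiple children (the span and the link), so find() returns None; and the attribute is spelled next_siblings, not nextsibilings. Matching on the combined text sidesteps the first problem:

```python
from bs4 import BeautifulSoup

# A trimmed version of the html3 snippet above, keeping only what matters.
html3 = """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a class="self-link" href="valueset-investigation">link</a></h2>
<hr/>
<div><p>content</p></div>"""

soup = BeautifulSoup(html3, "html.parser")

# The h2 contains a <span> and an <a>, so its .string is None and
# find("h2", text=...) never matches; test the combined text instead.
h2 = soup.find(lambda tag: tag.name == "h2"
               and "Content Logical Definition" in tag.get_text())

# The correctly spelled attribute is next_siblings (a generator);
# keep only real tags, skipping whitespace strings.
following = [sib.name for sib in h2.next_siblings
             if getattr(sib, "name", None)]
print(following)
```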
I use this function in a script to request a BeautifulSoup object for a web page:
def getSoup(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
    }
    i = 0
    while i == 0:
        print '(%s) (INFO) Connecting to: %s ...' % (getTime(), url)
        data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(data, 'lxml')
        if soup == None:
            print '(%s) (WARN) Received \'None\' BeautifulSoup object, retrying in 5 seconds ...' % getTime()
            time.sleep(5)
        else:
            i = 1
    return soup
This loops until I receive a valid BeautifulSoup object, but I suppose I could also receive an incomplete web page that still yields a valid BeautifulSoup object. I would like to use something like:
if '</hml>' …

I have HTML code as follows:
<div class="_cFb">
<div class="_XWk">Rabindranath Tagore</div>
</div>
I use the following Python code to extract the text content:
soup.find_all('div', attrs={'class':'._XWk'})
This code returns an empty result. However, I can access other class attributes that do not start with an underscore (_). Any ideas on how to extract the tag's text?
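A likely explanation, sketched against the snippet above: the underscore is not the problem. The leading '.' in '._XWk' is CSS-selector syntax, but attrs matches the literal class value, so nothing matches. Drop the dot with attrs, or keep it with select():

```python
from bs4 import BeautifulSoup

# The snippet from the question.
html = """
<div class="_cFb">
<div class="_XWk">Rabindranath Tagore</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# attrs compares against the literal class value, so the class is
# "_XWk" without the dot; '._XWk' matches no element.
divs = soup.find_all("div", attrs={"class": "_XWk"})
print([d.get_text(strip=True) for d in divs])

# With select() the dot *is* part of the selector syntax:
same = soup.select("div._XWk")
```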
I want to scrape all the images shown at the following URL: happiness
I have tried many approaches but can only get 20 images. Below is the Python code:
query = input("happiness")  # you can change the query for the image here
image_type = "ActiOn"
query = query.split()
query = '+'.join(query)
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print(url)
# add the directory for your image here
DIR = "Pictures"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)
if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])
if not os.path.exists(DIR):
    os.mkdir(DIR)
images = [a['src'] for a in soup.find_all("img", {"src": re.compile("gstatic.com")})]
print(images)
print("there are total", len(images), "images")
image_type = "Action"
#print images …

I am trying to run BeautifulSoup to extract links and text from a website (I have been given permission).
I run the following code to get the links and text:
import requests
from bs4 import BeautifulSoup

url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)

soup = BeautifulSoup(r.content)

links = soup.find_all("a")

for link in links:
    if "career" in link.get("href"):
        print "<a href='%s'>%s</a>" % (link.get("href"), link.text)

This gives me the following output:
View Position

</a>
<a href='/career/business-analyst-within-human-capital-management/'>
Business analyst within human capital management
COPENHAGEN • We are looking for an ambitious student with an interest in HR
who is passionate about working in the cross-field of people management,
business and technology

View Position

</a>
<a href='/career/management-consultants-within-strategic-workforce-planning/'>
Management consultants within strategic workforce planning
COPENHAGEN • We are …

I am making a simple script that translates words from English to Russian using requests and BeautifulSoup. The problem is that the result box, where the translated word should be, comes back empty. I am also not sure whether I should use the GET or POST method. This is what I tried:
with open('File.csv', 'r') as file:
    csv_reader = csv.reader(file)
    for line in csv_reader:
        if line[1] == '':
            url = 'https://translate.google.com/#en/ru/{}'.format(line[0])
            r = requests.get(url, timeout=5)
            soup = BeautifulSoup(r.content, 'html.parser')
            translate = soup.find('span', id='result_box')
            for word in translate:
                print(word.find('span', class_=''))
I want to return the "id" values from the variable meta using BeautifulSoup and Python. Is this possible? Furthermore, I don't know how to find the specific 'script' tag containing the meta variable, since it has no unique identifier and the site has many other 'script' tags. I also use Selenium, so I can follow any answer that uses it.
<script>
var meta = "variants":[{"id":12443604615241,"price":14000},
{"id":12443604648009,"price":14000}]
</script>
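One hedged approach, sketched against the snippet above: since the script tag has no id or class, filter all scripts by their text content, then pull the ids out with a regular expression (BeautifulSoup does not parse JavaScript, so regex or a JSON parse of the array is the usual workaround):

```python
import re
from bs4 import BeautifulSoup

# The <script> from the question, embedded in a minimal page.
html = """
<script>
var meta = "variants":[{"id":12443604615241,"price":14000},
{"id":12443604648009,"price":14000}]
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# No id or class distinguishes the tag, so filter scripts by their text.
ids = []
for script in soup.find_all("script"):
    if script.string and "var meta" in script.string:
        # Pull every "id": <digits> value out of the variants list.
        ids = [int(m) for m in re.findall(r'"id":(\d+)', script.string)]
print(ids)
```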
python ×9
web-scraping ×6
html ×2
python-2.7 ×2
html-parsing ×1
image ×1
mechanize ×1
selenium ×1