Tag: beautifulsoup

Downloading a file with Python (with requests?)

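What I'm trying to do is build a simple crawler to help me download guitar tabs from Ultimate-Guitar. I can give it a band's URL, and it grabs the links to all the tabs listed as "Guitar Pro" tabs.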

A typical link looks like this:

https://tabs.ultimate-guitar.com/a/agalloch/you_were_but_a_ghost_in_my_arms_guitar_pro.htm

What I can do with this link is find the tab_id using the following code:

for tabid in tab.findAll("input", {"type" : "hidden", "name" : "id", "id" : "tab_id"}):
        tabID = tabid.get("value")

What I'm trying to do is use it to build a link to the actual download. This is where I run into trouble. The best link I can build looks like this:

https://tabs.ultimate-guitar.com/tabs/download?id=904610

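Note that the id at the end of that URL is the tab_id I mentioned earlier.

Entering this link in a browser immediately triggers a download. My problem is that I can't find any way to generate a link that depends on the actual file name. The file name should be something like [song name here].gp5. Other acceptable file types might be .gpx, .gp4, and .gp3.

What I want is to get the actual file name so I can save the file properly. A download named after something like the ID doesn't help me, because that is a useless file name, and I obviously need the proper extension. Is there a way to take the link above and properly initiate the download, or am I out of luck? I'm sure there is a way to do what I need; I just don't have enough experience with this sort of thing. I know nothing about requests and the like, so maybe it's possible to request the contents of this URL and get the download in return?

Note: If getting the actual file name and extension is too difficult, I do have ideas for a workaround, but I obviously need at least the proper extension.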
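
One approach that may apply here: when a server starts a download, it often names the file in the `Content-Disposition` response header. The sketch below only parses such a header value with the standard library. Whether Ultimate-Guitar actually sends this header is an assumption, and `filename_from_header` and the sample value are hypothetical, for illustration only.

```python
from email.message import Message


def filename_from_header(content_disposition):
    """Return the filename named in a Content-Disposition value, or None."""
    if not content_disposition:
        return None
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    return msg.get_filename()


# Hypothetical header value; a real one would come from the response headers.
print(filename_from_header('attachment; filename="some_song.gp5"'))
```

With requests, the value would be `requests.get(url).headers.get("Content-Disposition")`, and the response's `content` bytes could then be written to a file of that name opened with `open(name, "wb")`.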

python beautifulsoup web-crawler python-requests

Sho*_*269

lucky-day

0
votes

1
answers

5255
views

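Scraping Chinese characters with Python

I learned how to scrape websites from https://automatetheboringstuff.com. I want to scrape http://www.piaotian.net/html/3/3028/1473227.html, whose content is in Chinese, and write it to a .txt file. However, the .txt file contains random symbols, which I think is an encoding/decoding problem.

I read the thread "How to decode and encode web page with python?" and figured that the encodings of my site are "gb2312" and "windows-1252". I tried decoding with both encodings but failed.

Can someone explain the problem with my code? I'm very new to programming, so please also point out anything I've misunderstood!

Also, when I remove "html.parser" from the code, the .txt file turns out empty instead of at least containing symbols. Why is that?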

import bs4, requests, sys

reload(sys)
sys.setdefaultencoding("utf-8")

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()

novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")

content = novelSoup.select("br")

novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))
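
A possible cause, stated as an assumption about this site: the page bytes are in a GB-family encoding and get decoded with the wrong codec. The sketch below decodes raw bytes explicitly. `decode_page` is a hypothetical helper, and the sample bytes stand in for what `novel.content` would hold.

```python
def decode_page(raw_bytes, encoding="gb18030"):
    """Decode page bytes; gb18030 is a superset of gbk and gb2312."""
    return raw_bytes.decode(encoding, errors="replace")


# Stand-in for the bytes a GB-encoded page would deliver.
raw = "小说".encode("gb2312")

garbled = raw.decode("latin-1")  # wrong codec: unreadable symbols
text = decode_page(raw)          # right codec: the original characters
```

With requests, setting `novel.encoding = "gb18030"` before reading `novel.text` has the same effect. On Python 3, opening the output with `open("novel.txt", "w", encoding="utf-8")` keeps the characters intact when writing.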

encoding beautifulsoup decoding web-scraping python-2.7

Author

2017 05-23

0
votes

1
answers

2519
views

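Get all attributes of all tags in a page [Beautiful Soup]

I want to use Beautiful Soup to get all the attributes of every tag in an HTML page into an array.

For example, given this HTML page, I want all the tag attribute values in an array of strings: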

<div att0="content1">
<a href="link1">link data</a>
</div>

The result would be: [content1, link1]
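
A minimal sketch of the idea, using the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages; `AttributeCollector` is a hypothetical name. With BeautifulSoup, the analogous loop would walk `soup.find_all(True)` and read each tag's `attrs` dictionary.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect every attribute value of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        self.values.extend(value for _, value in attrs if value is not None)


html = '<div att0="content1">\n<a href="link1">link data</a>\n</div>'
collector = AttributeCollector()
collector.feed(html)
print(collector.values)  # ['content1', 'link1']
```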

python beautifulsoup

rez*_*eza

2017 01-11

0
votes

1
answers

7865
views

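Using a regular expression to match an attribute value parsed with beautifulsoup

Hello from the land of the clumsy,

I'm trying to parse a forum; more specifically, the names of the threads.

The forum engine (vBulletin) serves these threads like this: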

<a href="http://www.example.com/showthread.php?t=555555" id="thread_title_555555">NAME OF THE TITLE</a>

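Using Python and beautifulsoup, I've been successful in other areas. However, I can't manage to match the "id" attribute with a regular expression. I need these lines of the parser to find every "a" element whose id ends in a six-digit number and get its text.

Something like this: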

for elements in soup.findAll("a"):
    if re.match("thread_title_", element['id']) is not None:
        print element.text

Or in pseudo-Python:

for elements in soup.finAll("a", {"id": "thread_title_".*}):
    print element.text

I've tried dozens of variations, to no avail. What can I do?

Thanks in advance
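
One way to express the intended filter, sketched with the standard library's `html.parser` rather than BeautifulSoup; `ThreadTitleParser` is a hypothetical name. BeautifulSoup itself accepts a compiled pattern as an attribute filter, as in `soup.find_all("a", id=re.compile(...))`.

```python
import re
from html.parser import HTMLParser

THREAD_ID = re.compile(r"thread_title_\d{6}")


class ThreadTitleParser(HTMLParser):
    """Collect the text of <a> elements whose id is thread_title_ plus six digits."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # Anchors without an id yield "" here instead of raising an error.
        if tag == "a" and THREAD_ID.fullmatch(dict(attrs).get("id") or ""):
            self._inside = True
            self.titles.append("")

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


parser = ThreadTitleParser()
parser.feed('<a href="http://www.example.com/showthread.php?t=555555" '
            'id="thread_title_555555">NAME OF THE TITLE</a><a href="/faq">FAQ</a>')
print(parser.titles)  # ['NAME OF THE TITLE']
```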

python regex beautifulsoup

Jua*_*lla

2017 06-06

0
votes

1
answers

3958
views

How to loop through a list of URLs for web scraping with BeautifulSoup

Does anyone know how to scrape a list of URLs from the same website with BeautifulSoup? list = ['url1', 'url2', 'url3'...]

================================================== ========================

My code for extracting the list of URLs:

url = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=2'
url1 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=3'
url2 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=4'

r  = requests.get(url)
r1  = requests.get(url1)
r2  = requests.get(url2)

data = r.text
soup = BeautifulSoup(data, 'lxml')
links = []

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

data1 = r1.text

soup = BeautifulSoup(data1, 'lxml')

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

data2 = r2.text

soup = BeautifulSoup(data2, 'lxml')

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

new = ['http://www.hkjc.com/chinese/racing/']*1123

url_list = …
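
The repeated request-and-parse steps above can be folded into a loop over a list of URLs. A minimal sketch of the list handling only, with the fetching left out: `absolutize` and the sample href are hypothetical, and `urljoin` takes the place of the manual prefix list.

```python
from urllib.parse import urljoin

BASE = "http://www.hkjc.com/chinese/racing/"

# The three listing pages differ only in the ordertype parameter.
page_urls = [f"{BASE}selecthorsebychar.asp?ordertype={n}" for n in (2, 3, 4)]


def absolutize(base, hrefs):
    """Turn relative hrefs into absolute URLs, skipping missing ones."""
    return [urljoin(base, href) for href in hrefs if href]


# Hypothetical relative link, standing in for values from link.get('href').
print(absolutize(BASE, ["horse.asp?horseno=A001", None]))
```

Each entry of `page_urls` could then go through one `requests.get` call and one `soup.find_all` call inside a single `for` loop, appending to the same `links` list.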

python beautifulsoup

yeu*_*ase

2017 07-01

0
votes

1
answers

10k
views

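BeautifulSoup TypeError

I'm running into a problem with BeautifulSoup (more specifically the xml parser), where using "name" as a tag attribute seems to collide with some underlying function.

Given the following code: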

#!/usr/bin/env python3

from bs4 import BeautifulSoup

siteconfig="""
<?xml version="1.0" encoding="utf-8"?>
<sites version="180201">
  <site name="au" location="oceana">
    <addresslist="IPv4">
      <address>192.168.1.10/32</address>
      <address>192.168.2.10/32</address>
    </addresslist>
    <addresslist="IPv6">
      <address>fc00:07bc:5ae6:75d0::26/128</address>
      <address>fc00:07bc:5ae6:75d1::26/128</address>
    </addresslist>
  </site>
  <site name="us" location="americas">
    <addresslist="IPv4">
      <address>192.168.4.13/32</address>
      <address>192.168.5.13/32</address>
    </addresslist>
    <addresslist="IPv6">
      <address>fc00:07bc:5ae6:75d0::45/128</address>
      <address>fc00:07bc:5ae6:75d1::45/128</address>
    </addresslist>
  </site>
</sites>
"""
soup = BeautifulSoup(siteconfig,"xml")
print(soup.find("site", name="us"))

I get the following error:

Traceback (most recent call last):
  File "./siteConfig.py", line 33, in <module>
    print(soup.find("site", name="us"))
TypeError: find() got multiple values for argument 'name'

However, if I change the last line to:

print(soup.find("site", location="americas"))

I get the following output:

<site location="americas" name="us">
  <addresslist>="IPv4"&gt;
    <address>192.168.4.13/32</address>
    <address>192.168.5.13/32</address> …
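
The error message suggests a plain Python argument collision: the first positional parameter of `find` is itself called `name`, so `name="us"` supplies it a second time. The stand-in function below is hypothetical and only mimics that parameter layout; it is not BeautifulSoup's code.

```python
def find(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in with the same leading parameters as soup.find."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters


try:
    find("site", name="us")  # "site" and "us" both target the parameter `name`
except TypeError as exc:
    print(exc)

# Passing the attribute inside a dict avoids the clash.
print(find("site", attrs={"name": "us"}))  # ('site', {'name': 'us'})
```

With BeautifulSoup, the corresponding call would be `soup.find("site", attrs={"name": "us"})` or `soup.find("site", {"name": "us"})`.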

python xml beautifulsoup

Ben*_*ale

lucky-day

0
votes

1
answers

564
views

Scraping a table from the web with Python

I'm trying to get the whole table (all 1000+ universities) from this website - https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores.

To do this, I use the following libraries, requests and BeautifulSoup, and my code is:

import requests
from bs4 import BeautifulSoup

html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
soup = BeautifulSoup(html_content.text, 'lxml')

Then I look for a table:

table = soup.find_all('table')[0]

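But in the result I can't see the table itself: <tbody>, the rows <tr>, and the columns <td>.

HTML code:

Please help me get all the information from this site and build a dataframe from it.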
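
One thing worth checking, offered as an assumption rather than a diagnosis of this site: everything after `#` in a URL is a fragment, which stays in the browser and is not sent to the server, so the paging and sorting in this link are presumably applied by JavaScript after the page loads. The sketch just splits the URL with the standard library.

```python
from urllib.parse import urlsplit

url = ("https://www.timeshighereducation.com/world-university-rankings/2018/"
       "world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats")

parts = urlsplit(url)
print(parts.path)      # the part the server actually receives
print(parts.fragment)  # handled client-side only
```

If the rows are filled in by a script, they will not appear in the HTML that `requests.get` returns; the browser's network tab may show where the data is requested from.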

python parsing beautifulsoup html-parsing web-scraping

Author

2018 05-04

0
votes

1
answers

676
views

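How to download images from a website using Beautiful Soup?

I want to save images from a website. Is that possible in Python using the Beautiful Soup library? Do we need the Pillow library? Or can we convert the images to numpy arrays and process them with OpenCV?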
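
Saving an image does not by itself need Pillow or OpenCV: the downloaded bytes can be written to disk unchanged. A minimal sketch, assuming the bytes have already been fetched (for example from `requests.get(src).content`); `save_image` is a hypothetical helper.

```python
import os
from urllib.parse import urlsplit


def save_image(data, src_url, folder="."):
    """Write raw image bytes to folder, named after the last URL path segment."""
    name = os.path.basename(urlsplit(src_url).path) or "image"
    path = os.path.join(folder, name)
    with open(path, "wb") as handle:  # binary mode keeps the bytes intact
        handle.write(data)
    return path
```

The `src` values themselves would come from the `img` tags, e.g. `soup.find_all("img")` and `tag.get("src")`. Pillow or OpenCV only become relevant for processing the pixels afterwards.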

python image beautifulsoup web-scraping

Sha*_*sad

lucky-day

0
votes

1
answers

5546
views

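Scraping data from a Wikipedia table

I'm just trying to scrape the data from a Wikipedia table into a pandas dataframe.

I need to reproduce three columns: "PostalCode, Borough, Neighbourhood".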

import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())

My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

links = My_table.findAll('a')
links

Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))

print (Neighbourhood)

import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)

df

It only returns the boroughs...

Thanks
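
The code above collects only the `title` of each link, which would explain getting a single column. A sketch of reading the table row by row instead, using the standard library's `html.parser` in place of BeautifulSoup; `TableRows` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


# Hypothetical sample markup in the shape of a three-column table.
parser = TableRows()
parser.feed("<table><tr><th>PostalCode</th><th>Borough</th><th>Neighbourhood</th></tr>"
            "<tr><td>A1A</td><td>Some Borough</td><td><a>Some Place</a></td></tr></table>")
rows = [row for row in parser.rows if row]  # header row has no <td> cells
print(rows)  # [['A1A', 'Some Borough', 'Some Place']]
```

`pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])` would then build the three-column frame.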

python wikipedia beautifulsoup pandas

Inf*_*evo

2019 02-27

0
votes

1
answers

4142
views

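BeautifulSoup doesn't return the tags it should (empty result)

I'm trying to scrape some data from a website using BeautifulSoup in Python, but it doesn't return the values it should. Below is my code.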

import requests
from bs4 import BeautifulSoup

url = 'https://finance.naver.com/item/sise.nhn?code=005930'

# send a HTTP request to the URL of the webpage I want to access
r = requests.get(url)

data = r.text

# making the soup
soup = BeautifulSoup(data, 'html.parser')

print(soup.find('iframe', attrs={'title': '?? ??'}))

It returns:

<iframe bottommargin="0" frameborder="0" height="360" marginheight="0" name="day" scrolling="no" src="/item/sise_day.nhn?code=005930" title="?? ??" topmargin="0" width="100%"></iframe>

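The printed result contains no HTML tags inside the iframe. However, if I look at the developer tools for the page, they clearly show that there are many tags inside the "iframe" tag.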
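
An `iframe` embeds a separate document, so its inner tags are not part of the outer page's HTML; they live at the address in its `src` attribute. A minimal sketch of resolving that address; fetching and parsing it would be a second request.

```python
from urllib.parse import urljoin

page_url = "https://finance.naver.com/item/sise.nhn?code=005930"
iframe_src = "/item/sise_day.nhn?code=005930"  # the src printed above

frame_url = urljoin(page_url, iframe_src)
print(frame_url)  # https://finance.naver.com/item/sise_day.nhn?code=005930
```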