Ibr*_*ter 8 python beautifulsoup nltk html-parsing web-scraping
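I'm fairly new to coding, and I've been trying to scrape Andre 3000's lyrics from Rap Genius, http://genius.com/artists/Andre-3000, using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far: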
    from bs4 import BeautifulSoup
    from urllib2 import urlopen

    artist_url = "http://rapgenius.com/artists/Andre-3000"

    def get_song_links(url):
        html = urlopen(url).read()
        # print html
        soup = BeautifulSoup(html, "lxml")
        container = soup.find("div", "container")
        song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]

        print song_links

    get_song_links(artist_url)
    for link in soup.find_all('a'):
        print(link.get('href'))
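So I need help with the rest of the code. How do I get his lyrics into string format? And then how do I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?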
Here is an example of how to get all of the song links on the page, follow them, and get the lyrics:
    from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

    from bs4 import BeautifulSoup
    import requests

    BASE_URL = "http://genius.com"
    artist_url = "http://genius.com/artists/Andre-3000/"

    # The site answers 403 Forbidden when no User-Agent header is sent
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}

    response = requests.get(artist_url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")

    # Follow every song link on the artist page and extract the lyrics text
    for song_link in soup.select('ul.song_list > li > a'):
        link = urljoin(BASE_URL, song_link['href'])
        song_response = requests.get(link, headers=headers)
        song_soup = BeautifulSoup(song_response.text, "lxml")
        lyrics = song_soup.find('div', class_='lyrics').text.strip()

        # tokenize `lyrics` with nltk
Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 - Forbidden without it.
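For the tokenizing step that the example leaves as a comment, here is a minimal sketch of one way it could look. It assumes lyrics arrive as one string with one lyric line per text line, so it treats each line as a "sentence" rather than relying on punctuation. The function name tokenize_lyrics and the sample text are hypothetical, made up for illustration. It uses NLTK's line_tokenize and wordpunct_tokenize because neither needs extra data files; nltk.sent_tokenize and nltk.word_tokenize are the more common choices for ordinary prose, but they need a downloaded Punkt model.

```python
from nltk.tokenize import line_tokenize, wordpunct_tokenize


def tokenize_lyrics(lyrics):
    """Split lyrics into lines, then split each line into word tokens.

    Hypothetical helper: treats every non-blank line as one "sentence",
    which is an assumption about how the lyrics text is laid out.
    """
    # line_tokenize splits on newlines and discards blank lines by default
    lines = line_tokenize(lyrics)
    # wordpunct_tokenize separates runs of word characters from punctuation
    words_per_line = [wordpunct_tokenize(line) for line in lines]
    return lines, words_per_line


# Made-up sample text standing in for the scraped `lyrics` string
sample = "First line, of the song\n\nSecond line here!"
lines, words_per_line = tokenize_lyrics(sample)
print(lines)
print(words_per_line)
```

Inside the loop above, the same call would be tokenize_lyrics(lyrics) for each song.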