Web scraping rap lyrics on Rap Genius w/ Python

Ibr*_*ter 8 python beautifulsoup nltk html-parsing web-scraping

I'm a bit of a coding newbie, and I've been trying to scrape Andre 3000's lyrics from Rap Genius (http://genius.com/artists/Andre-3000) using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen

# base URL to prepend to the relative song links found on the artist page
BASE_URL = "http://rapgenius.com"
artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html 
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]

    print song_links

get_song_links(artist_url)
for link in soup.find_all('a'):
    print(link.get('href'))

So I need help with the rest of the code. How do I get his lyrics into string format? And then how would I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?

ale*_*cxe 4

Here is an example of how to get all of the song links on the page, follow them, and get the lyrics:

from urlparse import urljoin
from bs4 import BeautifulSoup
import requests


BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

# the site returns "403 - Forbidden" unless a browser-like User-Agent header is sent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}

response = requests.get(artist_url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk

Note that the requests module is used here. Also note that the User-Agent header is required, because the site returns 403 - Forbidden without it.
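
To fill in the "tokenize lyrics with nltk" placeholder, a minimal sketch using NLTK's sent_tokenize and word_tokenize might look like the following, assuming the punkt tokenizer data has been downloaded and lyrics is the string produced inside the loop above:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# one-time download of the Punkt models used by the tokenizers
nltk.download('punkt')

# `lyrics` is the string scraped inside the loop above
sentences = sent_tokenize(lyrics)                            # split the lyrics into sentences
words = [word_tokenize(sentence) for sentence in sentences]  # split each sentence into word tokens

print sentences[:2]
print words[0]

Since lyrics often lack sentence-ending punctuation, sent_tokenize may return only a few long "sentences"; word_tokenize still gives usable word tokens either way.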