
Apostrophe prints as â\x80\x99

import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class' :['ledeStory' , 'story']}
title_tags = ['h2','h3','h4','h5','h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')


stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 =[]

for x in range(len(stories)):

    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])

        if hold is not None:
            hold2 = hold.find('a')

            if hold2 is not None:
                # Note: str.strip('a') removes leading/trailing 'a' characters, not <a> tags
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))

    hold …
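
For context, the `â\x80\x99` in the title is the usual signature of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The sketch below reproduces that with a plain string and no network access. It assumes the page is served as UTF-8 and that `source_url.text` was decoded with a Latin-1 fallback, which the post does not confirm. The variable names are illustrative only.

```python
# Assumption: the headline contains U+2019 (the "curly" apostrophe),
# sent by the server as UTF-8 bytes.
apostrophe = "\u2019"

# UTF-8 encodes U+2019 as three bytes: e2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# Decoding those bytes with the wrong codec yields one character per byte,
# which is the garbled text from the title.
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong decode recovers the original character.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(repr(garbled))
print(repaired == apostrophe)
```

If that assumption holds for the script above, two common remedies are to set `source_url.encoding = 'utf-8'` before reading `source_url.text`, or to pass the raw bytes in `source_url.content` to `BeautifulSoup` so the parser detects the encoding itself.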


python string byte utf-8

mur*_*aby

lucky-day

5
Score

1
Answers

2498
Views