HTTP Error 999: Request denied

Dee*_*yan 5 python mechanize beautifulsoup linkedin web-scraping

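I am trying to scrape some web pages from LinkedIn using BeautifulSoup, but I keep getting the error "HTTP Error 999: Request denied". Is there a way to avoid this error? As you can see in my code, I have tried both Mechanize and urllib2, and both give me the same error.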

from __future__ import unicode_literals
import codecs
import urllib
import urlparse

import mechanize
from bs4 import BeautifulSoup

# Opened here but not used in this snippet
fout5 = codecs.open('data.csv', 'r', encoding='utf-8', errors='replace')

base_url = "https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=1"

for page in range(2, 10):
    # Replace the page_num query parameter with the current page number
    url_parts = list(urlparse.urlparse(base_url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update({'page_num': page})
    url_parts[4] = urllib.urlencode(query)
    page_url = urlparse.urlunparse(url_parts)
    #print page_url

    # Also tried urllib2.urlopen(page_url) and URLGrabber().urlread(page_url)

    op = mechanize.Browser()  # use mechanize's browser
    op.set_handle_robots(False)  # do not fetch or obey robots.txt
    response = op.open(page_url)
    #print op.title()

    # Parse the downloaded HTML, not the URL string
    soup1 = BeautifulSoup(response.read(), 'html.parser')
    print soup1

Mat*_*DMo 5

You should use the LinkedIn REST API, either directly or through python-linkedin. It gives you direct access to the data, instead of trying to scrape a JavaScript-heavy website.

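  • The problem is that you need administrator rights on the company page to retrieve public company information. That is pretty silly. (7 upvotes)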

f43*_*d65 4

Try setting the User-Agent header. Add this line after op.set_handle_robots(False):

op.addheaders = [('User-Agent', "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")]
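For readers on Python 3, where urllib2 and urlparse no longer exist, the same idea can be sketched with the standard library alone. This is only a rough illustration: the helper name build_page_request and the USER_AGENT constant are made up for this example, and it only builds the request object without sending it. Whether a browser-like User-Agent is enough to avoid the 999 response is not guaranteed.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse
from urllib.request import Request

# Assumption: any browser-like string could be used here; this one is
# copied from the mechanize line above.
USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")


def build_page_request(base_url, page_num, user_agent=USER_AGENT):
    """Hypothetical helper: return a Request for one results page.

    The page_num query parameter is replaced and a User-Agent header
    is attached. Nothing is sent over the network.
    """
    parts = list(urlparse(base_url))
    query = dict(parse_qsl(parts[4]))      # parts[4] is the query string
    query["page_num"] = str(page_num)
    parts[4] = urlencode(query)
    return Request(urlunparse(parts), headers={"User-Agent": user_agent})


if __name__ == "__main__":
    base = ("https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/"
            "?sort=relevance&page_num=1")
    req = build_page_request(base, 2)
    print(req.full_url)
    # urllib stores header names capitalized as "User-agent"
    print(req.get_header("User-agent"))
```

To actually fetch the page you would pass the returned object to urllib.request.urlopen and feed the response body to BeautifulSoup, as in the question.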

Edit: If you want to scrape a website, first check whether it has an API, or a library that wraps its API.