I've created a script in Python to fetch the titles of different posts from a website, and it captures them flawlessly.
What I'd like the script to do now, however, is remember the results of the previous scrape, so that running it twice doesn't fetch the same results. To be clearer: on its first execution the script will parse the results as usual, but on subsequent executions it should not fetch the same results again, only new posts when they appear.
Using csv:
import csv
import requests
from bs4 import BeautifulSoup

def get_posts(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select(".summary .question-hyperlink"):
        yield item.text

if __name__ == '__main__':
    link = '/sf/ask/tagged/web-scraping/'
    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for item in get_posts(link):
            writer.writerow([item])
            print(item)
Using a database:
import mysql.connector
from bs4 import BeautifulSoup
import requests

url = "/sf/ask/tagged/web-scraping/"

def connect():
    mydb = mysql.connector.connect(
        host="localhost",
        user="root",
        passwd="",
        database="mydatabase"
    )
    return mydb

def create_table(link):
    conn = connect()
    mycursor = conn.cursor()
    mycursor.execute("DROP TABLE if exists webdata")
    mycursor.execute("CREATE TABLE if not exists webdata (name VARCHAR(255))")
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for items in soup.select(".summary"):
        name = items.select_one(".question-hyperlink").get_text(strip=True)
        mycursor.execute("INSERT INTO webdata (name) VALUES (%s)", (name,))
    conn.commit()

def fetch_data():
    conn = connect()
    mycursor = conn.cursor()
    mycursor.execute("SELECT * FROM webdata")
    for item in mycursor.fetchall():
        print(item)

if __name__ == '__main__':
    create_table(url)
    fetch_data()
The scripts above parse the same results every time they run.
How can I make my script remember the results of the previous scrape, so that it doesn't fetch the same results again on subsequent executions?
You need a list of unique IDs for the posts, and StackOverflow already provides one in <div class="question-summary" id="some unique id">. You can extract that value with:
def get_posts(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select(".question-summary"):
        yield item['id'], item.findChild('a', {'class': 'question-hyperlink'}).text
This returns the unique ID together with the title of each question.
Now you need to compare each ID against those already stored in the csv file, and skip the row if its ID is already there. Here is working code for that (opening the file in append mode so rows from earlier runs are preserved, and tolerating a missing file on the first run):
if __name__ == '__main__':
    link = '/sf/ask/tagged/web-scraping/'
    try:
        with open('./output.csv', 'r', newline="") as f:
            ids = [row[0] for row in csv.reader(f)]  # extract the first column of each row into a list
    except FileNotFoundError:
        ids = []  # first run: no previous results yet
    with open('./output.csv', 'a', newline="") as f:  # append, so rows (and ids) from earlier runs are kept
        writer = csv.writer(f)
        for id, title in get_posts(link):
            if id not in ids:  # if the id isn't already in the list of ids, write the row
                writer.writerow([id, title])
It's worth noting that this is not the optimal solution. You'd be better off using a database such as sqlite or mysql, with a unique index on the id column for each post. That way duplicate posts are handled automatically by the database, and you don't have to pull the entire csv file into memory (twice) on every scrape.
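For the sqlite route, a minimal self-contained sketch might look like this (the database file name posts.db and the reuse of the webdata table name are just illustrative):

import sqlite3
import requests
from bs4 import BeautifulSoup

def get_posts(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select(".question-summary"):
        yield item['id'], item.findChild('a', {'class': 'question-hyperlink'}).text

conn = sqlite3.connect("posts.db")  # illustrative file name
# question_id as PRIMARY KEY makes the database reject duplicates on its own
conn.execute("CREATE TABLE IF NOT EXISTS webdata (question_id TEXT PRIMARY KEY, question_title TEXT)")
# INSERT OR IGNORE silently skips rows whose question_id already exists
conn.executemany("INSERT OR IGNORE INTO webdata VALUES (?, ?)", get_posts('/sf/ask/tagged/web-scraping/'))
conn.commit()
conn.close()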
With mysql, the table definition:
sql = '''
CREATE TABLE `webdata` (
    id INT AUTO_INCREMENT PRIMARY KEY,
    question_id CHAR(30) NOT NULL,
    question_title CHAR(75) NOT NULL,
    UNIQUE KEY(question_id)
)
'''
mycursor.execute(sql)
Bulk-inserting the scraped data:
def get_posts(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    results = []
    for item in soup.select(".question-summary"):
        question_id = item['id']
        question_title = item.findChild('a', {'class': 'question-hyperlink'}).text
        results.append((question_id, question_title))
    return results

sql = 'INSERT IGNORE INTO `webdata` (question_id, question_title) VALUES (%s, %s)'
mycursor.executemany(sql, get_posts(url))
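To tie the mysql pieces together, here is a sketch of a complete run (scrape_and_store is a hypothetical wrapper name; it reuses the connect() helper from the question and assumes the webdata table above has already been created):

def scrape_and_store(url):
    conn = connect()  # the connect() helper from the question
    mycursor = conn.cursor()
    sql = 'INSERT IGNORE INTO `webdata` (question_id, question_title) VALUES (%s, %s)'
    mycursor.executemany(sql, get_posts(url))
    conn.commit()  # mysql.connector does not autocommit by default
    conn.close()

if __name__ == '__main__':
    scrape_and_store("/sf/ask/tagged/web-scraping/")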