Ora*_*Box 6 mysql html-parsing
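tl;dr: I'm looking for a way to find database entries that are missing information, fetch that information from a website, and add it to the database entry.
We have a media management program that uses a MySQL table to store information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the media's description (from the source website) and add it to the description field in the media manager. However, this has not been done for thousands of files.
The file name (e.g. file123.mov) is unique, and the file's detail page can be reached by visiting a URL on the source website:
website.com/content/file123
The information we want to grab from that page always has the same element ID.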
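As I see it, the process would be:
- Connect to the database and load the table
- Filter: "format" is "Still Image (JPEG)"
- Filter: "description" is "NULL"
- Get the first result
- Get "FILENAME" (without the extension)
- Load the URL: website.com/content/FILENAME
- Copy the contents of the element "description" (on the website)
- Paste the contents into "description" (SQL entry)
- Get the second result
- Rinse and repeat until the last result is reached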
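The filename-to-URL step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `detail_url` is hypothetical, and the `http://` scheme is a guess, since only `website.com/content/` is given here.

```python
import os.path

# Assumed base URL; only "website.com/content/" is known, so the scheme
# is a guess.
BASE_URL = 'http://website.com/content/'


def detail_url(filename, base_url=BASE_URL):
    """Return the detail-page URL for a media file name (hypothetical helper)."""
    # splitext removes only the last extension: 'file123.mov' -> 'file123'.
    stem, _extension = os.path.splitext(filename)
    return base_url + stem


print(detail_url('file123.mov'))
```

Because `os.path.splitext` strips only the final extension, a name with several dots keeps everything before the last one.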
My question is:
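I'm not aware of any existing package that does everything you're looking for either. However, Python can connect to the database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you'll need three packages, the ones imported at the top of the script: MySQLdb, requests, and BeautifulSoup (bs4).
You can install these packages with the pip command or with a Windows installer. Each project's site has instructions. The whole process shouldn't take more than 10 minutes.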
import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.
con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT statement. MySQLdb uses %s placeholders,
# and a NULL test must be written as IS NULL ("= NULL" never matches).
select = con.cursor()
select.execute('SELECT filename FROM table_name '
               'WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))

while True:
    # Fetch a row from the result of the SELECT statement.
    row = select.fetchone()
    if row is None:
        break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.
    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #
    #   import time    # You can put this at the top with other imports
    #   time.sleep(1)  # This will wait 1 second.
    response = requests.get(url)
    if response.status_code != 200:
        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.
        print('Error accessing: %s SKIPPING' % url)
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # malformed input. Naming the parser avoids a warning in bs4.
    soup = BeautifulSoup(response.content, 'html.parser')
    element = soup.find('div', {'id': 'description'})
    if element is None:
        # The page has no element with that id; leave the row untouched.
        print('No description element at: %s SKIPPING' % url)
        continue
    # get_text() returns the element's text as one string, whereas
    # .contents is a list of child nodes and cannot be stored directly.
    description = element.get_text().strip()

    # And finally, update the database with another query. The cursor
    # comes from the connection (con), not from the db module.
    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s '
                   'WHERE filename = %s',
                   (description, filename))
    # MySQLdb does not autocommit by default, so commit each update.
    con.commit()
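I should warn you that I've made a good-faith effort to make this code "look right," but I haven't actually tested it. You'll need to fill in the private details.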
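Since the script is untested, one part worth checking in isolation is the NULL filter. The sketch below is not the MySQL setup from the question: it uses Python's built-in `sqlite3` with an in-memory table as a stand-in, and the table name `media` and the sample rows are made up for illustration. It shows that `description = NULL` matches nothing while `description IS NULL` finds the rows that still need a description. Note that SQLite takes `?` placeholders, whereas MySQLdb expects `%s`.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL table; the table name "media"
# and the rows are hypothetical.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE media (filename TEXT, format TEXT, description TEXT)')
con.executemany(
    'INSERT INTO media VALUES (?, ?, ?)',
    [
        ('file123.jpg', 'Still Image (JPEG)', None),
        ('file124.jpg', 'Still Image (JPEG)', 'Already described'),
        ('file125.mov', 'Video', None),
    ],
)

fmt = ('Still Image (JPEG)',)

# "= NULL" never evaluates to true in SQL, so this query matches nothing.
wrong = con.execute(
    'SELECT filename FROM media WHERE format = ? AND description = NULL', fmt
).fetchall()

# "IS NULL" is the test that finds the rows with a missing description.
missing = con.execute(
    'SELECT filename FROM media WHERE format = ? AND description IS NULL', fmt
).fetchall()

# Fill in one row with a parameterized UPDATE, then commit the change.
con.execute(
    'UPDATE media SET description = ? WHERE filename = ?',
    ('Example description', 'file123.jpg'),
)
con.commit()

# After the update, no JPEG rows are left without a description.
remaining = con.execute(
    'SELECT filename FROM media WHERE format = ? AND description IS NULL', fmt
).fetchall()

print(wrong, missing, remaining)
```

Running a dry check like this against a copy of the real table before pointing the script at production data is a cheap way to confirm the filter selects the rows you expect.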