Tags: html, python, regex, beautifulsoup, html-parsing
I have a series of strings like "Saturday, December 27th, 2014". I want to toss the "Saturday" and save a file named "141227", i.e. year + month + day. So far everything works, except that I can't get the regular expressions for daypos or yearpos to work. They both give the same error:

```
Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object
```

What is a character buffer object? Does that mean there is something wrong with my expression? Here is my script:
```python
for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)
        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]
        output_files_pathname = 'Data/'  # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename, 'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)
```
I also haven't figured out how to turn December into 12, but that's for another time. Please help me find the right positions.
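For what it's worth, the TypeError happens because str.find() only accepts a string, never a compiled pattern; to get a match position from a regex you need re.search() and its .start(). A minimal sketch, where the sample byline is an assumption about the page's format:

```python
import re

# Hypothetical byline in the shape described in the question:
byline = "Saturday, December 27th, 2014"

# str.find() expects a substring, not a pattern -- this line raises
# "TypeError: expected a character buffer object":
#     byline.find(re.compile(r"[A-Z][a-z]*\s"))

# re.search() accepts the pattern; .start() gives the index of the match:
match = re.search(r"[A-Z][a-z]*\s", byline)
daypos = match.start() if match else -1  # index where "December " begins
```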
Instead of parsing the date string with a regular expression, parse it with dateutil:
```python
from dateutil.parser import parse

for div in soup.select('div.blog-box'):
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
    dt = parse(byline)
    new_filename = "{dt.year}{dt.month}{dt.day}.txt".format(dt=dt)
    ...
```
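To illustrate, assuming the byline really looks like "Saturday, December 27th, 2014", dateutil copes with both the weekday and the ordinal suffix on its own:

```python
from dateutil.parser import parse  # third-party: pip install python-dateutil

# Hypothetical byline matching the shape described in the question:
dt = parse("Saturday, December 27th, 2014")
print(dt.year, dt.month, dt.day)  # -> 2014 12 27
```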
Alternatively, you could parse the string with datetime.strptime(), but then you need to take care of the ordinal suffix first:
```python
byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')
```
The re.sub() call finds st, nd, rd or th immediately after a digit and replaces the suffix with an empty string; after that, the date string matches the '%A, %B %d %Y' format.
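A runnable sketch of those two steps, assuming a byline of the shape discussed above:

```python
import re
from datetime import datetime

byline = "Saturday, December 27th 2014"  # sample byline; exact shape is an assumption
cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)  # "Saturday, December 27 2014"
dt = datetime.strptime(cleaned, '%A, %B %d %Y')
print(dt.strftime('%y%m%d'))  # -> 141227, the filename the question asks for
```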
A few additional notes:

- you can pass the result of urlopen() directly to the BeautifulSoup constructor
- instead of find_all() with a class name, you can use the CSS selector div.blog-box
- use os.path.join() to build the output path and open the file with a context manager

Fixed version:
```python
import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse

for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)
    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
        dt = parse(byline)
        new_filename = "{dt.year}{dt.month}{dt.day}.txt".format(dt=dt)
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)
    print "finished another url from page {}".format(i)
```
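One caveat, since the question also asks about turning December into 12: the format() call above produces names like 20141227.txt and does not zero-pad single-digit months or days. If you want the two-digit, zero-padded 141227 name from the question, strftime() is a safer sketch:

```python
from datetime import datetime

dt = datetime(2014, 12, 27)  # stand-in for the parsed byline
print(dt.strftime('%y%m%d') + ".txt")  # -> 141227.txt

# With attribute formatting, a date like February 3rd comes out unpadded:
print("{dt.month}{dt.day}".format(dt=datetime(2014, 2, 3)))  # -> 23
```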