Zvi*_*Zvi 10 python string date
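I want to read a string and return the first date that appears in it. Is there a ready-made module I can use? I tried writing regular expressions for every possible date format, but the result is very long. Is there a better way?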
Ros*_*ron 15
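You can run a date parser on every substring of the text and pick the first date it finds. Of course, such a solution will either catch things that are not dates, or miss things that are, or most likely both.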
Here is an example that uses dateutil.parser to catch anything that looks like a date:
import dateutil.parser
from itertools import chain
import re

# Add more strings that confuse the parser in the list
UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                          dateutil.parser.parserinfo.PERTAIN,
                          ['a']))

def _get_date(tokens):
    # Try the longest prefix of tokens first, then progressively shorter ones
    for end in range(len(tokens), 0, -1):
        region = tokens[:end]
        # Skip prefixes made up only of whitespace and filler words
        if all(token.isspace() or token in UNINTERESTING
               for token in region):
            continue
        text = ''.join(region)
        try:
            date = dateutil.parser.parse(text)
            return end, date
        except (ValueError, OverflowError):
            # Not parseable as a date; fall through to a shorter prefix
            pass

def find_dates(text, max_tokens=50, allow_overlapping=False):
    # Split into word and separator tokens, dropping the empty strings
    tokens = [token for token in re.split(r'(\S+|\W+)', text) if token]
    skip_dates_ending_before = 0
    for start in range(len(tokens)):
        region = tokens[start:start + max_tokens]
        result = _get_date(region)
        if result is not None:
            # Note: end is an index relative to start, not to the whole text
            end, date = result
            if allow_overlapping or end > skip_dates_ending_before:
                skip_dates_ending_before = end
                yield date

test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a
child during the Daleks' abduction and invasion of Earth in 2009.
On 1st July 2058, Bowie Base One became the first Human colony on Mars. It
was commanded by Captain Adelaide Brooke, and initially seemed to prove that
it was possible for Humans to live long term on Mars."""

print("With no overlapping:")
for date in find_dates(test, allow_overlapping=False):
    print(date)

print("With overlapping:")
for date in find_dates(test, allow_overlapping=True):
    print(date)
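Unsurprisingly, the results of this code are poor whether or not you allow overlapping. If overlapping is allowed, you get many dates that appear nowhere in the text; if it is not allowed, you miss an important date in the text.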
With no overlapping:
1999-05-12 00:00:00
2009-07-01 20:58:00
With overlapping:
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-03 00:00:00
1999-05-03 00:00:00
1999-07-03 00:00:00
1999-07-03 00:00:00
2009-07-01 20:58:00
2009-07-01 20:58:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
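Essentially, if overlapping is allowed, each date in the text is picked up again from every later start position inside it, so shorter fragments of the same date are parsed as well.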
However, if overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00, and no attempt is made to parse the date after the period.
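The spurious entries in the overlapping output (1999-05-03, 1999-07-03) are consistent with how dateutil.parser.parse fills in missing fields from its default argument, which is the current date at midnight when none is given. A minimal sketch, assuming a hypothetical "today" of 3 July 2010 (the name ASSUMED_TODAY and its value are illustrative, not from the answer above):

```python
from datetime import datetime

from dateutil import parser

# Hypothetical "today", fixed so the results are reproducible.
# Without default=, dateutil uses the current date at midnight.
ASSUMED_TODAY = datetime(2010, 7, 3)

# A complete date: nothing is taken from the default.
full = parser.parse("12 May 1999", default=ASSUMED_TODAY)

# Day missing: the day (3) comes from the default.
no_day = parser.parse("May 1999", default=ASSUMED_TODAY)

# Only a year: month (7) and day (3) both come from the default.
year_only = parser.parse("1999", default=ASSUMED_TODAY)

for label, value in [("12 May 1999", full),
                     ("May 1999", no_day),
                     ("1999", year_only)]:
    print(f"{label!r:>14} -> {value}")
```

Under these assumptions the three calls give 1999-05-12, 1999-05-03 and 1999-07-03 (all at 00:00:00), the same values that show up in the overlapping output. Passing an explicit default and comparing the result against it is one possible way to detect which fields were actually present in the text.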