mli*_*ner 17 python parsing python-dateutil
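I have a string that contains multiple date values, and I want to parse all of them. The string is natural language, so the best thing I've found so far is dateutil.
Unfortunately, if a string contains more than one date value, dateutil throws an error: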
>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
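Any ideas on how to parse all the dates in a long string? Ideally the result would be a list, but I can build that myself if I need to.
I'm using Python, but at this point other languages are probably fine if they get the job done.
PS - I guess I could recursively split the input in the middle and try again and again until it works, but that's a hell of a hack.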
Mat*_*ttH 16
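Having looked at it, the least hacky way would be to modify the dateutil parser to have a fuzzy-multiple option.
parser._parse takes your string, tokenizes it with _timelex, and then compares the tokens against the data defined in parserinfo.
There, if a token doesn't match anything in parserinfo, the parse fails unless fuzzy is True.
What I suggest is that you allow non-matches while you have no processed time tokens; then, when you hit a non-match, process the data parsed up to that point and start looking for time tokens again.
It shouldn't take too much effort.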
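The batching idea described above can be sketched without touching dateutil at all. This is only a rough, stdlib-only illustration: the regex tokenizer and the hand-picked MONTHS and JUMP sets are assumptions standing in for _timelex and parserinfo, and split_date_runs is a hypothetical name, not part of dateutil.

```python
import re

# Assumption: a crude tokenizer (digit runs, letter runs, single punctuation
# characters) standing in for dateutil's _timelex.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]")

# Assumption: small hand-picked vocabularies standing in for parserinfo's tables.
MONTHS = {
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august", "sep", "sept",
    "september", "oct", "october", "nov", "november", "dec", "december",
}
JUMP = {"-", "/", ",", ".", "of", "on", "and", "at", "st", "nd", "rd", "th"}


def split_date_runs(text):
    """Yield space-joined runs of consecutive date-like tokens."""
    batch = []
    for token in TOKEN_RE.findall(text):
        low = token.lower()
        if low in JUMP:
            continue  # filler: neither extends nor ends the current run
        if token.isdigit() or low in MONTHS:
            batch.append(token)
        elif batch:
            # First non-date token after a run: emit the run and start over.
            yield " ".join(batch)
            batch = []
    if batch:  # a run that reaches the end of the input
        yield " ".join(batch)


s = ("I like peas on 2011-04-23, and I also like them on easter "
     "and my birthday, the 29th of July, 1928")
runs = list(split_date_runs(s))  # ['2011 04 23', '29 July 1928']
```

Each run could then be handed to a date parser on its own. Note that "easter" is dropped here simply because it is not in the assumed vocabulary.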
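Update
While you're waiting for your patch to get rolled in...
This is a little hacky and uses non-public functions in the library, but it doesn't require modifying the library and it isn't trial and error. You may get false positives if you have any lone tokens that can be turned into floats. You may need to filter the results some more.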
from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

def timetoken(token):
    # Anything that converts to a float counts as a possible date/time number.
    try:
        float(token)
        return True
    except ValueError:
        pass
    # Otherwise ask parserinfo whether the token is a known date/time word.
    return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))

def timesplit(input_string):
    batch = []
    for token in _timelex(input_string):
        if timetoken(token):
            # Jump tokens (separators, filler words) never enter or end a batch.
            if info.jump(token):
                continue
            batch.append(token)
        else:
            # A non-time token ends the current batch, if there is one.
            if batch:
                yield " ".join(batch)
                batch = []
    if batch:
        yield " ".join(batch)

for item in timesplit(a):
    print "Found:", item
    print "Parsed:", p.parse(item)
Output:
Found: 2011 04 23
Parsed: 2011-04-23 00:00:00
Found: 29 July 1928
Parsed: 1928-07-29 00:00:00
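As one possible way to do the extra filtering mentioned above, candidates that consist of nothing but a single number can be dropped before parsing. This is a heuristic sketch, not part of dateutil; is_bare_number and filter_candidates are hypothetical names, and the rule will also discard a lone year such as "1928".

```python
def is_bare_number(text):
    """Return True if the whole candidate is just one number, e.g. '3' or '12.5'."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def filter_candidates(candidates):
    """Keep non-empty candidates that are more than a single bare number."""
    return [c for c in candidates
            if c.strip() and not is_bare_number(c.strip())]


# '3' and '12.5' are the kind of false positive that a float-based token
# test lets through; multi-token candidates are kept.
kept = filter_candidates(["3", "2011 04 23", "29 July 1928", "12.5", ""])
# kept == ['2011 04 23', '29 July 1928']
```

Whether a lone number should count as a date depends on the input, so the rule is worth adjusting to the data at hand.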
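Update for Dieter
Dateutil 2.1 appears to have been written for compatibility with python3 and uses a "compatibility" library called six. Something isn't right with it, and it doesn't treat str objects as text.
This solution works with dateutil 2.1 if you pass the strings as unicode or as file-like objects: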
from cStringIO import StringIO
for item in timesplit(StringIO(a)):
    print "Found:", item
    print "Parsed:", p.parse(StringIO(item))
If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. For example:
from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)
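To see why an option such as dayfirst matters, consider a purely numeric date. The sketch below uses only the standard library's datetime.strptime to show the two readings of the same string; it illustrates the ambiguity and is not how dateutil resolves it internally.

```python
from datetime import datetime

ambiguous = "01/02/2011"

# Month-first reading: January 2nd.
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")

# Day-first reading: February 1st.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")
```

Here month_first is 2011-01-02 and day_first is 2011-02-01, so the same input yields two different dates depending on which convention the parser is told to assume.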
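While I was offline, I was bothered by the answer I posted here yesterday. Yes, it did the job, but it was unnecessarily complicated and extremely inefficient.
Here's the back-of-the-envelope edition that should do a much better job!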
import itertools
from dateutil import parser

jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))

def parse_multiple(s):
    def is_valid_kw(s):
        try:  # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords

    def _split(s):
        kw_found = False
        tokens = parser._timelex.split(s)
        for i in xrange(len(tokens)):
            if tokens[i] in jumpwords:
                continue
            if not kw_found and is_valid_kw(tokens[i]):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(tokens[i]):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])

    return [parser.parse(x) for x in _split(s)]
Example usage:
>>> parse_multiple("I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928")
[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]
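It's worth noting that its behaviour deviates slightly from dateutil.parser.parse when dealing with empty or unknown strings. Dateutil returns the current day, while parse_multiple returns an empty list, which, IMHO, is what one would expect.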
>>> from dateutil import parser
>>> parser.parse("")
datetime.datetime(2011, 8, 12, 0, 0)
>>> parse_multiple("")
[]
P.S. Just spotted MattH's updated answer, which does something very similar.