Extracting dates from a string in Python

Gee*_*sht 0 python regex nlp date python-2.7

I have a string:

fmt_string2 = 'I want to apply for leaves from 12/12/2017 to 12/18/2017'

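Here I want to extract the dates. My code needs to be robust, though, because a date can appear in any format, such as 12 January 2017 or 12 Jan 17, and its position in the string can change too. For the string above, I tried: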

''.join(fmt_string2.split()[-1].split('.')[::-10])

But here I am hard-coding the position of the date, which I don't want. Can anyone help me write robust code to extract the dates?

小智 5

If 12/12/2017, 12 January 2017, and 12 Jan 17 are the only possible patterns, the following regex-based code is sufficient.

import re

string = 'I want to apply for leaves from 12/12/2017 to 12/18/2017 I want to apply for leaves from 12 January 2017 to ' \
       '12/18/2017 I want to apply for leaves from 12/12/2017 to 12 Jan 17 '

# Raw string so the backslashes reach the regex engine unchanged; group 0 is the whole date
matches = re.findall(r'(\d{2}[/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/ ]\d{2,4})', string)

for match in matches:
    print(match[0])

Output:

12/12/2017
12/18/2017
12 January 2017
12/18/2017
12/12/2017
12 Jan 17

See the regex explained on regex101.
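
As a possible variation on the pattern above (a sketch, not part of the original answer), the month alternatives can be generated from the standard `calendar` module instead of being typed by hand, and `\d{1,2}` also lets single-digit days through. The names `DATE_RE` and `extract_dates` are made up for this example, and the month names assume an English (default) locale.

```python
import calendar
import re

# Collect full and abbreviated month names; the set drops the duplicate "May".
months = set()
for month_idx in range(1, 13):
    months.add(calendar.month_name[month_idx])
    months.add(calendar.month_abbr[month_idx])

# Longest names first so "January" is tried before "Jan".
month_alt = "|".join(sorted(months, key=len, reverse=True))

DATE_RE = re.compile(
    r"\b\d{1,2}"                                  # day (or month), 1-2 digits
    r"(?:/\d{1,2}/|\s(?:" + month_alt + r")\s)"   # "/18/" or " January "
    r"\d{2,4}\b"                                  # 2- to 4-digit year
)


def extract_dates(text):
    """Return every substring of text that looks like one of the supported dates."""
    # All groups are non-capturing, so findall returns the full matches.
    return DATE_RE.findall(text)


print(extract_dates("leaves from 12/12/2017 to 12 Jan 17"))
```

This only recognises the three shapes discussed in the answer; anything else (for example "January 12, 2017") would need another alternative in the pattern.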


Ayu*_*yan 5

Using regular expressions

I suggest the following approach instead of relying entirely on regular expressions:

import re
from dateutil.parser import parse

Sample text

text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""

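The regex finds anything of the form "from A to B". The advantage is that I don't have to handle every date format and keep extending my regex. The pattern adapts to whatever appears between the two keywords.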

pattern = re.compile(r'from (.*) to (.*)')    
matches = re.findall(pattern, text)

For the text above, the regex yields these matches:

[('12/12/2017', '12/18/2017'), ('12 January 2018', '18 January 2018'), ('12 Feb 2018', '18 Feb 2018')]

For each match, I parse the dates. A value that is not a date raises an exception, which the except block catches so that the value is skipped.

for val in matches:
    try:
        dt_from = parse(val[0])
        dt_to = parse(val[1])

        print("Leave applied from", dt_from.strftime('%d/%b/%Y'), "to", dt_to.strftime('%d/%b/%Y'))
    except ValueError:
        print("skipping", val)

Output:

Leave applied from 12/Dec/2017 to 18/Dec/2017
Leave applied from 12/Jan/2018 to 18/Jan/2018
Leave applied from 12/Feb/2018 to 18/Feb/2018
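
If installing `dateutil` is not an option, a rough standard-library stand-in is to try a fixed list of `strptime` formats in turn. This is only a sketch under assumptions: `KNOWN_FORMATS` and `parse_known_date` are hypothetical names, the format list covers just the shapes seen in this thread (with numeric dates read as month/day/year), and month names are matched in the current locale.

```python
from datetime import datetime

# Assumed set of formats; extend it to match the inputs you expect.
KNOWN_FORMATS = (
    "%m/%d/%Y",   # 12/18/2017
    "%d %B %Y",   # 12 January 2018
    "%d %b %Y",   # 12 Feb 2018
    "%d %b %y",   # 12 Jan 17
)


def parse_known_date(value):
    """Return a datetime for the first format that fits, or None if none does."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


for candidate in ("12/18/2017", "12 January 2018", "12 Jan 17", "not a date"):
    print(candidate, "->", parse_known_date(candidate))
```

Unlike `dateutil.parser.parse`, this never guesses: a string that fits none of the listed formats simply gives `None`.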

Using pyparsing

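A limitation of regular expressions is that they can become very complex once they have to be dynamic enough to handle less straightforward input, such as: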

text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""
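
To make the limitation concrete, here is a small check (added for illustration) of what the simple "from A to B" regex captures on this kind of input. Only the standard `re` module is used, and the sample is shortened to two lines.

```python
import re

text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
"""

pattern = re.compile(r'from (.*) to (.*)')
matches = pattern.findall(text)

for start, end in matches:
    print(repr(start), "|", repr(end))

# Prints:
# 'start 12/12/2017' | 'end date 12/18/2017 some random text'
# '12 January 2018' | '18 January 2018 some random text'
```

The captured groups now carry extra words, so they are no longer clean date strings that can be handed straight to a date parser.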

The pyparsing module for Python is therefore a better fit here.

import pyparsing as pp

The approach here is to build a grammar that can parse the entire text.

Build the list of month names that will be used as pyparsing keywords:

import calendar

months_list = []
for month_idx in range(1, 13):
    months_list.append(calendar.month_name[month_idx])
    months_list.append(calendar.month_abbr[month_idx])

# join the list to use it as pyparsing keyword
month_keywords = " ".join(months_list)

Grammar used for parsing:

# date separator - can be one of '/', '.', or ' '
separator = pp.Word("/. ")

# Grammar for a numeric date, e.g. 12/12/2018
numeric_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=4))

# Grammar for a text date, e.g. 12/Jan/2018
text_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.oneOf(month_keywords) + separator + pp.Word(pp.nums, max=4))

# Either numeric or text date
date_pattern = numeric_date | text_date

# Final grammar - from x to y
# pp.Keyword matches the exact word; pp.Word("from") would accept any run of the characters f, r, o, m
pattern = pp.Suppress(pp.SkipTo("from") + pp.Keyword("from") + pp.Optional("start") + pp.Optional("date")) + date_pattern
pattern += pp.Suppress(pp.Keyword("to") + pp.Optional("end") + pp.Optional("date")) + date_pattern

# Group the pattern; it can occur multiple times
pattern = pp.OneOrMore(pp.Group(pattern))

Parse the input text:

result = pattern.parseString(text)

# Print result
for match in result:
    print("from", match[0], "to", match[1])

Output:

from 12/12/2017 to 12/18/2017
from 12 January 2018 to 18 January 2018
from 12 Feb 2018 to 18 Feb 2018