Jon*_*Jon 9 python parsing nlp
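I'm trying to come up with a parser for football plays. I use the term "natural language" very loosely here, so please bear with me, as I know next to nothing about this field.
Here are some examples of what I'm working with (format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):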
04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
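So far I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down and distance, offensive team), along with some scripts that fetch this data and sanitize it into the format seen above. Each line becomes a "Play" object to be stored in a database.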
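For reference, the "easy" part might look roughly like the sketch below. The `Play` dataclass, its field names, and `parse_line` are hypothetical stand-ins for the parser described above, and the regex only covers the "Nth and M@SPOT" shape seen in the sample lines (something like "1st and Goal" would need extra handling).

```python
import re
from dataclasses import dataclass


@dataclass
class Play:
    # Hypothetical container; the real "Play" object may hold more fields.
    clock: str
    down: int
    distance: int
    spot: str
    offense: str
    description: str


# Matches strings such as "4th and 20@NYJ46".
DOWN_DIST = re.compile(r"(\d)(?:st|nd|rd|th) and (\d+)@(\w+)")


def parse_line(line: str) -> Play:
    """Split one TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION| line into a Play."""
    clock, down_dist, offense, description = line.rstrip("|").split("|", 3)
    match = DOWN_DIST.fullmatch(down_dist)
    if match is None:
        raise ValueError(f"unrecognised down and distance: {down_dist!r}")
    down, distance, spot = match.groups()
    return Play(clock, int(down), int(distance), spot, offense, description)


play = parse_line(
    "02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to "
    "Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|"
)
print(play.down, play.distance, play.spot)
```

The description is kept as one raw string here, since parsing it is the hard part discussed next.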
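The hard part here (for me, at least) is parsing the description of the play. Here is some of the information I'd like to extract from that string:
Example string: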
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
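As a baseline before reaching for NLP tooling, a few of these fields can be pulled out with plain regular expressions. This is only a sketch under assumptions drawn from the sample lines: the `extract` function and its patterns are hypothetical, names are assumed to be two capitalized words, and losses, penalties, returns and scoring are not handled.

```python
import re

# Two capitalized words in a row, e.g. "Shonn Greene" or "Mat McBriar".
NAME = re.compile(r"\b[A-Z][A-Za-z'-]+ [A-Z][A-Za-z'-]+\b")
# "for 7 yards" or "for 1 yard".
GAIN = re.compile(r"\bfor (\d+) yards?\b")
# "to the left", "to the right" or "up the middle".
DIRECTION = re.compile(r"\b(?:to the (left|right)|up the (middle))\b")


def extract(description: str) -> dict:
    """Derive a subset of the desired fields from one play description."""
    gain = GAIN.search(description)
    direction = DIRECTION.search(description)
    return {
        "passing": " pass " in description,
        "rushing": " rush " in description,
        "punt": " punts " in description,
        "fumble": "FUMBLE" in description,
        "direction": (direction.group(1) or direction.group(2)) if direction else None,
        "yardage_diff": int(gain.group(1)) if gain else 0,
        "playmakers": NAME.findall(description),
    }


info = extract(
    "Mark Sanchez pass to the left to Shonn Greene for 7 yards "
    "to the NYJ44. Tackled by Mike Jenkins."
)
print(info)
```

This kind of flat keyword matching breaks down once a description chains several events (fumble, recovery, penalty), which is where a real grammar starts to pay off.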
The logic I came up with for the initial parser goes something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
# Who scored? off or def?
# TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
# def or off?
# turnover?
# INT, fumble, to on downs?
# off play makers
# def play makers
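The descriptions can get pretty hairy (multiple fumbles and recoveries combined with penalties, etc.), and I was wondering whether I could take advantage of some NLP modules out there. Chances are I'll spend a few days on a dumb/static state-machine-like parser instead, but if anyone has suggestions on how to approach this with NLP techniques, I'd like to hear them.
I think pyparsing would be very useful here.
Your input text looks very regular (unlike real natural language), and pyparsing is very good at that kind of input. You should have a look at it.
For example, to parse the following sentences: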
Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.
You could define a grammar for such a sentence with something like this (check the docs for the exact syntax):
from pyparsing import Group, Literal, Optional, Word, alphanums, alphas, nums
name = Group(Word(alphas) + Word(alphas))("name")  # first and last name
direction = Optional(Literal("to the") + (Literal("left") | Literal("right"))("direction"))
action = (Literal("punts") | Literal("rush"))("action") + direction
distance = Word(nums)("distance") + (Literal("yards") | Literal("yard"))
pattern = name + action + Literal("for") + distance + Literal("to") + Optional(Literal("the")) + Word(alphanums)("spot")
pyparsing will break the string apart using this pattern. It also returns a result object that can be read like a dictionary, containing the name, action and distance extracted from the sentence.
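To make that concrete, here is a self-contained sketch that runs such a grammar over the two sample sentences. It assumes pyparsing is installed; the grammar only covers "punts" and "rush" plays, and the `parse_play` helper and the "direction" and "spot" result names are additions for illustration, not part of any fixed API.

```python
from pyparsing import Group, Literal, Optional, Word, alphanums, alphas, nums

# Player name: assumed to be exactly two alphabetic words.
name = Group(Word(alphas) + Word(alphas))("name")
# Optional "to the left" / "to the right" after the action verb.
direction = Optional(
    Literal("to the") + (Literal("left") | Literal("right"))("direction")
)
action = (Literal("punts") | Literal("rush"))("action") + direction
distance = Word(nums)("distance") + (Literal("yards") | Literal("yard"))
pattern = (
    name + action + Literal("for") + distance
    + Literal("to") + Optional(Literal("the")) + Word(alphanums)("spot")
)


def parse_play(sentence: str) -> dict:
    """Parse one sentence and return the named pieces as a plain dict."""
    result = pattern.parseString(sentence)
    return {
        "name": " ".join(result["name"]),
        "action": result["action"],
        "direction": result.get("direction"),  # None when no direction given
        "distance": int(result["distance"]),
        "spot": result["spot"],
    }


for sentence in (
    "Mat McBriar punts for 32 yards to NYJ14.",
    "Mark Sanchez rush to the right for 3 yards to the NYJ24.",
):
    print(parse_play(sentence))
```

Extending this to passes, tackles and fumbles would mean adding more alternatives to the grammar, one sentence shape at a time.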