Nur*_*rse 13 python parsing nlp machine-learning information-extraction
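I am a nurse and I know Python, but I am not an expert; I only use it to process DNA sequences.
We received hospital records written in human language, and I am supposed to insert this data into a database or a CSV file, but there are more than 5000 lines, so this could be very hard. All the data is written in a consistent format; let me show you an example: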
11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later
From this I should get the following data:
Sex: Male
Symptoms: Nausea
Vomiting
Death: True
Death Time: 11/11/2010 - 01:00pm
Another example:
11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room
From this I should get:
Sex: Female
Symptoms: Heart burn
Vomiting of blood
Death: True
Death Time: 11/11/2010 - 10:00am
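As you can see, when a record says in ......., "in" is a keyword and all the text after it is a place, until I find another keyword.
At the beginning, He or She determines the sex. After got ....., whatever follows is a group of symptoms that I should split on the separator, which can be a comma, a hyphen or something else, but is consistent within the same line.
For died ..... hours later, I should also extract the number of hours. Sometimes the patient is still alive and was discharged, and so on.
That is to say, we have a lot of conventions, and I think that if I can tokenize the text with keywords and patterns, I can get the job done. So if you know a useful function/module/tutorial/tool for this, preferably in Python (if not Python, a GUI tool would be nice), please let me know.
Some information: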
there are a lot of rules to express various medical data but here are few examples
- Start with the same date/time format followed by a space followed by a colon followed by a space followed by He/She followed by a space followed by rules separated by and
- Rules:
* got <symptoms>,<symptoms>,....
* investigations were done <investigation>,<investigation>,<investigation>,......
* received <drug or procedure>,<drug or procedure>,.....
* discharged <digit> (hour|hours) later
* kept under observation
* died <digit> (hour|hours) later
* died <digit> (hour|hours) later in <place>
other rules do exist but they follow the same idea
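
The rules listed above can be sketched as a small table of regular expressions, one per rule, applied to each clause of a record. This is only an illustration under stated assumptions: the names `parse_line`, `RULES` and `STAMP_FORMAT` and the output field names are made up for the example, the clauses are assumed to be joined by the literal text " and ", items are split on commas or hyphens, and the date is assumed to be month/day/year (the records do not say which).

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout. Whether 11/11/2010 means month/day or
# day/month is not stated in the records, so "%m/%d" is a guess.
STAMP_FORMAT = "%m/%d/%Y - %I:%M%p"

LINE_PAT = re.compile(r"^(?P<stamp>.+?) : (?P<pronoun>He|She) (?P<body>.+)$")

# One (field, pattern) pair per rule; further rules can be appended.
RULES = [
    ("symptoms", re.compile(r"^got (?P<items>.+)$")),
    ("investigations", re.compile(r"^investigations were done (?P<items>.+)$")),
    ("treatments", re.compile(r"^received (?P<items>.+)$")),
    ("discharged", re.compile(r"^discharged (?P<hours>\d+) hours? later$")),
    ("observation", re.compile(r"^kept under observation$")),
    ("died", re.compile(r"^died (?P<hours>\d+) hours? later(?: in (?P<place>.+))?$")),
]


def parse_line(line):
    """Parse one record line into a dict; raise ValueError if it does not fit."""
    match = LINE_PAT.match(line.strip())
    if not match:
        raise ValueError("unrecognised record: %r" % line)
    stamp = datetime.strptime(match.group("stamp"), STAMP_FORMAT)
    record = {
        "record_time": stamp,
        "sex": "Male" if match.group("pronoun") == "He" else "Female",
        "death": False,
    }
    # Assumption: clauses are joined by " and ", and no item contains it.
    for clause in match.group("body").split(" and "):
        clause = clause.strip()
        for field, pattern in RULES:
            found = pattern.match(clause)
            if not found:
                continue
            groups = found.groupdict()
            if "items" in groups:
                # List-style rule: split the items on commas or hyphens.
                parts = re.split(r"[,-]", groups["items"])
                record[field] = [p.strip() for p in parts if p.strip()]
            elif field == "observation":
                record[field] = True
            else:
                # Time-offset rule: add the hours to the record time.
                when = stamp + timedelta(hours=int(groups["hours"]))
                if field == "died":
                    record["death"] = True
                    record["death_time"] = when
                    record["death_place"] = groups.get("place")
                else:
                    record["discharge_time"] = when
            break
        else:
            raise ValueError("no rule matches clause: %r" % clause)
    return record
```

Raising on an unmatched clause, rather than skipping it, makes records that follow an unlisted rule show up immediately, so the table can be extended one rule at a time.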
Here is one possible way to solve this.
See if this works for you. It may need some adjustments.
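import re
from datetime import datetime, timedelta

# Assumed layout of the timestamp; swap %m and %d if the records are day-first.
STAMP_FORMAT = '%m/%d/%Y - %I:%M%p'

with open('your_csv_file') as records, open('parsed_file', 'w') as new_file:
    for rec in records:
        rec = rec.strip()
        if not rec:
            continue  # skip blank lines
        date, reason = rec.split(' : ', 1)
        # The leading pronoun determines the sex.
        sex = 'Male' if reason.startswith('He ') else 'Female'
        # The symptoms sit between "got " and the first " and ".
        symptoms = reason.split(' and ')[0].split(' got ', 1)[1]
        symptoms = '\n'.join(s.strip() for s in symptoms.split(','))
        # "died <n> hour(s) later" gives the offset from the record time.
        match = re.search(r'died (\d+) hours? later', reason)
        died = 'True' if match else 'False'
        death_time = ''
        if match:
            when = datetime.strptime(date, STAMP_FORMAT) + timedelta(hours=int(match.group(1)))
            death_time = when.strftime(STAMP_FORMAT).lower()
        new_file.write("Sex: %s\nSymptoms: %s\nDeath: %s\nDeath Time: %s\n\n" % (sex, symptoms, died, death_time))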
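Each input record is assumed to be on its own line (separated by \n). Since you did not say how the output should be laid out, each patient record in the output is separated from the next by two newlines (\n\n).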
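
Since the question asks for a database or CSV file rather than free text, here is a minimal sketch of writing already-parsed records with the standard `csv` module. The function name `write_records`, the column names and the choice to join symptoms with "; " are assumptions for illustration; the sample row is the expected output from the question's first example.

```python
import csv
import io

FIELDS = ["Sex", "Symptoms", "Death", "Death Time"]


def write_records(rows, handle):
    """Write parsed records as CSV, one row per patient record."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        out = dict(row)
        # A CSV cell holds one value, so the symptom list is joined.
        out["Symptoms"] = "; ".join(row["Symptoms"])
        writer.writerow(out)


rows = [
    {"Sex": "Male", "Symptoms": ["Nausea", "Vomiting"],
     "Death": True, "Death Time": "11/11/2010 - 01:00pm"},
]
buffer = io.StringIO()
write_records(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open('parsed.csv', 'w', newline='')`.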
Later: @Nurse, what did you end up doing? Just curious.
This uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am'), and parsedatetime to parse the relative time (e.g. '4 hours later'):
import datetime
import pprint
import re
import sys
import time

import dateutil.parser as dparser
# Recent parsedatetime releases expose Calendar at the package top level;
# older releases used parsedatetime.parsedatetime and parsedatetime_consts.
import parsedatetime as pdt

pdt_parser = pdt.Calendar()
record_time_pat = re.compile(r'^(.+)\s+:')
sex_pat = re.compile(r'\b(he|she)\b', re.IGNORECASE)
# "hours?" also accepts the singular "hour" allowed by the rules.
death_time_pat = re.compile(r'died\s+(.+hours? later).*$', re.IGNORECASE)
symptom_pat = re.compile(r'[,-]')


def parse_record(astr):
    # Record time: everything before the first " :".
    match = record_time_pat.match(astr)
    if match:
        record_time = dparser.parse(match.group(1))
        astr, _ = record_time_pat.subn('', astr, 1)
    else:
        sys.exit('Can not find record time')
    # Sex: the first "he" or "she".
    match = sex_pat.search(astr)
    if match:
        sex = match.group(1)
        sex = 'Female' if sex.lower().startswith('s') else 'Male'
        astr, _ = sex_pat.subn('', astr, 1)
    else:
        sys.exit('Can not find sex')
    # Death time: "died N hours later", relative to the record time.
    match = death_time_pat.search(astr)
    if match:
        parsed, date_type = pdt_parser.parse(match.group(1), record_time)
        # date_type is 0 when parsedatetime could not parse the text.
        death_time = (datetime.datetime.fromtimestamp(time.mktime(parsed))
                      if date_type else None)
        astr, _ = death_time_pat.subn('', astr, 1)
        is_dead = True
    else:
        death_time = None
        is_dead = False
    # Remove "and" only as a whole word, so words such as "hand" survive.
    astr = re.sub(r'\band\b', '', astr)
    symptoms = [s.strip() for s in symptom_pat.split(astr)]
    return {'Record Time': record_time,
            'Sex': sex,
            'Death Time': death_time,
            'Symptoms': symptoms,
            'Death': is_dead}


if __name__ == '__main__':
    tests = [('11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
              {'Sex': 'Male',
               'Symptoms': ['got nausea', 'vomiting'],
               'Death': True,
               'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
               'Record Time': datetime.datetime(2010, 11, 11, 9, 0)}),
             ('11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room',
              {'Sex': 'Female',
               'Symptoms': ['got heart burn', 'vomiting of blood'],
               'Death': True,
               'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
               'Record Time': datetime.datetime(2010, 11, 11, 9, 0)})
             ]
    for record, answer in tests:
        result = parse_record(record)
        pprint.pprint(result)
        assert result == answer
        print()
yields:
{'Death': True,
'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
'Sex': 'Male',
'Symptoms': ['got nausea', 'vomiting']}
{'Death': True,
'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
'Sex': 'Female',
'Symptoms': ['got heart burn', 'vomiting of blood']}
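Note: Be careful when parsing dates. Does "8/9/2010" mean August 9th or September 8th? Do all the record keepers use the same convention? If you choose to use dateutil (and I really think it is the best option if the date strings are not rigidly structured), be sure to read the "Format precedence" section of the dateutil documentation so that you can (hopefully) resolve '8/9/2010' correctly. If you cannot guarantee that all the record keepers use the same convention for specifying dates, then the results of this script should be checked manually. That might be wise in any case.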
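
To make the manual check more targeted, one option is to flag only the dates that really are ambiguous. The sketch below uses the standard library rather than dateutil, and the helper names `interpretations` and `is_ambiguous` are made up for the example. It tries both the month-first and day-first readings and reports a date as ambiguous only when the two readings disagree. Once the convention is known, dateutil's `parse` accepts a `dayfirst` argument to select it.

```python
from datetime import datetime


def interpretations(date_text):
    """Return the distinct dates that a slash-separated date could mean."""
    found = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            found.add(datetime.strptime(date_text, fmt).date())
        except ValueError:
            pass  # not a valid date under this convention
    return sorted(found)


def is_ambiguous(date_text):
    """True when month-first and day-first readings give different dates."""
    return len(interpretations(date_text)) > 1


for text in ("8/9/2010", "11/11/2010", "25/12/2010"):
    print(text, is_ambiguous(text), interpretations(text))
```

Here '8/9/2010' is flagged because it has two valid readings, while '11/11/2010' and '25/12/2010' each have only one.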