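I have a database of sentences that contain only uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?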
One approach is to infer capitalization from POS tags, for example using the Python Natural Language Toolkit (NLTK):
import re

import nltk  # needs the NLTK tokenizer and tagger models (see nltk.download)

def truecase(text):
    # lowercase every token, then apply POS tagging
    tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(text)])
    # infer capitalization from POS tags: capitalize common nouns
    normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
    # capitalize the first word
    normalized_sent[0] = normalized_sent[0].capitalize()
    # use a regular expression to remove the space before punctuation
    pretty_string = re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent))
    return pretty_string
This isn't perfect, especially since I don't know what your data looks like, but perhaps you can get the idea:
>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
Search for existing work on truecasing: http://en.wikipedia.org/wiki/Truecasing
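If you have access to similar medical data with normal capitalization, generating your own dataset would be very easy. Uppercase everything and use the mapping back to the original text to train/test your algorithm.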
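As a rough illustration of that idea, here is a minimal sketch that builds (all-caps, original) pairs from normally cased text and learns the most frequent casing of each word. The function names (`make_pairs`, `train_casing_model`, `truecase_by_lookup`) and the tiny corpus are hypothetical, chosen only for this example. It is a simple word-frequency lookup, not the POS-based method above, and a real system would need far more training text.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def make_pairs(cased_sentences):
    """Build (all-caps input, original) pairs for training and testing."""
    return [(s.upper(), s) for s in cased_sentences]

def train_casing_model(cased_sentences):
    """Map each lowercased word to its most frequent cased form."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = TOKEN_RE.findall(sentence)
        # Skip the first token: its capital letter comes from the
        # sentence position, not from the word itself.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}

def truecase_by_lookup(text, model):
    """Restore casing word by word; unseen words fall back to lowercase."""
    def restore(match):
        word = match.group(0).lower()
        return model.get(word, word)
    restored = re.sub(r"\w+", restore, text)
    # Capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        restored,
    )

# Hypothetical stand-in for a normally cased medical corpus.
corpus = [
    "The patient was given Klonopin twice daily.",
    "Roche Laboratories makes Klonopin tablets.",
    "An MRI showed no lesions.",
]
model = train_casing_model(corpus)
print(truecase_by_lookup("THE PATIENT WAS GIVEN KLONOPIN. AN MRI SHOWED NO LESIONS.", model))
```

To test such a model, you could hold out part of the output of `make_pairs`, run the all-caps side through `truecase_by_lookup`, and compare the result with the original side token by token.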