Korean language tokenizer

gan*_*ran 5 solr nlp localization tokenize

What is the best tokenizer for processing Korean?

I have tried CJKTokenizer in Solr 4.0. It does tokenize the text, but the accuracy is very low.

alv*_*vas 4

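POSTECH/K is a Korean morphological analyzer that can tokenize and POS-tag Korean data with little effort. The software reports a result of 90.7% on the corpus it was trained and tested on (see http://nlp.postech.ac.kr/download/postag_k/9908_cljournal_gblee.pdf).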


On the Korean data of a multilingual corpus project I have been working on, its POS tagging reached 81%.


There is a catch, however: you have to use Windows to run the software. I have a script that works around this limitation:

    #!/bin/bash -x
    ###############################################################################
    ## Sejong-Shell is a script to call the POSTAG/Sejong tagger on a Unix machine,
    ## because POSTAG/Sejong is only usable in a Korean Microsoft Windows environment.
    ## The original POSTAG/Sejong can be downloaded from
    ## http://isoft.postech.ac.kr/Course/CS730b/2005/index.html
    ##
    ## Sejong-Shell depends on WINE.
    ## WINE can be downloaded from
    ## http://www.winehq.org/download/
    ##
    ## The script reads the input files from one directory and writes the
    ## tagged files into another, retaining the filenames.
    ###############################################################################

    # <source-file_dir> is the directory holding the text files that need tagging
    cd "<source-file_dir>" || exit 1
    for file in *
    do
        echo "$file"
        # <POSTAG-Sejong_dir> is the directory where the POS tagger is installed
        sudo cp "<source-file_dir>/$file" "<POSTAG-Sejong_dir>/input.txt"
        wine start /Unix "$HOME/postagsejong/sjTaggerInteg.exe"
        sleep 30
        # The pause keeps the file from the current iteration from overlapping
        # with the next one. Increase it if a file is large and POSTAG/Sejong
        # needs more than 30 seconds to tag it.
        # <target-file_dir> is where the output files should be stored
        sudo cp "<POSTAG-Sejong_dir>/output.txt" "<target-file_dir>/$file"
    done

    # Instead of using sleep to prevent the overlap:
    #   sleep 30
    # you can pause the loop until a key is pressed:
    #   read -n 1 -s -r -p "Press any key to continue…"
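As an alternative to the fixed 30-second pause, the wait could be driven by the output file itself. The following is only a minimal Python sketch, assuming the tagger writes its result to `output.txt` as in the script above and that a file whose size has stopped changing is finished; the function name `wait_for_stable_file` and its parameters are hypothetical and not part of POSTAG/Sejong.

```python
import os
import time


def wait_for_stable_file(path, timeout=300.0, poll_interval=1.0, stable_checks=3):
    """Return True once `path` exists and its size has stayed the same for
    `stable_checks` consecutive polls; return False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # the file does not exist yet
        if size >= 0 and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            unchanged = 0  # size changed (or file missing): start counting again
        last_size = size
        time.sleep(poll_interval)
    return False
```

A stale `output.txt` left over from the previous file would look finished immediately, so a caller would need to delete it before starting the tagger on the next input.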

Note that POSTECH/K expects its input encoded in euc-kr, so if your data is in utf8, you can use the following script to convert utf8 to euc-kr.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    '''
    pre-sejong clean
    '''

    import glob
    import os

    from nltk.tokenize import RegexpTokenizer

    cwd = './gizaclean_ko'  # directory with the UTF-8 input files
    wrd = './presejong_ko'  # directory for the EUC-KR output files

    # Sentence pattern: a run of non-terminator characters followed by one
    # closing character.
    kr_sent_tokenizer = RegexpTokenizer('[^！？.?!]*[！？."www.*"]')

    os.makedirs(wrd, exist_ok=True)

    for infile in glob.glob(os.path.join(cwd, '*.txt')):
        # if infile == './extract_ko/singapore-sling.txt': continue
        # if infile == './extract_ko/ion-orchard.txt': continue
        print(infile)
        (PATH, FILENAME) = os.path.split(infile)
        outfile = os.path.join(wrd, FILENAME).encode('euc-kr')
        with open(infile, encoding='utf-8') as reader, \
                open(outfile, 'w', encoding='euc-kr') as writer:
            for line in reader:
                for sent in kr_sent_tokenizer.tokenize(line.strip()):
                    # Replace these characters with plain substitutes
                    # before encoding to EUC-KR.
                    sent = sent.replace('\xa0', ' ')    # no-break space
                    sent = sent.replace('\xe7', 'c')    # c with cedilla
                    sent = sent.replace('\xe9', 'e')    # e with acute accent
                    sent = sent.replace('\u2013', '-')  # en dash
                    sent = sent.replace('\xa9', '(c)')  # copyright sign
                    sent = sent.strip()
                    print(sent)
                    writer.write(sent + '\n')
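The script replaces a handful of characters before encoding. To find out which characters in your own data would need the same treatment, something like the following could be run first. This is a minimal sketch that relies only on Python's built-in `euc-kr` codec; the helper name `unencodable_chars` is hypothetical.

```python
def unencodable_chars(text, encoding="euc-kr"):
    """Return a sorted list of the distinct characters in `text`
    that cannot be encoded with `encoding`."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)
```

Calling it on each input file's contents before conversion would list the characters that still need a replacement rule for that data.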

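(sejong-shell source: Liling Tan. 2011. Building the foundation text for Nanyang Technological University - Multilingual Corpus (NTU-MC). Final year project. Singapore: Nanyang Technological University. pp. 44.)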
