Stanford Parser and NLTK

Tha*_*ray 89 python parsing nlp nltk stanford-nlp

Is it possible to use the Stanford Parser in NLTK? (I am not talking about the Stanford POS tagger.)

dan*_*r89 87

Note that this answer applies to NLTK v3.0, and not to more recent versions.

Sure, try the following in Python:

import os
from nltk.parse import stanford
os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars'
os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars'

parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print(sentences)

# GUI
for line in sentences:
    for sentence in line:
        sentence.draw()

Output:

[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]), Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]

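Note 1: In this example, both the parser jar and the models jar are in the same folder.

Note 2:

  • The file name of the Stanford parser is: stanford-parser.jar
  • The file name of the Stanford models is: stanford-parser-xxx-models.jar

Note 3: The englishPCFG.ser.gz file can be found inside the models.jar file (/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz). Use some archive manager to "unzip" the models.jar file.

Note 4: Make sure you are using Java JRE (Runtime Environment) 1.8, also known as Oracle JDK 8. Otherwise you will get: Unsupported major.minor version 52.0.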
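The "Unsupported major.minor version 52.0" error means the jar was compiled for Java 8 (class file version 52) but is being run on an older JRE. As a rough sketch, the installed version can be checked from Python before calling the parser. The helper names `parse_java_major` and `installed_java_major` are made up for this illustration, and the regular expression assumes the usual `java -version` banner format.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as "1.8", "1.7", ...
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major(java_cmd="java"):
    """Run `java -version` (the banner goes to stderr) and parse it."""
    try:
        proc = subprocess.run(
            [java_cmd, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return None  # no java executable found
    return parse_java_major(proc.stderr + proc.stdout)
```

If `installed_java_major()` returns a number below 8 (or None), the parser jars mentioned above will most likely refuse to load.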

Installation

  1. Download NLTK v3 from https://github.com/nltk/nltk and install NLTK:

    sudo python setup.py install

  2. You can use the NLTK downloader to get the Stanford Parser, using Python:

    import nltk
    nltk.download()
    
  3. Try my example! (Don't forget to change the jar paths and to change the model path to the ser.gz location.)

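OR:

  1. Download and install NLTK v3, same as above.

  2. Download the latest version (the current version's filename is stanford-parser-full-2015-01-29.zip) from: http://nlp.stanford.edu/software/lex-parser.shtml#Download

  3. Extract stanford-parser-full-20xx-xx-xx.zip.

  4. Create a new folder ('jars' in my example). Place the extracted files into this jars folder: stanford-parser-3.xx-models.jar and stanford-parser.jar.

    As shown above, you can use the environment variables (STANFORD_PARSER and STANFORD_MODELS) to point to this 'jars' folder. I'm using Linux, so if you use Windows, use something like: C://folder//jars.

  5. Open stanford-parser-3.xx-models.jar using an archive manager (7zip).

  6. Browse inside the jar file to edu/stanford/nlp/models/lexparser. Again, extract the file called 'englishPCFG.ser.gz'. Remember the location where you extract this ser.gz file.

  7. When creating a StanfordParser instance, you can provide the model path as a parameter. This is the complete path to the model, in our case /location/of/englishPCFG.ser.gz.

  8. Try my example! (Don't forget to change the jar paths and to change the model path to the ser.gz location.)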
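Steps 5 and 6 can also be done without an archive manager, since a jar file is an ordinary zip archive. The following is only a sketch: the function name `extract_model` is invented here, and the jar and destination paths are whatever you chose in the steps above.

```python
import zipfile

# Location of the English PCFG model inside the models jar (from the steps above).
MODEL_MEMBER = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"


def extract_model(models_jar, dest_dir, member=MODEL_MEMBER):
    """Extract a single model file from the models jar and return its path."""
    with zipfile.ZipFile(models_jar) as jar:
        # extract() recreates the member's directory structure under dest_dir.
        return jar.extract(member, path=dest_dir)
```

The returned path is the value to pass as `model_path` when creating the StanfordParser instance.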

  • @alexis: download nltk 3.0 from [here](https://github.com/nltk/nltk). @Nick Retallack: it should be changed to `raw_parse_sents()` (5 upvotes)

alv*_*vas 77

Deprecated answer

The answer below is deprecated; please use the solution at /sf/answers/3638709651/ for NLTK v3.3 and above.


EDITED

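Note: The following answer will only work on:

  • NLTK version >= 3.2.4
  • Stanford tools compiled since 2015-04-20
  • Python 2.7, 3.4 and 3.5 (Python 3.6 is not yet officially supported)

As both tools change rather quickly, the API might look very different 3-6 months later. Please treat the following answer as a temporary rather than an eternal fix.

Always refer to https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software for the latest instructions on how to interface the Stanford NLP tools using NLTK!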


TL;DR

cd $HOME

# Update / Install NLTK
pip install -U nltk

# Download the Stanford NLP tools
wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
# Extract the zip file.
unzip stanford-ner-2015-04-20.zip 
unzip stanford-parser-full-2015-04-20.zip 
unzip stanford-postagger-full-2015-04-20.zip


export STANFORDTOOLSDIR=$HOME

export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/stanford-postagger.jar:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/stanford-ner.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar

export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/models:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/classifiers

Then:

>>> from nltk.tag.stanford import StanfordPOSTagger
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> st.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]

>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') 
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]


>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]

>>> from nltk.parse.stanford import StanfordDependencyParser
>>> dep_parser=StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> print([parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")])
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])])]

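In Long:


Firstly, one must note that the Stanford NLP tools are written in Java and NLTK is written in Python. The way NLTK interfaces with the tools is by calling the Java tools through the command-line interface.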
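To make the "calling Java through the command line" idea concrete, here is a rough sketch of the kind of command that ends up being run for the parser. It is not NLTK's actual code: the function name `build_lexparser_command` is invented, and the flags (`-model`, `-sentences newline`, `-outputFormat penn`) are assumed from the Stanford parser's command-line interface.

```python
import os


def build_lexparser_command(jar_paths, model_path, input_file,
                            java="java", max_heap="1000m"):
    """Build the argument list for one command-line run of the Stanford parser."""
    # Java expects the classpath entries joined with ':' on Unix and ';' on Windows.
    classpath = os.pathsep.join(jar_paths)
    return [
        java, "-mx" + max_heap,
        "-cp", classpath,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-model", model_path,
        "-sentences", "newline",   # one sentence per input line
        "-outputFormat", "penn",   # bracketed Penn Treebank trees
        input_file,
    ]
```

The resulting list could be handed to `subprocess.check_output`, and the bracketed text it prints is what gets turned into Tree objects on the Python side.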

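Secondly, the NLTK API to the Stanford NLP tools has changed quite a lot since version 3.1. So it is advisable to update your NLTK package to v3.1.

Thirdly, the NLTK API to the Stanford NLP tools wraps around the individual NLP tools, e.g. the Stanford POS tagger, the Stanford NER tagger and the Stanford Parser.

For the POS and NER taggers, it does not wrap around the Stanford Core NLP package.

The Stanford Parser is a special case, where it wraps around both the Stanford Parser and Stanford Core NLP (personally, I have not used the latter through NLTK; I would rather follow @dimazest's demonstration at http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html)

Note that as of NLTK v3.1, the STANFORD_JAR and STANFORD_PARSER variables are deprecated and no longer used.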


In Longer:


STEP 1

Assuming that you have installed Java appropriately on your OS.

Now, install/update your NLTK version (see http://www.nltk.org/install.html):

  • Using pip: sudo pip install -U nltk
  • Debian distro (using apt-get): sudo apt-get install python-nltk

For Windows (use the 32-bit binary installation):

  1. Install Python 3.4: http://www.python.org/downloads/ (avoid the 64-bit versions)
  2. Install Numpy (optional): http://sourceforge.net/projects/numpy/files/NumPy/ (the version that specifies python3.4)
  3. Install NLTK: http://pypi.python.org/pypi/nltk
  4. Test installation: Start > Python34, then type import nltk

(Why not 64-bit? See https://github.com/nltk/nltk/issues/1079)


Then, out of paranoia, recheck your nltk version inside python:

from __future__ import print_function
import nltk
print(nltk.__version__)

Or on the command line:

python3 -c "import nltk; print(nltk.__version__)"

Ensure that you see 3.1 as the output.
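If the check should happen in a script rather than by eye, comparing the version strings directly is fragile ("3.10" sorts before "3.9" as text). A small sketch of a numeric comparison follows; `version_tuple` and `is_at_least` are names invented for this illustration.

```python
import re


def version_tuple(version):
    """Turn '3.2.4' into (3, 2, 4); stop at suffixes such as 'dev0'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(version, minimum):
    """Compare two dotted version strings numerically, padding with zeros."""
    have, need = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Usage, assuming nltk is installed:
#   import nltk
#   is_at_least(nltk.__version__, "3.1")
```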

For even more paranoia, check that all your favorite Stanford NLP tool APIs are available:

from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer

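(Note: The imports above will only ensure that you are using a correct NLTK version that contains these APIs. Not seeing errors in the import doesn't mean that you have successfully configured the NLTK API to use the Stanford tools.)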


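STEP 2

Now that you have checked that you have the correct version of NLTK that contains the necessary Stanford NLP tools interface, you need to download and extract all the necessary Stanford NLP tools.

TL;DR, in Unix: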

cd $HOME

# Download the Stanford NLP tools
wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
# Extract the zip file.
unzip stanford-ner-2015-04-20.zip 
unzip stanford-parser-full-2015-04-20.zip 
unzip stanford-postagger-full-2015-04-20.zip

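In Windows / Mac: download the same three zip files from the URLs above and extract them manually.


STEP 3

Set up the environment variables such that NLTK can find the relevant file paths automatically. You have to set the following variables:

  • Add the appropriate Stanford NLP .jar file to the CLASSPATH environment variable.

    • e.g. for the NER, it will be stanford-ner-2015-04-20/stanford-ner.jar
    • e.g. for the POS, it will be stanford-postagger-full-2015-04-20/stanford-postagger.jar
    • e.g. for the parser, it will be stanford-parser-full-2015-04-20/stanford-parser.jar and the parser model jar file, stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar
  • Add the appropriate model directory to the STANFORD_MODELS variable (i.e. the directory where the pre-trained models are saved)

    • e.g. for the NER, it will be in stanford-ner-2015-04-20/classifiers/
    • e.g. for the POS, it will be in stanford-postagger-full-2015-04-20/models/
    • e.g. for the Parser, there won't be a model directory.

In the code, you can see that it searches the STANFORD_MODELS directories before appending the model name. Also note that the API automatically tries to search the OS environment for the `CLASSPATH` as well.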
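The lookup described here can be pictured roughly as follows. This is a simplified sketch of the idea, not NLTK's implementation; the function name `find_model` is made up, and only the STANFORD_MODELS variable name comes from the answer.

```python
import os


def find_model(model_name, search_path=None):
    """Return the first existing path for model_name, or None if not found."""
    if search_path is None:
        search_path = os.environ.get("STANFORD_MODELS", "")
    # A model name that is already a valid path is used as-is.
    if os.path.isfile(model_name):
        return model_name
    # Otherwise try each directory listed in the search path, in order.
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, model_name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

This is why a bare file name such as 'english-bidirectional-distsim.tagger' is enough once STANFORD_MODELS lists the directory that contains it.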

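Note that as of NLTK v3.1, the STANFORD_JAR variable is deprecated and no longer used. Code snippets found in older Stackoverflow questions that rely on it might not work.

TL;DR for STEP 3 on Ubuntu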

export STANFORDTOOLSDIR=/home/path/to/stanford/tools/

export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/stanford-postagger.jar:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/stanford-ner.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar

export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/models:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/classifiers

(For Windows: see /sf/answers/1202349641/ for instructions on setting environment variables)
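Where shell exports are inconvenient (for instance on Windows, where the path separator is ';' rather than ':'), the same two values can be assembled in Python with `os.pathsep`. This is a sketch under the assumption that the three 2015-04-20 archives were extracted side by side in one directory; `stanford_environment` is a name invented for the example.

```python
import os


def stanford_environment(tools_dir):
    """Build CLASSPATH and STANFORD_MODELS values for the 2015-04-20 tools."""
    join = os.path.join
    postagger = join(tools_dir, "stanford-postagger-full-2015-04-20")
    ner = join(tools_dir, "stanford-ner-2015-04-20")
    parser = join(tools_dir, "stanford-parser-full-2015-04-20")

    classpath = [
        join(postagger, "stanford-postagger.jar"),
        join(ner, "stanford-ner.jar"),
        join(parser, "stanford-parser.jar"),
        join(parser, "stanford-parser-3.5.2-models.jar"),
    ]
    models = [join(postagger, "models"), join(ner, "classifiers")]

    # os.pathsep is ':' on Unix and ';' on Windows.
    return {
        "CLASSPATH": os.pathsep.join(classpath),
        "STANFORD_MODELS": os.pathsep.join(models),
    }


# Usage: apply the values before creating any tagger or parser object.
#   os.environ.update(stanford_environment(os.path.expanduser("~")))
```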

You must set the variables as above before starting python, then:

>>> from nltk.tag.stanford import StanfordPOSTagger
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> st.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]

>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') 
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]


>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]

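Alternatively, you could try adding the environment variables inside python, as the previous answers have suggested, but you can also directly tell the parser/tagger to initialize from the direct path where you kept the .jar file and your models.

There is no need to set the environment variables if you use the following method, but when the API changes its parameter names, you will need to change your code accordingly. That is why it is more advisable to set the environment variables than to modify your python code to suit the NLTK version.

For example (without setting any environment variables):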

# POS tagging:

from nltk.tag import StanfordPOSTagger

stanford_pos_dir = '/home/alvas/stanford-postagger-full-2015-04-20/'
eng_model_filename= stanford_pos_dir + 'models/english-left3words-distsim.tagger'
my_path_to_jar= stanford_pos_dir + 'stanford-postagger.jar'

st = StanfordPOSTagger(model_filename=eng_model_filename, path_to_jar=my_path_to_jar) 
st.tag('What is the airspeed of an unladen swallow ?'.split())


# NER Tagging:
from nltk.tag import StanfordNERTagger

stanford_ner_dir = '/home/alvas/stanford-ner/'
eng_model_filename= stanford_ner_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'
my_path_to_jar= stanford_ner_dir + 'stanford-ner.jar'

st = StanfordNERTagger(model_filename=eng_model_filename, path_to_jar=my_path_to_jar) 
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())

# Parsing:
from nltk.parse.stanford import StanfordParser

stanford_parser_dir = '/home/alvas/stanford-parser/'
eng_model_path = stanford_parser_dir  + "edu/stanford/nlp/models/lexparser/englishRNN.ser.gz"
my_path_to_models_jar = stanford_parser_dir  + "stanford-parser-3.5.2-models.jar"
my_path_to_jar = stanford_parser_dir  + "stanford-parser.jar"

parser=StanfordParser(model_path=eng_model_path, path_to_models_jar=my_path_to_models_jar, path_to_jar=my_path_to_jar)


alv*_*vas 22

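Deprecated answer

The answer below is deprecated; please use the solution at /sf/answers/3638709651/ for NLTK v3.3 and above.


Edited

As of the current Stanford parser (2015-04-20), the default output of lexparser.sh has changed, so the script below will not work.

But this answer is kept for legacy's sake; it will still work with http://nlp.stanford.edu/software/stanford-parser-2012-11-12.zip.


Original answer

I suggest you don't mess with Jython or JPype. Let python do python stuff and let java do java stuff, and get the Stanford Parser output through the console.

After you've installed the Stanford Parser in your home directory ~/, just use this python recipe to get the flat bracketed parse: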

import os
sentence = "this is a foo bar i want to parse."

os.popen("echo '"+sentence+"' > ~/stanfordtemp.txt")
parser_out = os.popen("~/stanford-parser-2012-11-12/lexparser.sh ~/stanfordtemp.txt").readlines()

# startswith() also copes with blank output lines, where indexing [0] would raise IndexError
bracketed_parse = " ".join(i.strip() for i in parser_out if i.strip().startswith("("))
print(bracketed_parse)

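  • Be careful with this. If your input contains any `'`s (single quotes), you'll get some strange errors. [There are better ways](https://docs.python.org/2/library/subprocess.html#subprocess.call) to call things on the command line (3 upvotes)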
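Following up on that comment, a sketch of the same recipe that avoids the shell altogether: the sentence is written to a temporary file from Python and the script is called with an argument list, so quotes in the input need no escaping. The function names are invented here, and the lexparser.sh path is the one assumed in the answer.

```python
import os
import subprocess
import tempfile


def bracketed_parse(parser_output):
    """Join the output lines that belong to the bracketed tree."""
    lines = (line.strip() for line in parser_output.splitlines())
    return " ".join(line for line in lines if line.startswith("("))


def parse_with_lexparser(sentence,
                         script="~/stanford-parser-2012-11-12/lexparser.sh"):
    """Run lexparser.sh on one sentence without going through the shell."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(sentence + "\n")
        temp_path = handle.name
    try:
        output = subprocess.check_output(
            [os.path.expanduser(script), temp_path],
            stderr=subprocess.DEVNULL,   # discard the parser's progress messages
            universal_newlines=True,
        )
    finally:
        os.remove(temp_path)
    return bracketed_parse(output)
```

Like the original recipe, this only applies to parser releases whose lexparser.sh output still has the older layout.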

alv*_*vas 16

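As of NLTK v3.3, users should avoid the Stanford NER or POS taggers from nltk.tag, and avoid the Stanford tokenizer/segmenter from nltk.tokenize.

Instead use the new nltk.parse.corenlp.CoreNLPParser API.

Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK


(To avoid a link-only answer, I've pasted the docs from the NLTK github wiki below)

First, update your NLTK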

pip3 install -U nltk # Make sure is >=3.3

Then download the necessary CoreNLP packages:

cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27

# Get the Chinese model 
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

# Get the Arabic model
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties 

# Get the French model
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties 

# Get the German model
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties 


# Get the Spanish model
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties 

English

Still in the stanford-corenlp-full-2018-02-27 directory, start the server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 & 
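The server takes a while to load its models, and requests sent before it is listening will fail with a connection error. A minimal sketch for waiting until the port accepts connections is shown below; `wait_for_port` is a helper invented for this example, and it only checks that the TCP port is open, not that every annotator has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.time() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.time() >= deadline:
                return False
            time.sleep(interval)


# Usage with the server started above:
#   wait_for_port("localhost", 9000, timeout=60)
```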

Then in Python:

>>> from nltk.parse import CoreNLPParser

# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')

# Parse tokenized text.
>>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]

# Parse raw string.
>>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]

# Neural Dependency Parser
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
>>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
[[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]


# Tokenizer
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']

# POS Tagger
>>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]

# NER Tagger
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
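The nested tuples returned by `triples()` are a bit hard to read. As a small convenience sketch (the function `format_triples` is not part of NLTK, just an illustration), they can be flattened into one line per dependency:

```python
def format_triples(triples):
    """Render ((governor, tag), relation, (dependent, tag)) triples as text."""
    return [
        "{0}/{1} --{2}--> {3}/{4}".format(gov, gov_tag, rel, dep, dep_tag)
        for (gov, gov_tag), rel, (dep, dep_tag) in triples
    ]


# Two of the triples from the dependency parser output above.
example = [
    (("What", "WP"), "cop", ("is", "VBZ")),
    (("airspeed", "NN"), "det", ("the", "DT")),
]
for line in format_triples(example):
    print(line)
```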

Chinese

Start the server a little differently, still from the stanford-corenlp-full-2018-02-27 directory:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001  -port 9001 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'我家没有电脑。'))
['我家', '没有', '电脑', '。']

>>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
[Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]

Arabic

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005  -port 9005 -timeout 15000

In Python:

>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9005')
>>> text = u'انا حامل'

# Parser.
>>> parser.raw_parse(text)
<list_iterator object at 0x7f0d894c9940>
>>> list(parser.raw_parse(text))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
>>> list(parser.parse(parser.tokenize(text)))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]

# Tokenizer / Segmenter.
>>> list(parser.tokenize(text))
['انا', 'حامل']

# POS tagger
>>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
>>> list(pos_tagger.tag(parser.tokenize(text)))
[('انا', 'PRP'), ('حامل', 'NN')]


# NER tagger
>>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
>>> list(ner_tagger.tag(parser.tokenize(text)))
[('انا', 'O'), ('حامل', 'O')]

French

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004  -port 9004 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9004')
>>> list(parser.parse('Je suis enceinte'.split()))
[Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
>>> pos_tagger.tag('Je suis enceinte'.split())
[('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]

German

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002  -port 9002 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9002')
>>> list(parser.raw_parse('Ich bin schwanger'))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> list(parser.parse('Ich bin schwanger'.split()))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]


>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]

>>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
>>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]

Spanish

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003  -port 9003 -timeout 15000

In Python:

>>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
>>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
>>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
>>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]
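Since every language in this answer gets its own server, port and properties file, it can help to keep them in one table. The sketch below just collects the ports, properties files and annotator lists used in the commands above; the names `SERVERS`, `server_url` and `server_command` are invented for this example.

```python
# language -> (port, properties file or None, annotators to preload)
SERVERS = {
    "english": (9000, None, "tokenize,ssplit,pos,lemma,ner,parse,depparse"),
    "chinese": (9001, "StanfordCoreNLP-chinese.properties",
                "tokenize,ssplit,pos,lemma,ner,parse"),
    "german": (9002, "StanfordCoreNLP-german.properties",
               "tokenize,ssplit,pos,ner,parse"),
    "spanish": (9003, "StanfordCoreNLP-spanish.properties",
                "tokenize,ssplit,pos,ner,parse"),
    "french": (9004, "StanfordCoreNLP-french.properties",
               "tokenize,ssplit,pos,parse"),
    "arabic": (9005, "StanfordCoreNLP-arabic.properties",
               "tokenize,ssplit,pos,parse"),
}


def server_url(language, host="localhost"):
    """URL to pass to CoreNLPParser for the given language."""
    return "http://{0}:{1}".format(host, SERVERS[language][0])


def server_command(language, heap="4g", timeout_ms=15000):
    """Argument list for starting the server, to be run from the CoreNLP directory."""
    port, properties, annotators = SERVERS[language]
    command = ["java", "-Xmx" + heap, "-cp", "*",
               "edu.stanford.nlp.pipeline.StanfordCoreNLPServer"]
    if properties is not None:
        command += ["-serverProperties", properties]
    command += ["-preload", annotators,
                "-status_port", str(port), "-port", str(port),
                "-timeout", str(timeout_ms)]
    return command
```

For example, `CoreNLPParser(server_url("german"), tagtype='pos')` would then point at the German server from the section above.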


sil*_*asm 7

The Stanford Core NLP software page has a list of python wrappers:

http://nlp.stanford.edu/software/corenlp.shtml#Extensions


bob*_*ope 6

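If I remember well, the Stanford parser is a java library, therefore you must have a Java interpreter running on your server/computer.

I used it once on a server, combined with a php script. The script used php's exec() function to make a command-line call to the parser like so: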

<?php

exec( "java -cp /pathTo/stanford-parser.jar -mx100m edu.stanford.nlp.process.DocumentPreprocessor /pathTo/fileToParse > /pathTo/resultFile 2>/dev/null" );

?>

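I don't remember all the details of this command; it basically opened the fileToParse, parsed it, and wrote the output to the resultFile. PHP would then open the result file for further use.

The end of the command redirects the parser's verbose output to /dev/null, to prevent unnecessary command-line information from disturbing the script.

I don't know much about Python, but there might be a way to make command-line calls.

It might not be the exact route you were hoping for, but hopefully it'll give you some inspiration. Best of luck.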
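For what it's worth, the command-line call described in this answer has a direct Python counterpart in the standard `subprocess` module. The sketch below mirrors the PHP example: standard output goes to a result file and the verbose messages on standard error are discarded. `run_to_file` is a name made up for this illustration, and the paths in the usage comment are the same placeholders as in the PHP snippet.

```python
import subprocess


def run_to_file(command, result_path):
    """Run a command, write its stdout to result_path and drop its stderr."""
    with open(result_path, "wb") as result:
        return subprocess.call(command, stdout=result,
                               stderr=subprocess.DEVNULL)


# Usage mirroring the PHP call (placeholder paths):
#   run_to_file(
#       ["java", "-cp", "/pathTo/stanford-parser.jar", "-mx100m",
#        "edu.stanford.nlp.process.DocumentPreprocessor", "/pathTo/fileToParse"],
#       "/pathTo/resultFile",
#   )
```

The return value is the exit code of the command, so 0 means the run finished without error.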


小智 6

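Note that this answer applies to NLTK v3.0, and not to more recent versions.

Here is an adaptation of danger98's code that works with nltk3.0.0 on windoze, and presumably on other platforms as well; adjust the directory names as appropriate for your setup: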

import os
from nltk.parse import stanford
os.environ['STANFORD_PARSER'] = 'd:/stanford-parser'
os.environ['STANFORD_MODELS'] = 'd:/stanford-parser'
os.environ['JAVAHOME'] = 'c:/Program Files/java/jre7/bin'

parser = stanford.StanfordParser(model_path="d:/stanford-grammars/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print(sentences)

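Note that the parsing command has changed (see the source code at www.nltk.org/_modules/nltk/parse/stanford.html), and that you need to define the JAVAHOME variable. I tried to get it to read the grammar file in situ in the jar, but have so far failed to do that.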

