Training an NLP log-linear model for NER with scikit-learn

Fra*_*urt 6 nlp scikit-learn

I would like to know how to use sklearn.linear_model.LogisticRegression to train an NLP log-linear model for named-entity recognition (NER).

A typical log-linear model defining the conditional probability is:

p(y | x; v) = exp(v · f(x, y)) / Σ_{y' ∈ Y} exp(v · f(x, y'))

with:

  • x: the current word
  • y: the class being considered for that word
  • f: a feature-vector function, which maps a word x and a class y to a vector of scalars
  • v: the feature-weight vector
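
For concreteness, here is a minimal numerical sketch of that probability; the feature function, classes, and weights below are invented purely for illustration:

import numpy as np

classes = ['PER', 'ORG', 'O']
v = np.array([2.0, 1.5, 0.5])          # hypothetical feature weights

# Hypothetical feature-vector function: maps a word x and a class y to a vector of scalars.
def f(x, y):
    return np.array([
        1.0 if y == 'PER' and x[0].isupper() else 0.0,  # capitalized word tagged PER
        1.0 if y == 'ORG' and x.isupper() else 0.0,     # all-caps word tagged ORG
        1.0 if y == 'O' else 0.0,                       # bias feature for class O
    ])

def p(y, x):
    # p(y | x; v) = exp(v . f(x, y)) / sum over y' of exp(v . f(x, y'))
    scores = np.array([v @ f(x, y_prime) for y_prime in classes])
    return np.exp(v @ f(x, y)) / np.exp(scores).sum()

print(p('PER', 'Frankfurt'))           # ~0.74 with these made-up weights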

Can such a model be trained with sklearn.linear_model.LogisticRegression?

The issue is that the features depend on the class.
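
One standard way to see how this fits (a sketch of the usual reduction, nothing scikit-learn specific): if the joint features have the block form "copy the word features φ(x) into the block belonging to class y, zeros elsewhere", then v · f(x, y) = w_y · φ(x), which is exactly what a multiclass logistic regression with one weight vector per class computes. A minimal check of that identity, with made-up dimensions:

import numpy as np

n_classes, n_word_feats = 3, 4
rng = np.random.default_rng(0)
phi_x = rng.random(n_word_feats)              # word features phi(x), depend on x only

# Class-conjoined joint features: phi(x) placed in the block belonging to class y,
# zeros everywhere else (one block of length n_word_feats per class).
def f(phi_x, y):
    out = np.zeros(n_classes * n_word_feats)
    out[y * n_word_feats:(y + 1) * n_word_feats] = phi_x
    return out

v = rng.random(n_classes * n_word_feats)      # one weight vector for the joint features
W = v.reshape(n_classes, n_word_feats)        # the same weights viewed as one row per class

for y in range(n_classes):
    # v . f(x, y) equals w_y . phi(x) for every class y
    assert np.isclose(v @ f(phi_x, y), W[y] @ phi_x)
print("block-structured f(x, y) and per-class weight vectors agree")

Joint features that cannot be written in this per-class block form are not covered by this reduction.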

Fra*_*urt 7

In scikit-learn 0.16 and later, you can use the multinomial option of sklearn.linear_model.LogisticRegression to train a log-linear model (a.k.a. MaxEnt classifier, multinomial logistic regression). Currently, the multinomial option is supported only by the 'lbfgs' and 'newton-cg' solvers.

Example with the Iris dataset (4 features, 3 classes, 150 samples):

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import print_function
from __future__ import division

import numpy as np
from sklearn import linear_model, datasets
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Import data 
iris = datasets.load_iris()
X = iris.data # features
y_true = iris.target # labels

# Look at the size of the feature matrix and the label vector:
print('iris.data.shape: {0}'.format(iris.data.shape))
print('iris.target.shape: {0}\n'.format(iris.target.shape))

#  Instantiate a MaxEnt model
logreg = linear_model.LogisticRegression(C=1e5, multi_class='multinomial', solver='lbfgs')

# Train the model
logreg.fit(X, y_true)
print('logreg.coef_: \n{0}\n'.format(logreg.coef_))
print('logreg.intercept_: \n{0}'.format(logreg.intercept_))

# Use the model to make predictions
y_pred = logreg.predict(X)
print('\ny_pred: \n{0}'.format(y_pred))

# Assess the quality of the predictions
print('\nconfusion_matrix(y_true, y_pred):\n{0}\n'.format(confusion_matrix(y_true, y_pred)))
print('classification_report(y_true, y_pred): \n{0}'.format(classification_report(y_true, y_pred)))
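
Coming back to the NER question itself, the same multinomial LogisticRegression can be trained on per-token features extracted into dicts and vectorized with DictVectorizer. This is only a minimal sketch with invented sentences, labels, and feature names, and it classifies each token independently rather than modelling the label sequence:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Per-token feature extraction. The feature names and the tiny corpus below are
# invented purely for illustration; a real NER system would use many more features.
def token_features(sentence, i):
    w = sentence[i]
    return {
        'word.lower': w.lower(),
        'word.istitle': w.istitle(),
        'word.isupper': w.isupper(),
        'prev.lower': sentence[i - 1].lower() if i > 0 else '<s>',
    }

sentences = [['John', 'lives', 'in', 'Berlin'],
             ['ACME', 'hired', 'Mary']]
labels    = [['PER',  'O',     'O',  'LOC'],
             ['ORG',  'O',     'PER']]

X_dicts = [token_features(s, i) for s in sentences for i in range(len(s))]
y = [tag for sent_tags in labels for tag in sent_tags]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)   # sparse matrix, one row of indicator/numeric features per token

# Same multinomial setting as the Iris example above
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
clf.fit(X, y)

test = ['Mary', 'visited', 'ACME']
X_test = vec.transform([token_features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(X_test))))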

The multinomial option for sklearn.linear_model.LogisticRegression was introduced in version 0.16; from the changelog:

  • Added option multi_class="multinomial" in linear_model.LogisticRegression to implement a Logistic Regression solver that minimizes the cross-entropy or multinomial loss instead of the default One-vs-Rest setting. Supports lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.