pyspark, logistic regression, how to get the coefficients of each feature

Goi*_*Way 3 python apache-spark pyspark apache-spark-mllib

I am new to Spark and my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example from the Spark Python MLlib docs:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
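For reference, parsePoint just splits each space-separated line into a label (the first field) and a feature vector (the rest). A Spark-free sketch of the same parsing logic, without the LabeledPoint wrapper:

```python
def parse_line(line):
    # Same logic as parsePoint above: first value is the label,
    # the remaining values are the features.
    values = [float(x) for x in line.split(' ')]
    return values[0], values[1:]

label, features = parse_line("1 0.0 2.5 3.0")
print(label, features)  # 1.0 [0.0, 2.5, 3.0]
```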

The attributes of model that I found are:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold  

How can I get the coefficients of the logistic regression?

Alb*_*nto 6

As you noticed, the way to get the coefficients is by using the LogisticRegressionModel's attributes.

Parameters:

- weights – Weights computed for every feature.
- intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)
- numFeatures – The dimension of the features.
- numClasses – The number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression, so numClasses will be set to 2.

Don't forget that hθ(x) = 1 / (1 + exp(-(θ0 + θ1 * x1 + ... + θn * xn))), where θ0 corresponds to the intercept, [θ1, ..., θn] to the weights, and n is the number of features.
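To make the formula concrete, here is a minimal NumPy sketch (the weights, intercept, and feature vector below are made-up values standing in for model.weights, model.intercept, and a data point) that computes hθ(x) by hand:

```python
import numpy as np

def sigmoid_prob(weights, intercept, x):
    # h_theta(x) = 1 / (1 + exp(-(theta0 + theta . x)))
    margin = np.dot(weights, x) + intercept
    return 1.0 / (1.0 + np.exp(-margin))

# Hypothetical values standing in for model.weights and model.intercept
weights = np.array([0.5, -1.2, 0.3])
intercept = 0.1
x = np.array([1.0, 0.0, 2.0])

print(sigmoid_prob(weights, intercept, x))  # ~0.7685
```

On a trained model you would pass model.weights and model.intercept instead, and the result should match model.predict(x) after model.clearThreshold().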

EDIT

As you can see, that is also how the prediction is done; you can check the source code of LogisticRegressionModel.

def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class