I'm new to Spark; my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example in the Spark Python MLlib docs:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
The attributes I found on model are:
In [21]: model.<TAB>
model.clearThreshold model.predict model.weights
model.intercept model.setThreshold
How do I get the coefficients of the logistic regression?
As you noticed, the way to get the coefficients is through the attributes of LogisticRegressionModel.
Parameters:

weights – Weights computed for every feature.
intercept – Intercept computed for this model. (Only used in binary logistic regression. In multinomial logistic regression, the intercept is not a single value, so the intercepts are part of the weights.)
numFeatures – The dimension of the features.
numClasses – The number of possible outcomes for a k-class classification problem in multinomial logistic regression. By default it is binary logistic regression, so numClasses will be set to 2.
Don't forget that hθ(x) = 1 / (1 + exp(-(θ0 + θ1·x1 + ... + θn·xn))), where θ0 is the intercept, [θ1, ..., θn] are the weights, and n is the number of features.
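To see that these pieces really are the whole model, you can recompute a binary prediction by hand from the formula above (a sketch with made-up weights and inputs; only the arithmetic matters):

```python
from math import exp

def sigmoid(margin):
    # h_theta(x) = 1 / (1 + exp(-margin))
    return 1.0 / (1.0 + exp(-margin))

weights = [0.5, -0.25]   # theta_1, theta_2 (hypothetical)
intercept = 0.1          # theta_0 (hypothetical)
x = [2.0, 4.0]

margin = intercept + sum(w * xi for w, xi in zip(weights, x))
prob = sigmoid(margin)
print(prob)  # probability of class 1; about 0.525 for these numbers
```

With the default threshold of 0.5 this point would be predicted as class 1; model.clearThreshold() makes predict return the raw probability instead.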
As you can see, this is how the prediction is done; you can check the source code of LogisticRegressionModel.
def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
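Note the two branches on margin in the binary case above: they are a numerically stable sigmoid. For a large negative margin, exp(-margin) would overflow, so the code switches to the algebraically equivalent exp(margin) / (1 + exp(margin)) form, whose exp() argument is never positive. A standalone sketch of the same trick:

```python
from math import exp

def stable_sigmoid(margin):
    # Mirrors the branching in LogisticRegressionModel.predict: pick the
    # form whose exp() argument is non-positive, so exp() cannot overflow.
    if margin > 0:
        return 1.0 / (1.0 + exp(-margin))
    exp_margin = exp(margin)
    return exp_margin / (1.0 + exp_margin)

print(stable_sigmoid(0.0))      # 0.5
print(stable_sigmoid(-1000.0))  # 0.0 -- the naive 1/(1+exp(1000)) form would raise OverflowError
```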