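I have recently been trying to learn Apache Spark as a replacement for Scikit Learn, but it seems to me that even in simple cases Scikit converges to an accurate model much faster than Spark does. For example, I generated 1000 data points for a very simple linear function (z = x + y) with the following script: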
from random import random

def func(in_vals):
    '''result = x (+y+z+w....)'''
    result = 0
    for v in in_vals:
        result += v
    return result

if __name__ == "__main__":
    entry_count = 1000
    dim_count = 2
    in_vals = [0]*dim_count
    with open("data_yequalsx.csv", "w") as out_file:
        for entry in range(entry_count):
            for i in range(dim_count):
                in_vals[i] = random()
            out_val = func(in_vals)
            out_file.write(','.join([str(x) for x in in_vals]))
            out_file.write(",%s\n" % str(out_val))
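Since the generated data is noiseless, an ordinary least-squares fit should recover the coefficients essentially exactly, which gives a baseline to compare any library against. The following is only a minimal sketch under stated assumptions, not either of the scripts from this question: it regenerates the same kind of data in memory instead of reading the CSV, and it solves the normal equations by hand in plain Python rather than calling Scikit or Spark. The helper names `generate_rows`, `solve_linear_system`, and `fit_ols` are hypothetical.

```python
from random import random, seed


def generate_rows(entry_count=1000, dim_count=2):
    """Return rows of [x_1, ..., x_n, target], where target is the sum of the inputs."""
    rows = []
    for _ in range(entry_count):
        in_vals = [random() for _ in range(dim_count)]
        rows.append(in_vals + [sum(in_vals)])
    return rows


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(rows):
    """Fit target = w . x + b via the normal equations; return (weights, intercept)."""
    design = [row[:-1] + [1.0] for row in rows]  # constant column models the intercept
    targets = [row[-1] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    params = solve_linear_system(xtx, xty)
    return params[:-1], params[-1]


seed(0)  # fixed seed only so the sketch is repeatable
weights, intercept = fit_ols(generate_rows())
print(weights, intercept)
```

For z = x + y the weights should come out close to 1 and the intercept close to 0, up to floating-point rounding.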
I then ran the following Scikit script:
import sklearn
from sklearn import linear_model
import numpy as np …
machine-learning linear-regression scikit-learn apache-spark