Posts by dic*_*e89

How can I parallelize the .predict() method of a scikit-learn SVM (SVC) classifier?

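I recently ran into the following requirement: I have a scikit-learn SVC classifier instance that has already been trained with .fit(), and I need to .predict() a large number of instances.

Is there a way to parallelize only this .predict() call using any of scikit-learn's built-in tools?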

from sklearn import svm

data_train = [[0,2,3],[1,2,3],[4,2,3]]
targets_train = [0,1,0]

clf = svm.SVC(kernel='rbf', degree=3, C=10, gamma=0.3, probability=True)
clf.fit(data_train, targets_train)

# this can be very large (~ a million records)
to_be_predicted = [[1,3,4]]
clf.predict(to_be_predicted)


If anyone knows a solution, I would be very glad if you could share it.
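As far as I know, SVC itself has no n_jobs option for .predict(), so one possible workaround is to split the input into chunks and predict them in separate worker processes with joblib (the library scikit-learn uses internally for its own parallelism). The following is only a minimal sketch under that assumption: the helper name parallel_predict and its n_chunks parameter are made up for illustration, and probability=True is left out because .predict() does not need it.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn import svm


def parallel_predict(clf, X, n_jobs=2, n_chunks=None):
    """Hypothetical helper: predict row chunks of X in parallel workers."""
    X = np.asarray(X)
    n_chunks = n_chunks or n_jobs
    # Never create more chunks than rows; array_split tolerates uneven sizes.
    n_chunks = max(1, min(n_chunks, len(X)))
    chunks = np.array_split(X, n_chunks)
    # Each worker receives a pickled copy of the fitted classifier.
    results = Parallel(n_jobs=n_jobs)(
        delayed(clf.predict)(chunk) for chunk in chunks
    )
    # Chunks come back in submission order, so row order is preserved.
    return np.concatenate(results)


data_train = [[0, 2, 3], [1, 2, 3], [4, 2, 3]]
targets_train = [0, 1, 0]

clf = svm.SVC(kernel='rbf', C=10, gamma=0.3)
clf.fit(data_train, targets_train)

# Stand-in for the large input; the real one may hold ~ a million records.
to_be_predicted = [[1, 3, 4], [0, 2, 3], [4, 2, 3], [1, 2, 3]]
predictions = parallel_predict(clf, to_be_predicted, n_jobs=2)
```

Whether this pays off depends on the data size: every worker gets its own copy of the classifier, so for small inputs the process start-up and pickling cost can outweigh the gain. It is worth timing against a plain clf.predict() call before relying on it.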

python concurrency scikit-learn

dic*_*e89

2016 06-27

8
votes

2
answers

3765
views

Memory-efficient way to union a sequence of RDDs from files in Apache Spark

I am currently trying to train a set of Word2Vec vectors on the UMBC WebBase corpus (about 30 GB of text in 400 files).

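Even on a machine with more than 100 GB of memory, I frequently run out of memory. I run Spark inside the application itself. I tried tuning it a bit, but I cannot get this to work on more than 10 GB of text data. The obvious bottleneck in my implementation is the union of the previously computed RDDs, which is where the out-of-memory exception comes from.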

Perhaps you have the experience to suggest a more memory-efficient implementation than this:

 object SparkJobs {
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")

  val sc = new SparkContext(conf)


  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)

    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par


    var i = 0;
    val props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    props.setProperty("nthreads","2")
    val pipeline = new StanfordCoreNLP(props);

    //preprocess files parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map(file => {
      //preprocess line of file
      println(file.getName() +"-" + file.getTotalSpace())
      val …

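Since the code above is cut off, the following is not a fix for it, only a rough sketch of one way the per-file union might be avoided: SparkContext.textFile accepts a glob pattern, so all files can be read into a single RDD and preprocessed inside Spark instead of building one RDD per file and unioning them. It is written with the PySpark API rather than Scala (the Scala textFile call takes the same kind of path). The names tokenize and load_corpus are hypothetical, and the whitespace tokenizer is just a stand-in for the StanfordCoreNLP pipeline.

```python
import os


def tokenize(line):
    """Stand-in tokenizer (assumption): lowercase and split on whitespace."""
    return line.lower().split()


def load_corpus(sc, folder, pattern="*.txt"):
    """Hypothetical helper: build ONE RDD of token lists for all matching files.

    `sc` is expected to be a pyspark.SparkContext created by the caller.
    """
    # textFile understands glob patterns, so no per-file RDDs are created
    # and no union is needed.
    lines = sc.textFile(os.path.join(folder, pattern))
    # Tokenize lazily on the executors and drop empty lines.
    return lines.map(tokenize).filter(lambda tokens: len(tokens) > 0)


# Usage, assuming pyspark is installed and `path` points at the corpus folder:
#   from pyspark import SparkConf, SparkContext
#   sc = SparkContext(conf=SparkConf().setAppName("TestApp").setMaster("local[*]"))
#   corpus = load_corpus(sc, path)
```

If separate RDDs are really required, passing the whole list to SparkContext.union in one call, rather than chaining rdd.union pairwise, is another option that may be worth trying. Also note that in local[*] mode everything runs inside the driver JVM, so spark.executor.memory probably has little effect there and the driver heap size is the setting to check.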

nlp scala bigdata apache-spark word2vec

dic*_*e89

lucky-day

6
votes

1
answer

3866
views