I am using the following code for an SVM in Python:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
# class_weight='auto' has been removed from scikit-learn; 'balanced' is its replacement.
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'))
clf.fit(X, y)
proba = clf.predict_proba(X)
But this takes a very long time.
Actual data dimensions:
train-set (1422392,29)
test-set (233081,29)
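How can I speed this up (in parallel or some other way)? Please help. I have already tried PCA and downsampling.
I have 6 classes. Edit: I found http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, but I want probability estimates, and it does not appear to provide them for an SVM.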
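One possible direction, sketched under assumptions rather than taken from the post: `LinearSVC` generally scales much better with the number of samples than `SVC(kernel='linear')`, and wrapping it in `CalibratedClassifierCV` adds `predict_proba`. The iris data, `cv=3` and `max_iter=10000` below are placeholders, not tuned values. `SGDClassifier` with a logistic loss would be another route to probabilities, although the model is then a logistic regression rather than an SVM.

```python
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# LinearSVC handles the multiclass case one-vs-rest on its own.
base = LinearSVC(class_weight="balanced", max_iter=10000)

# Calibration fits a sigmoid on held-out folds to map SVM scores to probabilities.
clf = CalibratedClassifierCV(base, method="sigmoid", cv=3)
clf.fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)
```

Calibration trains the base model once per fold, so the cost grows with `cv`; on 1.4 million rows a small `cv` is probably the sensible starting point.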
Edit:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.linear_model import SGDClassifier
import joblib
import numpy as np
from sklearn import grid_search
import multiprocessing
import numpy as np
import math
def …
I want to load rJava in R x64 3.1.2. OS: Windows 8.1 64-bit.
Although the installation appears to work fine:
> install.packages("rJava")
Installing package into ‘C:/Users/sony/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.utstat.utoronto.ca/bin/windows/contrib/3.1/rJava_0.9-6.zip'
Content type 'application/zip' length 758898 bytes (741 Kb)
opened URL
downloaded 741 Kb
package ‘rJava’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\sony\AppData\Local\Temp\RtmpamYUH7\downloaded_packages
Loading the package fails with an error:
library(rJava)
Error in get(Info[i, 1], envir = env) :
lazy-load database 'C:/Users/sony/Documents/R/win-library/3.1/rJava/R/rJava.rdb' is corrupt
In addition: Warning message:
In …
gonzo ? ~/a/packages ? conda env list
# conda environments:
#
ppo_latest /nohome/jaan/abhishek/anaconda3/envs/ppo_latest
root * /nohome/jaan/abhishek/anaconda3
gonzo ? ~/a/packages ? conda activate ppo_latest
gonzo ? ~/a/packages ? which python (ppo_latest)
/nohome/jaan/abhishek/anaconda3/bin/python
gonzo ? ~/a/packages ? conda deactivate (ppo_latest)
gonzo ? ~/a/packages ? which python
/nohome/jaan/abhishek/anaconda3/bin/python
The environment is activated without errors. We then check which python it points to. It does not change. Why?
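A small diagnostic, offered as a sketch: after activation, `CONDA_PREFIX` should point at the environment and the environment's `bin` directory should come first on `PATH`. If `conda_prefix` names `ppo_latest` while `first_on_path` is still the root interpreter, `PATH` was not updated as expected (a shell alias or a later `PATH` edit in a startup file are possible causes, but these are guesses and not something the post confirms). The function name `describe_python` is made up for illustration.

```python
import os
import shutil
import sys


def describe_python():
    """Collect the facts that show which interpreter is actually in use."""
    return {
        # Interpreter executing this script.
        "running": sys.executable,
        # What `which python` would report for the current PATH.
        "first_on_path": shutil.which("python"),
        # Set by `conda activate`; None when no environment is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # First few PATH entries, where the environment's bin should appear.
        "path_head": os.environ.get("PATH", "").split(os.pathsep)[:3],
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

Running this once before and once after `conda activate ppo_latest` shows which of the values change.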
import torch,ipdb
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
input = Variable(torch.randn(5, 3, 10))
h0 = Variable(torch.randn(2, 3, 20))
c0 = Variable(torch.randn(2, 3, 20))
output, hn = rnn(input, (h0, c0))
This is the LSTM example from the documentation. I do not understand the following things:
Edit:
import torch,ipdb
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
num_layers=3 …
How do I obtain the eigenvalues and eigenvectors from a PCA fit?
from sklearn.decomposition import PCA
clf=PCA(0.98,whiten=True) #conserve 98% variance
X_train=clf.fit_transform(X_train)
X_test=clf.transform(X_test)
I could not find it in the documentation.
I am not able to understand the different results here.
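A minimal sketch of how the two relate, on synthetic data and without whitening: the eigenvalues of the sample covariance matrix correspond to `explained_variance_`, and the eigenvectors correspond to the rows of `components_`, up to sign. Differences from a hand-rolled version often come from three places: `PCA` only centers the data (it does not divide by the standard deviation), it expects one sample per row, and it stores eigenvectors as rows rather than columns. The data shape and `n_components=3` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 5)   # synthetic data: 200 samples, 5 features

pca = PCA(n_components=3).fit(X)

# Manual route: eigendecomposition of the covariance of the centered data.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]           # eigh returns ascending order
evals, evecs = evals[order], evecs[:, order]

# Eigenvalues versus explained_variance_.
same_values = np.allclose(evals[:3], pca.explained_variance_)
# Eigenvectors (columns of evecs) versus rows of components_, ignoring sign.
same_vectors = np.allclose(np.abs(evecs[:, :3].T), np.abs(pca.components_), atol=1e-6)
print(same_values, same_vectors)
```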
Edit:
def pca_code(data):
    #raw_implementation
    var_per=.98
    data-=np.mean(data, axis=0)
    data/=np.std(data, axis=0)
    cov_mat=np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    evals = evals[idx]
    variance_retained=np.cumsum(evals)/np.sum(evals)
    index=np.argmax(variance_retained>=var_per)
    evecs = evecs[:,:index+1]
    reduced_data=np.dot(evecs.T, data.T).T
    print(evals)
    print("_"*30)
    print(evecs)
    print("_"*30)
    #using the scikit-learn PCA class
    clf=PCA(var_per)
    X_train=data.T
    X_train=clf.fit_transform(X_train)
    print(clf.explained_variance_)
    print("_"*30)
    print(clf.components_)
    print("__"*30)
ipdb> outputs.size()
torch.Size([10, 100])
ipdb> print sum(outputs,0).size(),sum(outputs,1).size(),sum(outputs,2).size()
(100L,) (100L,) (100L,)
How do I sum over the columns?
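A short sketch of one way to do it. Python's built-in `sum(iterable, start)` treats its second argument as a starting value, not as an axis, which is why all three calls in the session above produce the same shape. `torch.sum` takes the dimension to reduce instead. The random tensor below only stands in for `outputs`.

```python
import torch

outputs = torch.randn(10, 100)   # stand-in with the same shape as in the session

# dim=0 collapses the 10 rows: one total per column.
col_sums = torch.sum(outputs, dim=0)
# dim=1 collapses the 100 columns: one total per row.
row_sums = outputs.sum(dim=1)

print(col_sums.size(), row_sums.size())
```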
I get the following error with the imports below. It seems to be related to the pandas import. I am not sure how to debug or fix this.
Imports:
import pandas as pd
import numpy as np
import pdb, math, pickle
import matplotlib.pyplot as plt
Error:
In [1]: %run NN.py
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
/home/abhishek/Desktop/submission/a1/new/NN.py in <module>()
2 import numpy as np
3 import pdb, math, pickle
----> 4 import matplotlib.pyplot as plt
5
6 class NN(object):
/home/abhishek/anaconda3/lib/python3.5/site-packages/matplotlib/pyplot.py in <module>()
112
113 from matplotlib.backends import pylab_setup
--> 114 _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
115
116 _IP_REGISTERED = None
/home/abhishek/anaconda3/lib/python3.5/site-packages/matplotlib/backends/__init__.py in pylab_setup()
30 # …
I am reading text from HTML files and doing some analysis. These .html files are news articles.
Code:
html = open(filepath,'r').read()
raw = nltk.clean_html(html)
raw.unidecode(item.decode('utf8'))
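Now I only want the article content, not other text such as ads and headings. How can I do this with reasonable accuracy in Python?
I know of tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques that use bs4, but they are limited to one type of page. I have news pages from many sources. Sample code is also hard to find.
I am looking for something in Python that does exactly what http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf describes.
Edit: For better understanding, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general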
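As a rough illustration of the idea only, here is a standard-library sketch that keeps `<p>` blocks which are long enough and not dominated by link text. This is a crude heuristic, not the algorithm from the linked paper and not what boilerpipe does. The names `ParagraphCollector` and `extract_article_text`, and the thresholds `min_words=10` and `max_link_ratio=0.5`, are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p>, tracking how much of it is link text."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # list of (text, link_char_count)
        self._buf = None       # text pieces of the <p> being read, or None
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends where the next one starts
            self._buf, self._link_chars = [], 0
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._buf is None or self._skip:
            return
        self._buf.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append((text, self._link_chars))
        self._buf = None


def extract_article_text(html, min_words=10, max_link_ratio=0.5):
    """Keep paragraphs that are long enough and not dominated by links."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    kept = [
        text for text, link_chars in parser.paragraphs
        if len(text.split()) >= min_words
        and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)
```

Calling `extract_article_text(open(filepath).read())` returns the surviving paragraphs joined by blank lines. Pages that do not put body text in `<p>` tags would need a different block definition.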
>>> import boilerpipe
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\boilerpipe\__init__.py", line 10, in <module>
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % os.pathsep.join(jars))
File "C:\Anaconda\lib\site-packages\jpype\_core.py", line 50, in startJVM
_jpype.startup(jvm, tuple(args), True)
RuntimeError: Unable to load DLL [C:\Program Files\Java\jre7\bin\client\jvm.dll], error = The specified module could not be found.
at native\common\include\jp_platform_win32.h:58
Tried: reinstalling the JVM
>> import ctypes
>> import os
>> os.chdir(r"<path to Java bin client folder>")
>> ctypes.CDLL("jvm.dll")
Still unable to fix
Edit: I tried the code below and am still stuck:
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
It gives the same error as before.
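Two things that are often worth checking with this kind of error, sketched here as a guess and not as a confirmed diagnosis: whether the `jvm.dll` named in the message actually exists, and whether the Python interpreter and the installed JRE have the same bitness (a 64-bit Python cannot load a 32-bit `jvm.dll`, and vice versa). The helper name `jvm_load_report` is hypothetical.

```python
import os
import struct


def jvm_load_report(jvm_path):
    """Summarise two common reasons a JVM library fails to load."""
    return {
        "jvm_path": jvm_path,
        "file_exists": os.path.isfile(jvm_path),
        # 32 or 64: the JVM library must match the bitness of this interpreter.
        "python_bits": struct.calcsize("P") * 8,
        # Where Java tooling is expected to live, if the variable is set at all.
        "java_home": os.environ.get("JAVA_HOME"),
    }


if __name__ == "__main__":
    # Path copied from the traceback in the post.
    path = r"C:\Program Files\Java\jre7\bin\client\jvm.dll"
    for key, value in jvm_load_report(path).items():
        print(f"{key}: {value}")
```

If `file_exists` is false, or `python_bits` differs from the JRE that is installed, installing a JRE that matches the interpreter would be the first thing to try.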
fig = plt.figure();
ax=plt.gca()
ax.scatter(x,y,c="blue",alpha=0.95,edgecolors='none')
ax.set_yscale('log')
ax.set_xscale('log')
(Pdb) print x,y
[29, 36, 8, 32, 11, 60, 16, 242, 36, 115, 5, 102, 3, 16, 71, 0, 0, 21, 347, 19, 12, 162, 11, 224, 20, 1, 14, 6, 3, 346, 73, 51, 42, 37, 251, 21, 100, 11, 53, 118, 82, 113, 21, 0, 42, 42, 105, 9, 96, 93, 39, 66, 66, 33, 354, 16, 602]
[310000, 150000, 70000, 30000, 50000, 150000, 2000, 12000, 2500, 10000, 12000, 500, 3000, …