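I want to plot a confusion matrix to visualize a classifier's performance, but the plot shows only the numeric indices of the labels, not the labels themselves: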
from sklearn.metrics import confusion_matrix
from numpy import array  # needed for the array(...) call below
import pylab as pl
y_test=['business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business', 'business']
pred=array(['health', 'business', 'business', 'business', 'business',
'business', 'health', 'health', 'business', 'business', 'business',
'business', 'business', 'business', 'business', 'business',
'health', 'health', 'business', 'health'],
dtype='|S8')
cm = confusion_matrix(y_test, pred)
pl.matshow(cm)
pl.title('Confusion matrix of the classifier')
pl.colorbar()
pl.show()
How can I add the labels (health, business, etc.) to the confusion matrix?
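One possible way to get the label names onto the axes, sketched under assumptions: the label order is the sorted set of all labels (which is also the order `confusion_matrix` uses when no `labels` argument is given), and the tick labels are set explicitly on the axes. The helper names `labeled_confusion` and `plot_confusion` are hypothetical, and the counting is done in plain Python only to keep the sketch self-contained; `confusion_matrix(y_test, pred, labels=labels)` could be used instead.

```python
def labeled_confusion(y_true, y_pred):
    """Return (labels, matrix); rows are true labels, columns are predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, predicted in zip(y_true, y_pred):
        matrix[index[true]][index[predicted]] += 1
    return labels, matrix


def plot_confusion(labels, matrix):
    """Draw the matrix and put the label names on both axes."""
    import matplotlib.pyplot as plt  # imported here so the counting helper works without it

    fig, ax = plt.subplots()
    image = ax.matshow(matrix)
    fig.colorbar(image)
    # One tick per class, then replace the numeric ticks with the class names.
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    ax.set_title("Confusion matrix of the classifier")
    return fig


labels, matrix = labeled_confusion(
    ["business", "business", "health"],
    ["health", "business", "health"],
)
print(labels, matrix)
```

Calling `plot_confusion(labels, matrix)` followed by `plt.show()` would then display the labeled plot.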
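I have a large list of strings that include spaces (e.g. "New York", "United States", "North Carolina", "United Arab Emirates", "United Kingdom of Great Britain and Northern Ireland", ... and 5000+ more such strings).
I also have a large text that may contain any of these strings (e.g. "I went to New York on my way to North Carolina, and will eventually go to the United Arab Emirates.")
What is the best way to use regular expressions efficiently to detect whether any of these strings occur in the text?
Or should I consider another approach: extract bigrams from the text and check which strings in the list match those bigrams?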
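As a rough sketch of the regular-expression route, one option is to escape every phrase and join them into a single alternation, longest phrase first so that a longer name is preferred over a shorter one that is its prefix. The function name `build_phrase_pattern` and the short phrase list are made up for illustration; whether this scales well to 5000+ phrases would need measuring.

```python
import re


def build_phrase_pattern(phrases):
    """Compile one case-insensitive pattern that matches any of the phrases."""
    # Longest first, so a longer phrase wins over a shorter phrase it starts with.
    ordered = sorted(set(phrases), key=len, reverse=True)
    alternation = "|".join(re.escape(phrase) for phrase in ordered)
    # The lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternation, re.IGNORECASE)


phrases = ["New York", "United States", "North Carolina", "United Arab Emirates"]
text = ("I went to New York on my way to North Carolina, "
        "and will eventually go to the United Arab Emirates.")
pattern = build_phrase_pattern(phrases)
print(pattern.findall(text))
```

The pattern is compiled once and can then be reused across many texts.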
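After an interesting discussion with VoronoiPotato, I came to think it is best to index all the tokens of the items in the large list of strings, and I managed to do that with this function: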
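def indexing_list(li):
    index_dict={}
    for i in range(len(li)): # the original called rl(li), presumably a range(len(...)) helper
        words=li[i].split()
        for j in range(len(words)):
            index=complex(i,j) # packs (item index, token index) into one value
            word=words[j].lower()
            try:
                index_dict[word].append(index)
            except KeyError:
                index_dict[word]=[index]
    return index_dict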
and tried it with this list:
[u'United Kingdom of Great Britain and Northern Ireland', u'Democratic People\u2019s Republic of Korea', u'Democratic Republic of the Congo', u'Lao People\u2019s Democratic Republic', u'Saint Vincent and the Grenadines', u'United Republic of Tanzania', u'Iran (Islamic Republic of)', u'Central African Republic', u'Islamic Republic of Iran', u'United States of America', u'Bosnia and Herzegovina', u'Libyan Arab Jamahiriya', …
Is there a library in Python that can convert words (mainly names) to Arpabet phonetic transcription?
BARBELS -> B AA1 R B AH0 L Z
BARBEQUE -> B AA1 R B IH0 K Y UW2
BARBEQUED -> B AA1 R B IH0 K Y UW2 D
BARBEQUEING -> B AA1 R B IH0 K Y UW2 IH0 NG
BARBEQUES -> B AA1 R B IH0 K Y UW2 Z
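The entries above look like CMU Pronouncing Dictionary entries, so one hedged starting point is a plain dictionary lookup. The sketch below assumes a text in CMUdict format (word, whitespace, phonemes) is available; the function name `parse_cmudict` and the two sample lines are for illustration only. NLTK ships the same dictionary as `nltk.corpus.cmudict`. A lookup only covers listed words, so names missing from the dictionary would still need a grapheme-to-phoneme model.

```python
import re


def parse_cmudict(lines):
    """Map each upper-cased word to a list of pronunciations (lists of phonemes)."""
    pronunciations = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        word, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", word)  # "WORD(1)" marks an alternative pronunciation
        pronunciations.setdefault(word.upper(), []).append(phones)
    return pronunciations


SAMPLE = """\
BARBELS  B AA1 R B AH0 L Z
BARBEQUE  B AA1 R B IH0 K Y UW2
"""
lexicon = parse_cmudict(SAMPLE.splitlines())
print(" ".join(lexicon["BARBEQUE"][0]))
```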
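I want to compare two files regardless of line breaks. If the content is the same but the positions and number of line breaks differ, I want to map the lines in one document to the lines in the other.
Given:
Document 1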
I went to Paris in July 15, where I met some nice people.
And I came back
to NY in Aug 15.
I am planning
to go there soon
after I finish what I do.
Document 2
I went
to Paris
in July 15,
where I met
some nice people.
And I came back to NY in Aug 15.
I am planning to go
there soon after I finish what I do.
I want an algorithm that can determine that line 1 in document 1 contains the same text as lines 1 to 5 in document 2, that lines 2 and 3 in document 1 contain the same text as line 6 in document 2, and so on.
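A minimal sketch of such a mapping, assuming the two documents are identical once all whitespace is ignored: record which line each non-whitespace character comes from in both documents, pair the line numbers character by character, and merge pairs that share a line on either side. The function names are hypothetical.

```python
def line_of_each_char(lines):
    """Return (text without whitespace, 1-based line number of each kept character)."""
    chars, owners = [], []
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if not ch.isspace():
                chars.append(ch)
                owners.append(number)
    return "".join(chars), owners


def map_lines(lines1, lines2):
    """Group line numbers of both documents into blocks that hold the same text."""
    text1, owners1 = line_of_each_char(lines1)
    text2, owners2 = line_of_each_char(lines2)
    if text1 != text2:
        raise ValueError("documents differ in more than whitespace")
    groups = []
    # Both owner lists are non-decreasing, so the sorted pairs are monotone.
    for a, b in sorted(set(zip(owners1, owners2))):
        if groups and (groups[-1][0][-1] == a or groups[-1][1][-1] == b):
            left, right = groups[-1]
            if left[-1] != a:
                left.append(a)
            if right[-1] != b:
                right.append(b)
        else:
            groups.append(([a], [b]))
    return groups


doc1 = ["I went to Paris in July 15, where I met some nice people.",
        "And I came back", "to NY in Aug 15.", "I am planning",
        "to go there soon", "after I finish what I do."]
doc2 = ["I went", "to Paris", "in July 15,", "where I met", "some nice people.",
        "And I came back to NY in Aug 15.", "I am planning to go",
        "there soon after I finish what I do."]
for left, right in map_lines(doc1, doc2):
    print(left, "=", right)
```

Because whitespace is dropped entirely, two words joined without a space would compare equal to the separated form; comparing token by token instead would avoid that.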
1 = 1,2,3,4,5
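2,3 = …
I am working on a project to build high-precision word alignment between sentences and their translations in other languages, in order to measure translation quality. I know that Giza++ and other word alignment tools are used as part of the statistical machine translation pipeline, but that is not what I am looking for. I am looking for an algorithm that maps the words in the source sentence to the corresponding words in the target sentence, transparently and accurately, given these constraints: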
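Here is what I did:
Here is an example of the correlation matrix between an English and a German sentence. We can see the challenges discussed above.
The figure shows an example of alignment between an English and a German sentence, with the correlations between words; the green cells are the correct alignment points that the word alignment algorithm should identify.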
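One simple baseline for picking alignment points out of such a matrix is a greedy one-to-one selection: repeatedly take the highest-scoring cell whose row and column are both still free. This is only a sketch of that baseline, not the method used in the question; the function name `greedy_align`, the threshold, and the toy scores are all made up for illustration.

```python
def greedy_align(scores, threshold=0.0):
    """Pick (source index, target index) links greedily by descending score."""
    candidates = sorted(
        ((score, i, j) for i, row in enumerate(scores) for j, score in enumerate(row)),
        reverse=True,
    )
    used_src, used_trg, links = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # every remaining candidate scores lower still
        if i in used_src or j in used_trg:
            continue  # each word takes part in at most one link
        used_src.add(i)
        used_trg.add(j)
        links.append((i, j))
    return sorted(links)


src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]
toy_scores = [  # illustrative values only, not measured correlations
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]
for i, j in greedy_align(toy_scores, threshold=0.5):
    print(src_words[i], "->", trg_words[j])
```

A one-to-one constraint cannot represent one-to-many alignments (for example a compound word), so it is only a transparent starting point.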
Here are some of the things I tried:
Here is the code I am using:
import random
src_words=["I","know","this"]
trg_words=["Ich","kenne","das"]
def match_indexes(word1,word2):
    return random.random() #adjust this to get the actual correlation value
all_pairs_vals=[] #list for all the source (src) and target (trg) indexes and the corresponding correlation values
for i in range(len(src_words)): #iterate over src indexes
    src_word=src_words[i] #identify the corresponding …
I want to replace XML tags with a run of a repeated character that has the same number of characters as the tag.
For example:
<o:LastSaved>2013-01-21T21:15:00Z</o:LastSaved>
I want to replace it with:
#############2013-01-21T21:15:00Z##############
How can we do this using regex?
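A short sketch of one way to do this with `re.sub`: pass a function as the replacement, so each matched tag is replaced by as many fill characters as the tag is long. The name `mask_tags` is hypothetical, and the simple `<[^>]+>` pattern assumes tags contain no `>` inside attribute values, comments, or CDATA sections.

```python
import re


def mask_tags(xml_text, fill="#"):
    """Replace every <...> tag with a run of `fill` of the same length."""
    return re.sub(r"<[^>]+>", lambda match: fill * len(match.group(0)), xml_text)


print(mask_tags("<o:LastSaved>2013-01-21T21:15:00Z</o:LastSaved>"))
```

Since every tag keeps its length, character offsets in the masked text still line up with the original.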
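I am running a function in a for loop, like this:
for element in my_list:
    my_function(element)
For some reason, some elements can send the function into a very long processing time (possibly even an infinite loop whose source I cannot really trace). So I want to add some loop control that skips the current element if its processing takes, for example, more than 2 seconds. How can this be done?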
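One possible sketch uses an interval timer and `SIGALRM` to interrupt the call. It assumes a Unix-like system and that the loop runs in the main thread; it also cannot interrupt C code that never returns control to the interpreter. The names `run_with_timeout` and `ItemTimeout` are hypothetical, and `my_function` here is a stand-in that sleeps on one element.

```python
import signal
import time


class ItemTimeout(Exception):
    """Raised inside the running call when its time budget is used up."""


def _raise_timeout(signum, frame):
    raise ItemTimeout()


def run_with_timeout(func, items, seconds=2.0):
    """Call func(item) for each item; return the items that were skipped."""
    skipped = []
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    try:
        for item in items:
            signal.setitimer(signal.ITIMER_REAL, seconds)  # start the countdown
            try:
                func(item)
            except ItemTimeout:
                skipped.append(item)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
    finally:
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return skipped


def my_function(element):  # stand-in for the real function
    if element == "slow":
        time.sleep(0.5)


print(run_with_timeout(my_function, ["a", "slow", "b"], seconds=0.1))
```

Running each call in a separate process and terminating it after a timeout is a heavier alternative that also works where signals are unavailable.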
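I have a list of points (on the order of tens of thousands), and I need to identify two things using Python:
1- The groups of contiguous points among these points (abs(x2-x1)<=1 and abs(y2-y1)<=1)
2- The arc/radius of each group
Here is a sample set: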
[[331,400],[331,1200],[332,400],[332,486],[332,522],[332,655],[332,1200],[332,3800],[ 332,3877],[332,3944],[332,3963],[332,3992],[332,4050],[333,400],[333,486],[333,522],[333, 560],[333,588],[333,655],[333,700],[333,1200],[333,3800],[333,3877],[333,3944],[333,3963] ,[333,3992],[333,4050],[334,400],[334,486],[334,522],[334,558],[334,586],[334,654],[ 334,697],[334,1200],[334,3800],[334,3877],[334,3944],[334,3963],[334,3992],[334,4050],[335, 400],[335,486],[335,521],[335,556],[335,585],[335,653],[335,695],[335,1200],[335,3800] ,[335,3877],[335,3944],[335,3963],[335,3992],[335,4050],[336,400],[336,486],[336,520],[ 336,555],[336,584],[336,651],[336,693],[336,1200],[336,3800],[336,3877],[336,3944],[336,3963],[336,3992],[336,4050],[337,400],[337,486],[337,554],[337,583],[337,649],[337,692] ,[337,1200],[337,3800],[337,3877],[337,3944],[337,3963],[337,3992],[337,4050],[338,377],[ 338,400],[338,486],[338,553],[338,582],[338,647],[338,691],[338,1200],[338,3800],[338, 3877],[338,3944],[338,3963],[338,3992],[338,4050],[339,377],[339,400],[339,486],[339,553] ,[339,581],[339,585],[339,644],[339,654],[339,690],[339,706],[339,1200],[339,3800],[ 339,3877],[339,3944],[339,3963],[339,3992],[339,4050],[340,376],[340,400],[340,486],[340, 552],[340,580],[340,585],[340,641],[340,655],[340,689],[340,713],[340,1200],[340,3800] ,[340,3877],[340,3944],[340,3963],[340,3992],[340,4050],[341,376],[341,400],[341,486],[ 341,552],[341,579],[341,585],[341,639],[341,655],[341,688],[341,715],[341,1200],[341,3800] ,[341,3877],[341,3944],[341,3963],[341,3992],[341,4050],[342,375],[342,400],[342,486],[ 342,552],[342,578],[342,585],[342,637],[342,655],[342,688],[342,717],[342,1200],[342, 3800],[342,3858],[342,3925],[342,3954],[342,4011],[342,4050],[342,4107],[343,374],[343,400] ,[343,486],[343,521],[343,552],[343,577],[343,585],[343,635],[343,642],[343,687],[ 343,718],[343,1200],[343,3800],[343,3858],[343,3925],[343,3954],[343,4011],[343,4050],[343, 
4107],[344,373],[344,400],[344,486],[344,521],[344,552],[344,576],[344,585],[344,633] ,[344,642],[344,687],[344,719],[344,1200],[344,3800],[344,3858],[344,3925],[344,3954],[ 344,4011],[344,4050],[344,4107],[345,372],[345,400],[345,486],[345,521],[345,552],[345,575] ,[345,585],[345,630],[345,642],[345,687],[345,720],[345,1200],[345,3800],[345,3858],[ 345,3925],[345,3954],[345,4011],[345,4050],[345,4107],[346,370],[346,400],[346,486],[346, 521],[346,552],[346,574],[346,585],[346,628],[346,642],[346,686],[346,721],[346,1200] ,[346,3800],[346,3858],[346,3925],[346,3954],[346,4011],[346,4050],[346,4107],[347,368],[ 347,400],[347,486],[347,521],[347,552],[347,572],[347,585],[347,626],[347,642],[347, 686],[347,721],[347,1200],[347,3800],[347,3858],[347,3925],[347,3954],[347,4011],[347,4050] ,[347,4107],[348,366],[348,400],[348,487],[348,521],[348,552],[348,570],[348,585],[ 348,624],[348,642],[348,686],[348,721],[348,1200],[348,3800],[348,3858],[348,3925],[348,3954] ,[348,4011],[348,4050],[348,4107],[349,364],[349,400],[349,487],[349,521],[349,553],[ 349,568],[349,585],[349,622],[349,642],[349,686],[349,722],[349,1200],[349,3800],[349, 3858],[349,3925],[349,3954],[349,4011],[349,4050],[349,4107],[350,362],[350,400],[350,487] ,[350,521],[350,553],[350,585],[350,619],[350,642],[350,686],[350,722],[350,1200],[ 350,3800],[350,3858],[350,3925],[350,3954],[350,4011],[350,4050],[350,4107],[351,357],[351, 400],[351,487],[351,521],[351,554],[351,585],[351,619],[351,642],[351,686],[351,722] ,[351,1200],[351,3800],[351,3819],[351,3858],[351,3877],[351,3915],[351,3934],[351,3963],[ 351,3992],[351,4050],[351,4069],[351,4107],[352,355],[352,373],[352,400],[352,487],[352,520] ,[352,555],[352,585],[352,621],[352,642],[352,686],[352,722],[352,1200],[352,3800],[ 352,3819],[352,3858],[352,3877],[352,3915],[352,3934],[352,3963],[352,3992],[352,4050],[352, 4069],[352,4107],[353,353],[353,375],[353,400],[353,487],[353,520],[353,556],[353,585] 
,[353,623],[353,642],[353,686],[353,722],[353,1200],[353,3800],[353,3819],[353,3858],[353,3877],[353,3915],[353,3934],[353,3963],[353,3992],[353,4050],[353,4069],[353,4107],[354,351],[354,376],[354,400],[354,487],[354,520],[354,558],[354,584],[354,625],[354,642],[354,686],[354,721],[354,1200],[354,3800],[354,3819],[354,3858],[354,3877]]
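A hedged sketch of both steps: the grouping is a breadth-first search over the 8 neighbouring grid cells, which matches the stated condition for integer coordinates, and the radius is only a rough estimate taken from the circle through three points of a group. A least-squares circle fit over all points of a group would be more robust. All function names are hypothetical.

```python
import math
from collections import deque


def group_adjacent_points(points):
    """Split integer points into groups linked by abs(dx) <= 1 and abs(dy) <= 1."""
    remaining = set(map(tuple, points))
    groups = []
    while remaining:
        seed = remaining.pop()
        group, queue = [seed], deque([seed])
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        group.append(neighbour)
                        queue.append(neighbour)
        groups.append(sorted(group))
    return sorted(groups)


def circumradius(p1, p2, p3):
    """Radius of the circle through three points (infinite if they are collinear)."""
    a, b, c = math.dist(p2, p3), math.dist(p1, p3), math.dist(p1, p2)
    twice_area = abs((p2[0] - p1[0]) * (p3[1] - p1[1]) - (p3[0] - p1[0]) * (p2[1] - p1[1]))
    if twice_area == 0:
        return math.inf
    return a * b * c / (2 * twice_area)


sample = [[331, 400], [331, 1200], [332, 400], [332, 1200], [333, 400]]
for group in group_adjacent_points(sample):
    first, middle, last = group[0], group[len(group) // 2], group[-1]
    print(group, circumradius(first, middle, last))
```

With tens of thousands of points the set lookups keep the grouping roughly linear in the number of points.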
I have this function that reads a text file from a given location:
<html>
<head>
<script>
function readTextFile(file){
    var out_text='';
    var rawFile = new XMLHttpRequest();
    rawFile.open("GET", file, false);
    rawFile.onreadystatechange = function (){
        if(rawFile.readyState === 4){
            if(rawFile.status === 200 || rawFile.status == 0){
                var allText = rawFile.responseText;
                out_text=allText;
            }
        }
    };
    rawFile.send(null);
    return out_text;
}
</script>
</head>
<body>
<div id="file_txt"></div>
<script>
txt = readTextFile('http://arbsq.net/dchampolu/champolu_data.txt');
document.getElementById("file_txt").innerHTML=txt;
</script>
</body>
</html>
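The problem is that I want to read a text file that keeps changing: http://arbsq.net/dchampolu/champolu_data.txt
When I run this function once, the browser caches the text file, and every time the code runs afterwards it fetches only the cached version instead of the updated one.
I could always tell users to clear their cache and so on, but is there a way, through JavaScript or jQuery, to force the browser to use only the current version of the file rather than the cached one?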
I have several sorted lists, and I want to combine them into one big sorted list. What is the most efficient way to do this?
Here is what I do, but it is too inefficient:
big_list=[]
for slist in sorted_lists: # sorted_lists is a generator, so lists have to be added one by one
    big_list.extend(slist)
big_list.sort()
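One standard-library option for this is `heapq.merge`, which performs a k-way merge of already sorted inputs and yields the items lazily. The wrapper name `merge_sorted` is hypothetical; unpacking with `*` consumes the generator, so all the lists are held in memory at once.

```python
import heapq


def merge_sorted(sorted_lists):
    """Merge already sorted iterables into one sorted list."""
    return list(heapq.merge(*sorted_lists))


lists = (chunk for chunk in (["a1", "c3"], ["b2", "d4"], ["a0", "e5"]))
print(merge_sorted(lists))
```

The merge does O(N log k) comparisons for N items in k lists. In CPython, a single `extend` followed by one `sort()` may still be faster in practice, because the built-in sort detects the pre-sorted runs, so it is worth timing both on the real data.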
Here is a sample of sorted_lists:
Size of sorted_lists = 200
Size of the first element of sorted_lists = 1668
sorted_lists=[
['000008.htm_181_0040_0009', '000008.htm_181_0040_0037', '000008.htm_201_0041_0031', '000008.htm_213_0029_0004', '000008.htm_263_0015_0011', '000018.htm_116_0071_0002', '000018.htm_147_0046_0002', '000018.htm_153_0038_0015', '000018.htm_160_0060_0001', '000018.htm_205_0016_0002', '000031.htm_4_0003_0001', '000032.htm_4_0003_0001', '000065.htm_5_0013_0005', '000065.htm_8_0008_0006', '000065.htm_14_0038_0036', '000065.htm_127_0016_0006', '000065.htm_168_0111_0056', '000072.htm_97_0016_0012', '000072.htm_175_0028_0020', '000072.htm_188_0035_0004'….],
['000018.htm_68_0039_0030', '000018.htm_173_0038_0029', '000018.htm_179_0042_0040', '000018.htm_180_0054_0021', '000018.htm_180_0054_0031', '000018.htm_182_0025_0023', '000018.htm_191_0041_0010', '000065.htm_5_0013_0007', '000072.htm_11_0008_0002', '000072.htm_14_0015_0002', '000072.htm_75_0040_0021', '000079.htm_11_0005_0000', '000079.htm_14_0006_0000', '000079.htm_16_0054_0006', '000079.htm_61_0018_0012', '000079.htm_154_0027_0011', '000086.htm_8_0003_0000', '000086.htm_9_0030_0005', '000086.htm_11_0038_0004', '000086.htm_34_0031_0024'….],
['000001.htm_13_0037_0004', '000008.htm_48_0025_0006', '000008.htm_68_0025_0008', '000008.htm_73_0024_0014', '000008.htm_122_0034_0026', '000008.htm_124_0016_0005', '000008.htm_144_0046_0030', '000059.htm_99_0022_0012', '000065.htm_69_0045_0017', '000065.htm_383_0026_0020', '000072.htm_164_0030_0002', …
I created an SVC model using sklearn and pickled it:
clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
#model_file_name='classify_pages_model'
with open('our_classifier.pkl', 'wb') as fid:
    cPickle.dump(clf, fid)
I tried to load it and use it in another file:
with open('our_classifier.pkl', 'rb') as fid:
    clf = cPickle.load(fid)
X_test=tfidf_vectorizer.fit_transform((get_text(f) for f in urls))
pred=clf.predict(X_test)
It gives me this error:
ValueError: X has 664 features per sample; expecting 47387
How can I make sure the features in the test documents are the same as the features in the model?
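A likely cause is that the second file calls `fit_transform`, which learns a new vocabulary from the test documents, so the feature count no longer matches the one the classifier was trained on. One way to keep the two in step, sketched here with hypothetical helper names, is to pickle the fitted vectorizer together with the classifier and call only `transform` at prediction time. The sketch uses the standard `pickle` module and works with any picklable pair of objects that offer `transform` and `predict`.

```python
import pickle


def save_bundle(vectorizer, classifier, path):
    """Store the fitted vectorizer and classifier in one pickle file."""
    with open(path, "wb") as fid:
        pickle.dump({"vectorizer": vectorizer, "classifier": classifier}, fid)


def load_bundle(path):
    """Return (vectorizer, classifier) exactly as they were saved."""
    with open(path, "rb") as fid:
        bundle = pickle.load(fid)
    return bundle["vectorizer"], bundle["classifier"]


def predict_texts(path, texts):
    """Vectorize with the training vocabulary, then classify."""
    vectorizer, classifier = load_bundle(path)
    features = vectorizer.transform(texts)  # transform only: do not refit on test data
    return classifier.predict(features)
```

In the training script this would be called as `save_bundle(tfidf_vectorizer, clf, 'our_classifier.pkl')` after fitting both objects.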
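-- Edit
The problem does not occur when I train and test in the same code (only when I pickle the model and load it from other code).
The following code works fine, but when I pickle clf I cannot run the testing part, because the number of features in X_test differs from the number of features in clf.
1 - Training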
X_train=tfidf_vectorizer.fit_transform((read(f) for f in train_files_paths))
clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
2 - Testing
X_test=tfidf_vectorizer.transform((get_text(f) for f in urls))
pred=clf.predict(X_test)
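For a set of typical word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them, and then remove it. I know this can be done by iterating with word.endswith('ize'), but I believe there is a neater regex way. I tried a positive lookahead with the end marker $, but for some reason it does not work: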
import re
pat='(?=ate|ize|ify|able)$'
word='terrorize'
re.findall(pat,word)
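The pattern above cannot match because the lookahead requires a suffix to start at the current position while `$` requires the string to end at that same position. A minimal sketch of an alternative puts the `$` after a non-capturing group and removes the match with `sub`. The names `SUFFIXES`, `SUFFIX_RE`, and `strip_suffix` are made up for illustration.

```python
import re

SUFFIXES = ("ate", "ize", "ify", "able")
# Anchor the alternation at the end of the word: (?:ate|ize|ify|able)$
SUFFIX_RE = re.compile(r"(?:%s)$" % "|".join(map(re.escape, SUFFIXES)))


def strip_suffix(word):
    """Remove one listed suffix from the end of the word, if present."""
    return SUFFIX_RE.sub("", word)


print(strip_suffix("terrorize"))
```

To only test for a suffix without consuming it, the end anchor can go inside the lookahead instead: `(?=(?:ate|ize|ify|able)$)`.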
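I need to write some text to a file that contains a mix of the line break characters \r and \n, and I want to keep both. However, in Python 3, when I write this text to a file, all instances of \r are replaced with \n. This behavior differs from Python 2, as you can see in the output below. What can I do to stop this replacement?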
Here is the code:
import string
printable=string.printable
print([printable])
fopen=open("test.txt","w")
fopen.write(printable)
fopen.close()
fopen=open("test.txt","r")
content=fopen.read()
print([content])
fopen.close()
Here is the output when I run the code on Python 2 and Python 3:
(base) Husseins-Air:Documents hmghaly$ python2.7 test_write_line_break.py
['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c']
['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c']
(base) Husseins-Air:Documents hmghaly$ python test_write_line_break.py
['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c']
['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\n\x0b\x0c']
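In Python 3, text mode applies universal-newline translation by default, so a lone `\r` is turned into `\n` when the file is read back. Passing `newline=""` to `open` disables that translation in both directions. The sketch below assumes that is the cause here; the helper names are hypothetical.

```python
import string


def write_text(path, text):
    """Write text without translating '\\n' to the platform line separator."""
    with open(path, "w", newline="") as handle:
        handle.write(text)


def read_text(path):
    """Read text without translating '\\r' or '\\r\\n' to '\\n'."""
    with open(path, "r", newline="") as handle:
        return handle.read()


if __name__ == "__main__":
    write_text("test.txt", string.printable)
    print([read_text("test.txt")])
```

Opening the file in binary mode (`"wb"` and `"rb"`) and encoding or decoding explicitly would be another way to keep the bytes untouched.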