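I am working through the tutorial provided in Part 1 and Part 2. Unfortunately, the author did not have time for the final section, which would have used cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow; the code mentioned in that link is included below (just to make life easier).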
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."] # Documents
test_set = ["The sun in the sky is bright."] # Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray …
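The missing cosine-similarity step can be sketched as follows. This is a minimal illustration, not the tutorial's method: it works on raw term counts rather than TF-IDF weights, and the three-word `STOP_WORDS` set is an assumption standing in for NLTK's English stop word list. The helper names `term_counts` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked list standing in for NLTK's English stop words.
STOP_WORDS = {"the", "is", "in"}


def term_counts(text):
    """Lowercase the text, split it into words and count the non-stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query

query = term_counts(test_set[0])
similarities = [cosine_similarity(term_counts(doc), query) for doc in train_set]
```

A value closer to 1 means the document points in nearly the same direction as the query in term space, so here the second document ranks above the first.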
I have the following recursive code; at each node I run an SQL query to fetch the nodes that belong to the parent node.
Here is the error:
Exception RuntimeError: 'maximum recursion depth exceeded' in <bound method DictCursor.__del__ of <MySQLdb.cursors.DictCursor object at 0x879768c>> ignored
RuntimeError: maximum recursion depth exceeded while calling a Python object
Exception AttributeError: "'DictCursor' object has no attribute 'connection'" in <bound method DictCursor.__del__ of <MySQLdb.cursors.DictCursor object at 0x879776c>> ignored
The method I call to get the SQL results:
def returnCategoryQuery(query, variables={}):
    # Run the query and collect the 'cl_to' column of every returned row.
    cursor = db.cursor(cursors.DictCursor)
    catResults = []
    try:
        cursor.execute(query, variables)
        for categoryRow in cursor.fetchall():
            catResults.append(categoryRow['cl_to'])
        return catResults
    except Exception:
        # Print the traceback; the function then returns None implicitly.
        traceback.print_exc()
I don't actually have any problem with the method above, but I have included it anyway to give a complete picture of the question.
The recursive code:
def leaves(first, path=[]):
    if first:
        for elem in …
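Because the recursive code is cut off, the cause of the error cannot be read from it. One common cause of "maximum recursion depth exceeded" in this kind of traversal is a hierarchy that is very deep or that contains a cycle. The sketch below shows one way to sidestep both, assuming that is the situation: it replaces recursion with an explicit stack and remembers visited nodes. `collect_leaves` and `children_of` are hypothetical names; `children_of` stands in for the SQL lookup.

```python
def collect_leaves(root, children_of):
    """Return every reachable node that has no children, without recursion.

    children_of is a callable returning the list of child nodes of a node
    (a hypothetical stand-in for the SQL query in the question).
    """
    leaves = []
    seen = {root}   # nodes already scheduled, so cycles cannot loop forever
    stack = [root]  # explicit stack instead of the Python call stack
    while stack:
        node = stack.pop()
        children = children_of(node)
        if not children:
            leaves.append(node)
        for child in children:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return leaves


# Usage with an in-memory graph that contains a cycle (c points back to a).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
found = collect_leaves("a", lambda node: graph.get(node, []))
```

The explicit stack keeps memory use proportional to the number of pending nodes, and the `seen` set guarantees that each node is queried at most once.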
I have the following HTML structure, and I am trying to build a robust way to extract the second Color Digest element, since there will be many such tags in the DOM.
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<td>Color Digest </td>
<td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
</tr>
<tr>
<td>Color Digest </td>
<td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
I am trying to extract the second "Color Digest" td element, the one that holds the decoded value.
I wrote the XPath below, but it does not return the second td element.
//td[text() = ' Color Digest ']/following-sibling::td[2]
When I change td[2] to td[1], I get both elements.
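The behaviour described above follows from how the position test is scoped: `following-sibling::td[2]` is evaluated separately for each matched label cell, and in the markup shown each label cell has only one following `td` sibling, so `[2]` matches nothing while `[1]` matches the value cell in both rows. In XPath 1.0, wrapping the whole expression in parentheses and indexing the result, as in `(//td[normalize-space(text()) = 'Color Digest']/following-sibling::td[1])[2]`, selects the second match in document order. The same selection can be sketched in Python with the standard library; the shortened cell values and the helper name `color_digest_values` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened, well-formed version of the table in the question.
markup = """<table><tbody>
<tr><td>Color Digest </td><td>AgArAQ </td></tr>
<tr><td>Color Digest </td><td>2,43,2,25 </td></tr>
</tbody></table>"""


def color_digest_values(source):
    """Return the value cell of every row whose first cell is 'Color Digest'."""
    root = ET.fromstring(source)
    values = []
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and (cells[0].text or "").strip() == "Color Digest":
            values.append((cells[1].text or "").strip())
    return values


digests = color_digest_values(markup)
second_digest = digests[1]  # the decoded, comma-separated value
```

Collecting all matches first and then indexing the list mirrors the parenthesized XPath form: the position applies to the combined result rather than to each row.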
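I am reading a large csv file into a data frame. The data in the csv file comes from multiple websites and represents user information. For example, here is the structure of the data frame.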
user_id, number_of_logins, number_of_images, web
001, 34, 3, aa.com
002, 4, 4, aa.com
034, 3, 3, aa.com
001, 12, 4, bb.com
002, 1, 3, bb.com
034, 2, 2, cc.com
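As you can see, once I bring the data into the data frame, user_id is no longer a unique ID, which throws off all of the analysis. I am trying to add another column before user_id, something like "generated_uid", and populate it more or less with the index of the data.frame. What is the best way to achieve this?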
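The idea of a generated row identifier can be sketched with Python's standard `csv` module. This is only an illustration under assumptions: the question does not say which data-frame library is in use, the sample rows are copied from the listing above, and `generated_uid` is simply the running row index placed ahead of `user_id`.

```python
import csv
import io

# Sample rows copied from the question; a real file would be opened instead.
raw = """user_id, number_of_logins, number_of_images, web
001, 34, 3, aa.com
002, 4, 4, aa.com
034, 3, 3, aa.com
001, 12, 4, bb.com
002, 1, 3, bb.com
034, 2, 2, cc.com
"""

reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)

rows = []
for index, row in enumerate(reader):
    # Put the generated identifier first, followed by the original columns.
    rows.append({"generated_uid": index, **row})

unique_ids = {row["generated_uid"] for row in rows}
```

If the goal is instead an identifier that stays stable across reloads, combining `user_id` and `web` into one key may be a better fit than a positional index, since the pair appears to be unique in the sample.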
I use a Ruby script to convert ISO timestamps to epoch time. The files I am parsing have the following timestamp structure:
2009-03-08T00:27:31.807
Because I want to keep the milliseconds, I use the following Ruby code to convert it to epoch time:
irb(main):010:0> DateTime.parse('2009-03-08T00:27:31.807').strftime("%Q")
=> "1236472051807"
But in Python I tried the following:
import time
time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1236472051807))
But I do not get the original date and time back:
>>> time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1236472051807))
'41152-03-29 02:50:07'
>>>
How should I format it to get the original timestamp back?
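The year 41152 appears because `time.gmtime` expects seconds since the epoch, whereas Ruby's `%Q` yields milliseconds. A minimal sketch of the reverse conversion, assuming the timestamp is meant as UTC, is shown below; `from_epoch_millis` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis):
    """Convert integer milliseconds since the Unix epoch to a UTC datetime."""
    # Adding a timedelta avoids floating-point rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=millis)


moment = from_epoch_millis(1236472051807)

# Rebuild the original ISO layout, keeping exactly three fractional digits.
formatted = moment.strftime("%Y-%m-%dT%H:%M:%S.") + "%03d" % (moment.microsecond // 1000)
# formatted == '2009-03-08T00:27:31.807'
```

If the milliseconds are not needed, dividing by 1000 before calling `time.gmtime` gives the same date and time down to the second.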
I am parsing JSON data. The parsing itself works fine; I am using the simplejson module. However, some API requests return empty values. Here is my example:
{
"all" : {
"count" : 0,
"questions" : [ ]
}
}
Here is the code snippet in which I parse the JSON object:
qByUser = byUsrUrlObj.read()
qUserData = json.loads(qByUser).decode('utf-8')
questionSubjs = qUserData["all"]["questions"]
As I mentioned, for some requests I get the following error:
Traceback (most recent call last):
File "YahooQueryData.py", line 164, in <module>
qUserData = json.loads(qByUser)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/__init__.py", line 385, in loads
return _default_decoder.decode(s)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/decoder.py", line 402, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/decoder.py", line 420, in raw_decode
raise JSONDecodeError("No JSON object could be decoded", s, idx)
simplejson.decoder.JSONDecodeError: No JSON object …
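The failing response body is not shown, so the cause is not certain. One possibility, stated here as an assumption, is that some requests return an empty or non-JSON body rather than the empty `questions` list shown earlier. A defensive sketch using the standard `json` module (whose decode error is a `ValueError` subclass) follows; `parse_questions` is a hypothetical helper name.

```python
import json


def parse_questions(body):
    """Return the list under all -> questions, or [] when the body is unusable."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    if not body.strip():
        return []  # empty response: nothing to decode
    try:
        data = json.loads(body)
    except ValueError:
        return []  # body was not valid JSON
    if not isinstance(data, dict):
        return []
    return data.get("all", {}).get("questions", [])


empty_body = parse_questions(b"")
empty_list = parse_questions('{"all": {"count": 0, "questions": []}}')
one_item = parse_questions('{"all": {"count": 1, "questions": [{"Subject": "x"}]}}')
```

Returning an empty list in every failure case lets the calling loop run unchanged, although logging the raw body in the `except` branch would make it easier to see what the API actually sent.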
I have read a csv file and pivoted it into the following structure:
pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)
Part of the result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 group
user_id
2 33653 2325 916 720 867 187 31 0 6 3 42 56 92 15 l-1
4 18895 414 1116 570 1190 55 92 0 122 23 78 6 4 2 l-2
16 1383 70 27 17 17 1 0 0 0 0 1 0 …
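I just switched from MacPorts to Homebrew. After installing all the required Xcode versions and other software, I tried to install Python with Homebrew. I think it installed successfully, but when I run which python it still shows 2.7.3, which I believe is the version that ships with Mountain Lion.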
which python
/usr/local/bin/python
python --version
Python 2.7.3
So I tried installing it again:
brew install python --framework --universal
Warning: python-2.7.5 already installed, it's just not linked
But it says python 2.7.5 is already installed and just not linked, so I tried brew link python, which gave me the following message, and I don't know what I should do:
Linking /usr/local/Cellar/python/2.7.5... Warning: Could not link python. Unlinking...
Error: Could not symlink file: /usr/local/Cellar/python/2.7.5/bin/smtpd2.py
Target /usr/local/bin/smtpd2.py already exists. You may need to delete it.
To force the link and overwrite all other conflicting files, do:
brew link --overwrite formula_name
To list all files that would be deleted:
brew link --overwrite --dry-run formula_name
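What I want to do is sort a map by value. I went through many of the questions available on the stackoverflow site and found the following solution, which does what I want but misses one small thing.
Link1: Sorting a map
The problem I have is that it sorts by value in ascending order by default. I want to sort in descending order.
So what I did was create a class that implements Comparator: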
class MyComparator implements Comparator {
    Map map;

    public MyComparator(Map map) {
        this.map = map;
    }

    public int compare(Object o1, Object o2) {
        return ((Integer) map.get(o2)).compareTo((Integer) map.get(o1));
    }
}
Then I pass my map to a TreeMap:
MyComparator comp = new MyComparator(myMap);
Map<String, Integer> newMap = new TreeMap(comp);
newMap.putAll(myMap);
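This seems like a bad approach because I feel it is inefficient. Is there a way to change the solution in the link so that it sorts in descending order by default?
When I start the ipython notebook server with the following command: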
$ ipython notebook --profile=myserver
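I get the following screen, which I don't remember seeing before. It appears to be an interactive screen: I can move the cursor and press Enter, but I don't know what I am supposed to do. I have not seen this before, and despite a lot of googling I could not find any details about what I need to choose.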
IPython Dashboard
IPython Notebook requires JavaScript.
Please enable it to proceed.
IPython Notebook
* Notebooks
* Clusters
To import a notebook, drag the file onto the listing below or click here. ____________________
(Submit) Refresh (Submit) New Notebook
* /
* rootHome /
* subdir /
* anotherSubdir /
IPython parallel computing clusters (Submit) Refresh
profile status # of engines action
(Form submit button) Use right-arrow or <return> to submit ('x' for no cache).
Arrow …