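Posts by add_ons

Python: tf-idf-cosine: to find document similarity

I was following a tutorial that is available as Part 1 and Part 2. Unfortunately, the author did not have time for the final section, which uses cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. The code mentioned in that link is included here (just to make life easier):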

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray …
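
The missing final step, cosine similarity between the query and each document, might look like the sketch below. It is not the tutorial author's code: it works on raw term counts in plain Python rather than on scikit-learn's TF-IDF weights, and the tiny STOP_WORDS set and the helper names term_counts and cosine_similarity are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the question uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "in"}

def term_counts(text):
    """Lowercase, split into words, drop stop words, and count the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query

query = term_counts(test_set[0])
scores = [cosine_similarity(term_counts(doc), query) for doc in train_set]
for doc, score in zip(train_set, scores):
    print(f"{score:.3f}  {doc}")
```

With these inputs the second document shares two terms with the query and the first shares one, so the second scores higher. Swapping the raw counts for TF-IDF weights changes the numbers but not the formula.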

python information-retrieval machine-learning nltk tf-idf

84 votes, 5 answers, 90k views

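Python: maximum recursion depth exceeded

I have the following recursive code. At each node I run a SQL query to get the nodes that belong to the parent node.

Here is the error: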

Exception RuntimeError: 'maximum recursion depth exceeded' in <bound method DictCursor.__del__ of <MySQLdb.cursors.DictCursor object at 0x879768c>> ignored

RuntimeError: maximum recursion depth exceeded while calling a Python object
Exception AttributeError: "'DictCursor' object has no attribute 'connection'" in <bound method DictCursor.__del__ of <MySQLdb.cursors.DictCursor object at 0x879776c>> ignored

The method I call to get the SQL results:

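import traceback

def returnCategoryQuery(query, variables=None):
    # `db` and `cursors` come from the surrounding MySQLdb setup (not shown).
    if variables is None:
        variables = {}  # avoid a shared mutable default argument
    cursor = db.cursor(cursors.DictCursor)
    catResults = []
    try:
        cursor.execute(query, variables)
        for categoryRow in cursor.fetchall():
            catResults.append(categoryRow['cl_to'])
        return catResults
    except Exception:
        # Log the failure; the function then returns None implicitly.
        traceback.print_exc()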

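I don't actually have any problem with the method above, but I included it anyway to give a complete overview of the question.

The recursive code: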

def leaves(first, path=[]):
    if first:
        for elem in …
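
One common way around the recursion limit is to walk the tree with an explicit stack instead of recursive calls. The sketch below is not the question's code: get_children is a hypothetical stand-in for the SQL lookup, and the CHILDREN dict is made-up sample data. It also tracks visited nodes, since a cycle in the category data would otherwise keep the traversal going forever.

```python
# Hypothetical sample data standing in for the category table.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["root"],  # deliberate cycle back to the root
    "a1": [],
    "a2": [],
}

def get_children(node):
    """Placeholder for the SQL lookup in the question (assumed, not original)."""
    return CHILDREN.get(node, [])

def leaves(first):
    """Return the leaf nodes reachable from `first` without recursion."""
    found = []
    visited = set()
    stack = [first]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already expanded; this is what breaks cycles
        visited.add(node)
        children = get_children(node)
        if children:
            stack.extend(children)
        else:
            found.append(node)
    return found

print(sorted(leaves("root")))
```

Because the pending nodes live in a list rather than on the call stack, the depth of the tree no longer matters to the interpreter.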

python recursion max tree-traversal depth

79 votes, 1 answer, 170k views

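XPath: get following sibling

I have the following HTML structure. I am trying to build a robust method to extract the second Color Digest element, since there will be a lot of these tags in the DOM.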

<table>
  <tbody>
    <tr bgcolor="#AAAAAA">
    <tr>
    <tr>
    <tr>
    <tr>
      <td>Color Digest </td>
      <td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
    </tr>
    <tr>
      <td>Color Digest </td>
      <td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
    </tr>
  </tbody>
</table>

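I am trying to extract the second "Color Digest" td element, the one that holds the decoded value.

I wrote the XPath below, but it does not return the second td element.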

//td[text() = ' Color Digest ']/following-sibling::td[2]

When I change td[2] to td[1], I get both elements.
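
A likely explanation is that following-sibling::td[2] asks for the second td after the label cell, and each row has only one. In a full XPath 1.0 engine, an expression such as (//td[normalize-space(.) = 'Color Digest'])[2]/following-sibling::td[1] would pick the value cell of the second matching row. The sketch below does the equivalent selection in plain Python with the standard library's ElementTree, which does not support the following-sibling axis. The HTML string is a shortened, well-formed stand-in for the table above, and color_digest_values is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the table in the question; values shortened.
HTML = """
<table>
  <tbody>
    <tr bgcolor="#AAAAAA"><td>Header</td></tr>
    <tr>
      <td>Color Digest </td>
      <td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
    </tr>
    <tr>
      <td>Color Digest </td>
      <td>2,43,2,25,21,28,0,0 </td>
    </tr>
  </tbody>
</table>
"""

def color_digest_values(root, label="Color Digest"):
    """Collect the value cell of every row whose first cell matches `label`."""
    values = []
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Compare after stripping, because the cell text has trailing spaces.
        if len(cells) >= 2 and (cells[0].text or "").strip() == label:
            values.append((cells[1].text or "").strip())
    return values

root = ET.fromstring(HTML.strip())
values = color_digest_values(root)
second = values[1]  # value cell of the second "Color Digest" row
print(second)
```

Real-world HTML is often not well-formed XML, so a dedicated HTML parser would be the safer choice for scraped pages.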

html xpath siblings scraper

68 votes, 2 answers, 190k views

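Add an index (numeric ID) column to a large data frame

I am reading a large csv file into a data frame. The data in the csv file comes from several websites and represents user information. For example, here is the structure of the data frame.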

user_id, number_of_logins, number_of_images, web
001, 34, 3, aa.com
002, 4, 4, aa.com
034, 3, 3, aa.com
001, 12, 4, bb.com
002, 1, 3, bb.com
034, 2, 2, cc.com

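As you can see, once the data is in the data frame, user_id is no longer a unique ID, which causes problems for all of the analysis. I am trying to add another column before user_id, something like "generated_uid", and essentially fill that column with the index of the data.frame. What is the best way to accomplish this?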

65 votes, 4 answers, 170k views

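Convert epoch time in milliseconds to datetime

I used a Ruby script to convert ISO timestamps to epoch time. The files I am parsing have the following timestamp structure: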

2009-03-08T00:27:31.807

Since I want to keep the milliseconds, I used the following Ruby code to convert it to epoch time:

irb(main):010:0> DateTime.parse('2009-03-08T00:27:31.807').strftime("%Q")
=> "1236472051807"

But in Python I tried the following:

import time 
time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1236472051807))

But I don't get the original date and time back:

>>> time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1236472051807))
'41152-03-29 02:50:07'
>>>

I wonder whether this is related to how I am formatting it?
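
The year 41152 is what one would expect if the value is read as seconds: time.gmtime takes seconds since the epoch, while the Ruby %Q value is in milliseconds. A minimal sketch of one way to convert it while keeping the milliseconds follows. The helper name from_epoch_millis is an assumption, and the result is taken to be UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch_millis(millis):
    """Convert integer milliseconds since the Unix epoch to an aware UTC datetime."""
    # timedelta does exact integer arithmetic, so no float rounding is involved.
    return EPOCH + timedelta(milliseconds=millis)

dt = from_epoch_millis(1236472051807)
# %f prints microseconds (6 digits); trim to 3 digits for milliseconds.
formatted = dt.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
print(formatted)  # 2009-03-08 00:27:31.807
```

Dividing by 1000.0 and calling a timestamp-based constructor is the more common shortcut. The timedelta form is used here only because it sidesteps floating-point rounding of the millisecond part.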

ruby python datetime epoch

63 votes, 2 answers, 90k views

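Python: handling JSON decode errors when nothing is returned

I am parsing JSON data. I have no problem with the parsing itself, for which I am using the simplejson module. But some API requests return empty values. Here is my example: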

{
"all" : {
    "count" : 0,
    "questions" : [     ]
    }
}

Here is the snippet of my code that parses the JSON object:

qByUser = byUsrUrlObj.read()
# Decode the raw response before parsing; the parsed result has no .decode().
qUserData = json.loads(qByUser.decode('utf-8'))
questionSubjs = qUserData["all"]["questions"]

As I mentioned, for some requests I get the following error:

Traceback (most recent call last):
  File "YahooQueryData.py", line 164, in <module>
    qUserData = json.loads(qByUser)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/__init__.py", line 385, in loads
    return _default_decoder.decode(s)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/decoder.py", line 402, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/simplejson/decoder.py", line 420, in raw_decode
    raise JSONDecodeError("No JSON object could be decoded", s, idx)
simplejson.decoder.JSONDecodeError: No JSON object …
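
One way to cope with responses that contain no JSON at all is to catch the decode error and fall back to an empty result. The sketch below uses the standard library's json module rather than simplejson, and the helper name parse_questions is an assumption. It catches ValueError, which the standard library's JSONDecodeError subclasses.

```python
import json

def parse_questions(raw):
    """Return the question list from a response body, or [] if it cannot be used."""
    try:
        data = json.loads(raw)
    except ValueError:
        # Raised for empty or malformed bodies (JSONDecodeError is a ValueError).
        return []
    if not isinstance(data, dict):
        return []  # valid JSON, but not the expected object
    # Missing keys are treated the same as an empty result.
    return data.get("all", {}).get("questions", [])

print(parse_questions(""))
print(parse_questions('{"all": {"count": 0, "questions": []}}'))
print(parse_questions('{"all": {"count": 1, "questions": ["q1"]}}'))
```

Returning an empty list keeps the calling loop simple. Whether silently skipping a bad response is acceptable depends on the application, so logging the failure may be preferable.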

python json python-3.x

57 votes, 2 answers, 130k views

Pandas: sum across columns and divide each cell by that value

I have read a csv file and pivoted it into the following structure:

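# Keyword arguments work across pandas versions; newer releases reject positional ones.
pivoted = df.pivot(index='user_id', columns='group', values='value')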
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)

Part of the result:

             0     1     2    3     4    5   6  7    8   9  10  11  12  13  group
user_id                                                                      
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15    l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2    l-2 
16        1383    70    27   17    17    1   0  0    0   0   1   0 …

python dataframe pandas

40 votes, 4 answers, 50k views

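How to link the Homebrew Python version and set it as the default

I just switched from MacPorts to Homebrew. After installing all the required Xcode versions and other software, I tried to install Python using Homebrew. I think it installed successfully, but when I run which python it still shows me 2.7.3, which I think is the version shipped with Mountain Lion.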

which python
/usr/local/bin/python

python --version
Python 2.7.3

So I tried to install it again:

brew install python --framework --universal
Warning: python-2.7.5 already installed, it's just not linked

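But it says Python 2.7.5 is already installed and just not linked, so I tried running brew link python

That gave me the following message, and I don't know what I am supposed to do:

Linking /usr/local/Cellar/python/2.7.5... Warning: Could not link python. Unlinking...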

Error: Could not symlink file: /usr/local/Cellar/python/2.7.5/bin/smtpd2.py
Target /usr/local/bin/smtpd2.py already exists. You may need to delete it.
To force the link and overwrite all other conflicting files, do:
  brew link --overwrite formula_name

To list all files that would be deleted:
  brew link --overwrite --dry-run formula_name

python macos homebrew

39 votes, 7 answers, 80k views

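Sort descending: Java Map

What I want to do is sort a map by value. I went through many of the questions available on the stackoverflow site and found the following solution, which does what I want but misses one small thing.

Link1: Sorting Map

The issue I am running into is that, by default, this sorts by value in ascending order. I want to sort in descending order.

So what I did was create a class that implements Comparator: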

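import java.util.Comparator;
import java.util.Map;

class MyComparator implements Comparator<String> {
    private final Map<String, Integer> map;

    public MyComparator(Map<String, Integer> map) {
        this.map = map;
    }

    // Orders keys by their mapped value, largest first. Keys with equal
    // values compare as equal, so a TreeMap will keep only one of them.
    @Override
    public int compare(String o1, String o2) {
        return map.get(o2).compareTo(map.get(o1));
    }
}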

Then I pass my map to a TreeMap:

MyComparator comp = new MyComparator(myMap);
Map<String, Integer> newMap = new TreeMap<String, Integer>(comp);
newMap.putAll(myMap);

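This seems like a bad approach because I feel it is inefficient. Is there a way to change the solution in the link so that it sorts in descending order by default?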

java sorting hashmap

34 votes, 3 answers, 70k views

IPython Notebook requires JavaScript

When I start the IPython notebook server with the following command:

$ ipython notebook --profile=myserver

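I get the following screen, which I don't remember seeing before. It appears to be an interactive screen: I can move the cursor and press Enter, but I have no idea what I am supposed to do. I have not seen this before, and after a lot of googling I could not find any details about what I need to select.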

    IPython Dashboard
   IPython Notebook requires JavaScript.
   Please enable it to proceed.

   IPython Notebook

     * Notebooks
     * Clusters

   To import a notebook, drag the file onto the listing below or click here. ____________________
   (Submit) Refresh (Submit) New Notebook
     * /
     * rootHome /
     * subdir /
     * anotherSubdir /

   IPython parallel computing clusters (Submit) Refresh
   profile status # of engines action

(Form submit button) Use right-arrow or <return> to submit ('x' for no cache).
  Arrow …

ipython ipython-notebook

32 votes, 2 answers, 20k views

Tag statistics

html ×1

information-retrieval ×1

ipython-notebook ×1

java ×1

json ×1

machine-learning ×1

max ×1

nltk ×1

r ×1

ruby ×1

tree-traversal ×1
