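Posts by nik*_*osd

Setting numpoints in a matplotlib legend does not work

I am trying to show a single data point in a plot legend by following the advice here, but it does not seem to work: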

from pylab import scatter
import pylab
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.gca()
ax.scatter(1,2,c = 'blue', marker = 'x')
ax.scatter(2,3, c= 'red', marker = 'o')
ax.legend(('1','2'), loc = 2, numpoints = 1)
plt.show()

[Figure: code output]

Am I doing something completely stupid? Some additional information:

In [147]:  import matplotlib 
           print matplotlib.__version__

Out [147]: 1.1.1rc
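One possible explanation, offered as a sketch rather than a confirmed diagnosis: `numpoints` applies to legend entries for line plots, while entries created by `scatter` are controlled by the separate `scatterpoints` argument of `legend`. The example below assumes a headless `Agg` backend and passes the labels through the `label` keyword; both choices are illustrative and not taken from the post.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(1, 2, c="blue", marker="x", label="1")
ax.scatter(2, 3, c="red", marker="o", label="2")

# scatterpoints (rather than numpoints) sets how many markers a scatter entry shows
legend = ax.legend(loc=2, scatterpoints=1)
```

With `scatterpoints=1`, each scatter series should be drawn with a single marker in the legend.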


python matplotlib legend

nik*_*osd

2017 05-23

Score: 12 | Answers: 1 | Views: 3749

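Efficient way to create a term density matrix from a pandas DataFrame

I am trying to create a term density matrix from a pandas DataFrame so that I can rate the terms that appear in it. I would also like to keep the "spatial" aspect of my data (see the comment at the end of the post for an example of what I mean).

I am new to pandas and NLTK, so I expect my problem can be solved with some existing tools.

I have a DataFrame with two columns of interest, say 'title' and 'page':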

    import pandas as pd
    import re

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ','Split orange','Something else'], 'page':[1, 2, 3, 4]})
    df.head()

       page                 title
    0     1  Delicious boiled egg
    1     2            Fried egg 
    2     3          Split orange
    3     4        Something else

My goal is to clean up the text and pass the terms of interest to a TDM DataFrame. I use two functions to help me clean up the strings:

    import nltk.classify
    from nltk.tokenize import wordpunct_tokenize
    from nltk.corpus import stopwords
    import string   

    def remove_punct(strin):
        '''
        returns a string with the punctuation marks removed, and all lower case letters
        input: strin, an ascii string. convert using strin.encode('ascii','ignore') if …
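As a rough illustration of the goal described above, here is a minimal sketch that builds a page-by-term count matrix with pandas alone. The `tokenize` helper is hypothetical and much simpler than the NLTK-based cleaning in the post (no stop-word removal), and `DataFrame.explode` assumes a reasonably recent pandas.

```python
import string
import pandas as pd

df = pd.DataFrame({
    "title": ["Delicious boiled egg", "Fried egg ", "Split orange", "Something else"],
    "page": [1, 2, 3, 4],
})

def tokenize(text):
    # hypothetical helper: lower-case, strip punctuation, split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# one row per (page, term) pair
tokens = df.assign(term=df["title"].map(tokenize)).explode("term")

# count each term per page; pages are rows and terms are columns
tdm = tokens.groupby(["page", "term"]).size().unstack(fill_value=0)
```

Keeping `page` as the row index preserves where each term occurred, which is one way to read the "spatial" requirement.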


python r nltk pandas

nik*_*osd

2014 03-06

Score: 7 | Answers: 1 | Views: 5739

A good way to add terms to python pattern singularize

I use python pattern to get the singular form of English nouns.

    In [1]: from pattern.en import singularize
    In [2]: singularize('patterns')
    Out[2]: 'pattern'
    In [3]: singularize('gases')
    Out[3]: 'gase'

I work around the problem in the second example by defining

    def my_singularize(strn):
        '''
        Return the singular of a noun. Add special cases to correct pattern generic rules.
        '''
        exceptionDict = {'gases':'gas','spectra':'spectrum','cross':'cross','nuclei':'nucleus'}
        try:
            # use the hand-written exception when one exists
            return exceptionDict[strn]
        except KeyError:
            # otherwise fall back to pattern's generic rules
            return singularize(strn)

Is there a better way to do this, e.g. by adding to pattern's rules, or by somehow making exceptionDict internal to pattern?
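One way to keep the exceptions separate from the lookup logic is a small wrapper factory, sketched below. To stay runnable without pattern installed, `naive_singularize` is a hypothetical stand-in for `pattern.en.singularize`, and `make_singularizer` is likewise an illustrative name. Some versions of pattern also appear to accept a `custom` dictionary argument on `singularize`; that is an assumption worth checking against the installed version.

```python
def naive_singularize(word):
    # hypothetical stand-in for pattern.en.singularize: drop a trailing "s"
    return word[:-1] if word.endswith("s") else word

def make_singularizer(base_singularize, exceptions):
    """Return a singularizer in which the listed exceptions take precedence."""
    overrides = dict(exceptions)

    def singularize_with_overrides(word):
        if word in overrides:
            return overrides[word]
        return base_singularize(word)

    return singularize_with_overrides

my_singularize = make_singularizer(
    naive_singularize,
    {"gases": "gas", "spectra": "spectrum", "cross": "cross", "nuclei": "nucleus"},
)
```

Swapping `naive_singularize` for the real `singularize` would leave the rest unchanged, and the exception table is built once rather than on every call.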

python nlp

nik*_*osd

lucky-day

Score: 6 | Answers: 1 | Views: 1958

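Token pattern for n-grams in TfidfVectorizer in python

Does TfidfVectorizer identify n-grams using python regular expressions?

This question came up while reading the documentation for scikit-learn's TfidfVectorizer, where I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I have trouble seeing how this works. Consider the bi-gram case. If I do: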

    In [1]: import re
    In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
    Out[2]: []

I do not find any bigrams. Whereas:

    In [2]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
    Out[2]: [u'this is', u'a sentence', u'this is', u'another one']

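finds some (but not all; for example u'is a' and all the other even-numbered bigrams are missing). What am I doing wrong in interpreting the function of the \b character?

Note: according to the regular expression module documentation, the \b character in re is supposed to:

\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

I have seen questions that address the problem of identifying n-grams in python (see 1, 2), so a secondary question is: should I do that and add the joined n-grams before feeding my text to TfidfVectorizer?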
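A likely cause of the empty result is string escaping rather than the regex itself: in a non-raw Python string literal, `\b` is the backspace character, so the pattern never reaches the regex engine as a word boundary. The sketch below contrasts the two spellings and then builds word-level bigrams by sliding a window over the tokens. The sliding-window step is an illustration of the idea and is not scikit-learn's own code.

```python
import re

text = "this is a sentence! this is another one."

# Without the r prefix, "\b" is the backspace character (\x08), not a word boundary.
non_raw = re.findall("(?u)\b\\w\\w+\b", text)

# With a raw string, \b reaches the regex engine as a word-boundary assertion.
tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Word-level bigrams: pair each token with the one that follows it.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(non_raw)
print(tokens)
print(bigrams)
```

Read this way, the token pattern only defines what counts as a single token (here, words of two or more characters); n-grams are assembled from consecutive tokens afterwards, which in scikit-learn is governed by the `ngram_range` parameter.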

python regex n-gram scikit-learn

nik*_*osd

2017 05-23

Score: 6 | Answers: 1 | Views: 2127

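Fastest way to compute a function on DataFrame slices by column value (Python pandas)

I am trying to create a column on a DataFrame that contains the minimum of column A (the value column) over the rows where column B (the id column) has a particular value. My code is slow, and I am looking for a faster way to do this. Here is my small function: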

def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] =  f(df[df[id_col] == rid][val_col])
    return df

Example usage:

import pandas as pd
import numpy as np
# create data frame
df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2], 
                   "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})

print df.head(10)
# output
   id_col  val_col
0       0      0.1
1       0      0.2
2       0      0.3
3       1      0.6
4       1      0.4
5       1      0.5
6       2 …
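A common alternative to looping over ids is `groupby(...).transform(...)`, which computes the per-group value once and broadcasts it back to every row. The sketch below reuses the example data and the column names from the post; it is offered as one option and not as a benchmarked answer.

```python
import pandas as pd

df = pd.DataFrame({"id_col": [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   "val_col": [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})

# per-id minimum of val_col, aligned with the original rows
df["offset"] = df.groupby("id_col")["val_col"].transform("min")
```

Passing a different aggregation name or a callable to `transform` plays the role of the `f` argument in the original function.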


python performance pandas

nik*_*osd

lucky-day

Score: 6 | Answers: 1 | Views: 546

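ImportError: No module named pkg_resources when installing matplotlib

This is on CentOS 6.6. I am trying to set up a scientific python environment, and I want to avoid Anaconda. When trying to install matplotlib, I get "ImportError: No module named pkg_resources". The full installation history: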

sudo yum install gcc-c++.x86_64
sudo yum install gcc
sudo yum install atlas atlas-devel lapack-devel blas-devel
sudo yum install python-devel
sudo pip install numpy
sudo pip install scipy
sudo pip install pandas
sudo pip install matplotlib

At the last step, I got the message

Complete output from command python setup.py egg_info:
The required version of distribute (>=0.6.28) is not available,
and can't be installed while this script is running. Please
install a more recent version first, using
'easy_install -U distribute'.

Then I did

sudo pip install --upgrade distribute …


python install centos pip matplotlib

nik*_*osd

2015 07-16

Score: 6 | Answers: 1 | Views: 20k

How to avoid duplicate indices in a pandas DataFrame after concat?

I have two pandas DataFrames and concatenate them:

In[55]: adict  = {'a':[0, 1]}
        bdict = {'a': [2, 3]}
        dfa = DataFrame(adict)
        dfb = DataFrame(bdict)
        dfab = pd.concat([dfa,dfb])

The problem is that the resulting DataFrame has duplicate indices.

In [56]: dfab.head()

Out[56]:
                a
          0     0
          1     1
          0     2
          1     3

How can I get a single running index in the resulting DataFrame, i.e.

In [56]: dfab.head()

Out[56]:
                a
          0     0
          1     1
          2     2
          3     3
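A minimal sketch of one way to get a running index: `pd.concat` accepts `ignore_index=True`, which discards the input indices and numbers the result from 0. The variable names follow the post.

```python
import pandas as pd

dfa = pd.DataFrame({"a": [0, 1]})
dfb = pd.DataFrame({"a": [2, 3]})

# ignore_index=True replaces the original indices with 0..n-1
dfab = pd.concat([dfa, dfb], ignore_index=True)
```

For a DataFrame that has already been concatenated, `dfab.reset_index(drop=True)` should give the same result.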


python pandas

nik*_*osd

lucky-day

Score: 4 | Answers: 1 | Views: 1121

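How are term frequencies computed in scikit-learn CountVectorizer

I do not understand how CountVectorizer computes term frequencies. I need to know this so that I can make a sensible choice for the max_df parameter when filtering terms out of the corpus. Here is sample code: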

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df = 1, max_df = 0.9)
    X = vectorizer.fit_transform(['afr bdf dssd','afr bdf c','afr'])
    word_freq_df = pd.DataFrame({'term': vectorizer.get_feature_names(), 'occurrences':np.asarray(X.sum(axis=0)).ravel().tolist()})
    word_freq_df['frequency'] = word_freq_df['occurrences']/np.sum(word_freq_df['occurrences'])
    print word_freq_df.sort('occurrences',ascending = False).head()

       occurrences  term  frequency
    0            3   afr   0.500000
    1            2   bdf   0.333333
    2            1  dssd   0.166667

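It appears that 'afr' accounts for half of the term occurrences in my corpus, as I would expect from looking at the corpus. However, when I pass max_df = 0.8 to CountVectorizer, the term 'afr' is filtered out of my corpus. Playing around, I find that with the corpus in my example, CountVectorizer assigns 'afr' a frequency of ~0.833. Could someone provide a formula for how the term frequency that enters max_df is computed?

Thanks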
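For context, the current scikit-learn documentation describes `max_df` in terms of document frequency (the share of documents containing a term), not the share of all term occurrences. The sketch below computes that quantity by hand for the example corpus, using only the standard library and a token pattern that keeps words of two or more characters. It does not reproduce the ~0.833 figure reported above, which may reflect the older version in use, so treat it as an illustration of the documented definition only.

```python
import re

corpus = ["afr bdf dssd", "afr bdf c", "afr"]
token_pattern = re.compile(r"(?u)\b\w\w+\b")  # keeps tokens of 2+ characters

def document_frequency(docs):
    # fraction of documents in which each term appears at least once
    counts = {}
    for doc in docs:
        for term in set(token_pattern.findall(doc.lower())):
            counts[term] = counts.get(term, 0) + 1
    return {term: n / len(docs) for term, n in counts.items()}

df_by_term = document_frequency(corpus)

# terms whose document frequency is strictly above max_df are dropped
max_df = 0.8
kept = sorted(term for term, freq in df_by_term.items() if freq <= max_df)
```

Under this definition 'afr' occurs in every document, so its document frequency is 1.0 and it is removed for any `max_df` below 1.0.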

python tf-idf scikit-learn

nik*_*osd

lucky-day

Score: 3 | Answers: 1 | Views: 5105

Slicing an array of arrays in Julia

In Julia, I have an array of arrays, say:

    arr = Array(Array{Float64,1},3)
    for i = 1:3
        arr[i] = [i,-i]
    end

Now:

   arr[:][1]
   2-element Array{Float64,1}:
      1.0
     -1.0

and

   arr[1][:]
    2-element Array{Float64,1}:
     1.0
    -1.0

It seems that the only way to get the first "column" is through a comprehension

    pluses = [arr[i][1] for i=1:length(arr)]
    3-element Array{Any,1}:
     1.0
     2.0
     3.0

Is that really the only way? Am I losing speed by running a for loop instead of some "vectorized" version, or does it not matter in Julia because of the different compiler?

arrays julia

nik*_*osd

lucky-day

Score: 3 | Answers: 2 | Views: 3131

Tag statistics

python ×8

pandas ×3

matplotlib ×2

scikit-learn ×2

arrays ×1

centos ×1

install ×1

julia ×1

legend ×1

n-gram ×1

nlp ×1

nltk ×1

performance ×1

pip ×1

r ×1

regex ×1

tf-idf ×1
