Posts by Fri*_*ten

Corpus/stopwords not found when importing the nltk library

I am trying to import the nltk package in Python 2.7:

  import nltk
  stopwords = nltk.corpus.stopwords.words('english')
  print(stopwords[:10])

Running this gives me the following error:

LookupError: 
**********************************************************************
Resource 'corpora/stopwords' not found.  Please use the NLTK
Downloader to obtain the resource:  >>> nltk.download()

So I opened my Python terminal and ran the following:

import nltk  
nltk.download()

This gave me:

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

However, the download never seems to finish, and running the script again still gives me the same error. Any ideas what is going wrong?

python nltk

Score 36, 5 answers, 50k views

Data scaled with sklearn StandardScaler does not have zero mean

I have the following code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

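# `df` is assumed to be loaded already; the loading step is not shown in the post
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

# .iloc replaces .ix, which was removed in pandas 1.0
X = df.iloc[:,0:4].values
y = df.iloc[:,4].values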

Next I scale the data and compute the mean:

X_std = StandardScaler().fit_transform(X)
mean_vec = np.mean(X_std, axis=0)

What I don't get is why my output looks like this:

[ -4.73695157e-16  -6.63173220e-16   3.31586610e-16  -2.84217094e-16]

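I don't really understand how these values can be anything other than 0. If I scale the data, the mean should be 0, right?

Could anyone explain what is happening here?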
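One likely explanation is floating-point rounding: values on the order of 1e-16 are zero to within double precision. The sketch below is only an illustration and does not call StandardScaler. It standardizes a short list by hand with the standard library, using the population standard deviation as StandardScaler does, and then checks the mean against zero with a tolerance. The sample values and the helper name `standardize` are made up for the example.

```python
import math
import statistics


def standardize(values):
    """Return (x - mean) / std for each x, using the population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


# Hypothetical sample values, not taken from the post's data file.
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]
scaled = standardize(sample)
scaled_mean = sum(scaled) / len(scaled)

# The mean may come out as a tiny number such as 1e-16 rather than exactly 0,
# so compare against zero with a tolerance instead of using ==.
print(scaled_mean)
print(math.isclose(scaled_mean, 0.0, abs_tol=1e-12))
```

With NumPy arrays, the same kind of tolerance check could be written as `np.allclose(mean_vec, 0)`.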

python numpy pandas scikit-learn

Score 7, 1 answer, 2592 views

Adding an extra point to a ggplot2 plot

I created a plot of Sepal.Length against Sepal.Width (using the iris dataset) with ggplot2.

  ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, col = Species)) + geom_point()

This works fine, but now I would like to add a single separate blue point to the plot. For example:

  df = data.frame(Sepal.Width = 5.6, Sepal.Length = 3.9) 

Any ideas on how to achieve this?

r ggplot2

Score 5, 3 answers, 20k views

"No such file or directory" error when configuring aws on Cygwin

I downloaded Cygwin and Python 2.5. Now I want to set up a deep learning machine on AWS (following this tutorial: https://www.youtube.com/watch?v=8rjRfW4JM2I).

If I run pip install awscli I get this (which is fine):

$ pip install awscli
Requirement already satisfied: awscli in c:\users\marc\anaconda2\lib\site-packages
Requirement already satisfied: s3transfer<0.2.0,>=0.1.9 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: rsa<=3.5.0,>=3.1.2 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: PyYAML<=3.12,>=3.10 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: docutils>=0.10 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: botocore==1.4.92 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: colorama<=0.3.7,>=0.2.5 in c:\users\marc\anaconda2\lib\site-packages (from awscli)
Requirement already satisfied: futures<4.0.0,>=2.2.0; python_version == "2.6" or python_version == …

cygwin amazon-web-services

Score 5, 3 answers, 6913 views

Unable to get the latitude and longitude values of tweets

I collected some Twitter data:

#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

#set radius and amount of requests
N=200  # tweets to request from each query
S=200  # radius in miles

lats=c(38.9,40.7)
lons=c(-77,-74)

roger=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Roger+Federer',
    lang="en",n=N,resultType="recent",
    geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))

After this I ran:

rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))  

rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))  

data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))

Now I want to get all tweets that have lon and lat values:

data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)

But now I only get NA values. Any ideas what is going wrong here?

r text-mining tm

Score 3, 1 answer, 1453 views

Getting the maximum value from an array

I have an array that looks like this:

Dim values(1 To 3) As String

values(1) = Sheets("risk_cat_2").Cells(4, 6).Value
values(2) = Sheets("risk_cat_2").Cells(5, 6).Value
values(3) = Sheets("risk_cat_2").Cells(6, 6).Value

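What I would like to do now is get the maximum of all the values in the string array. Is there a simple way in VBA to get the maximum value from an array?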

excel vba excel-vba

Score 3, 2 answers, 30k views

Using naive Bayes to predict new values

I have a data frame that looks like this:

weather <- c("good", "good", "good", "bad", "bad", "good")
temp <- c("high", "low", "low", "high", "low", "low")
golf <- c("yes", "no", "yes", "no", "yes" , "no")
df <- data.frame(weather, temp, golf)

What I would like to do now is use the naive Bayes method to get the class probabilities for this new data set:

df_new <- data.frame(weather = "good", temp = "low")

So I tried:

library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new)

But this gives me:

NO

Any idea how I can turn this into probabilities?

r naivebayes

Score 2, 1 answer, 6320 views

Creating a vocabulary dictionary for text mining

I have the following code:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

Now I am trying to count the word frequencies like this:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

Next I want to print the vocabulary, so I do this:

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary

The output I get is None, although I was expecting something like this:

{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

Any ideas what is going wrong?
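In scikit-learn, attributes learned during fitting carry a trailing underscore. The fitted mapping is therefore `vocabulary_`, while `vocabulary` is the constructor parameter and stays None unless one is passed in. A minimal sketch, assuming a scikit-learn install and reusing the post's `train_set` (written with Python 3 `print`, unlike the Python 2 code in the post):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
vectorizer.fit(train_set)

# `vocabulary` is the constructor argument (None by default);
# `vocabulary_` is the term -> column index mapping learned by fit().
print(vectorizer.vocabulary)
vocab = {term: int(index) for term, index in vectorizer.vocabulary_.items()}
print(sorted(vocab.items(), key=lambda item: item[1]))
```

With default settings the learned vocabulary also contains "is" and "the". Passing `stop_words='english'` to `CountVectorizer` is one way to leave such words out.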

python nlp text-mining

Score 2, 1 answer, 8415 views

Strange error when trying to shuffle a dataset

I am trying to shuffle my data with the following code:

import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv('dataset.txt')
np.random.shuffle(data)

However, running it gives me the following error, and I do not understand where it comes from:

Traceback (most recent call last):
  File "sample2.py", line 12, in <module>
    np.random.shuffle(data)
  File "mtrand.pyx", line 4668, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30498)
  File "mtrand.pyx", line 4671, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30438)
  File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
    return self._getitem_column(key)
  File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
    result = result[key]
  File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
    return self._getitem_column(key)
  File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1999, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.py", …
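The traceback passes through `DataFrame.__getitem__`, which suggests that `np.random.shuffle` indexes the frame with integers and pandas then treats them as column labels. One common alternative, sketched here on the assumption that the goal is to shuffle rows, is `DataFrame.sample(frac=1)`. The small frame below is made-up stand-in data, since the contents of dataset.txt are not shown in the post.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('dataset.txt').
data = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# sample(frac=1) returns all rows in random order; random_state makes the
# order reproducible, and reset_index(drop=True) renumbers the rows 0..n-1.
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

Dropping `random_state` gives a different order on each run.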

python pandas

Score 1, 1 answer, 944 views