Posts by tkj*_*kja

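Can the linestyle of matplotlib error bars be set?

Is it possible to give matplotlib error bars the same linestyle as the line through the data points?

In the example below, two lines are plotted, and one of them is dash-dotted because of the ls='-.' parameter. However, the error bars are solid lines. Is it possible to modify the style/appearance of the error bars to match the resulting line?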

import matplotlib.pyplot as plt
import numpy as np

x = np.array(range(0,10))
y = np.array(range(0,10))
yerr = np.array(range(1,11)) / 5.0
yerr2 = np.array(range(1,11)) / 4.0

y2 = np.array(range(0,10)) * 1.2

plt.errorbar(x, y, yerr=yerr, lw=8, errorevery=2, ls='-.')
plt.errorbar(x, y2, yerr=yerr2, lw=8, errorevery=3)
plt.show()
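One possible approach, sketched here under assumptions rather than taken from the post: `errorbar` returns a container that unpacks into the data line, the cap lines and the error-bar line collections, and the linestyle can be set on those collections after plotting. The variable names `plotline`, `caplines` and `barlinecols` are illustrative, and the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = np.arange(10)
yerr = np.arange(1, 11) / 5.0

fig, ax = plt.subplots()

# errorbar returns (data line, cap lines, error-bar line collections)
plotline, caplines, barlinecols = ax.errorbar(
    x, y, yerr=yerr, lw=8, errorevery=2, ls="-."
)

# The vertical bars are drawn as LineCollection objects; restyle each one
# so that it matches the dash-dotted data line.
for bars in barlinecols:
    bars.set_linestyle("-.")
```

With an interactive backend, calling `plt.show()` afterwards should display the dash-dotted error bars.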

python plot matplotlib

tkj*_*kja

2014-04-11

19
Votes

1
Answers

10k
Views

Java Stream Collectors.toMap where the value is a Set

I want to use a Java Stream to run over a list of POJOs, such as the List<A> below, and convert it into a Map<String, Set<String>>.

For example, class A is:

class A {
    public String name;
    public String property;
}

I wrote the code below, which collects the values into a Map<String, String>:

final List<A> as = new ArrayList<>();
// the list as is populated ...

// works if there are no duplicates for name
final Map<String, String> m = as.stream().collect(Collectors.toMap(x -> x.name, x -> x.property));

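However, because there may be several POJOs with the same name, I want the value of the map to be a Set. All property strings belonging to the same name key should go into the same set.

How can this be done?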

// how do i create a stream such that all properties of the same name …

java java-8 java-stream

tkj*_*kja

2016-10-06

11
Votes

2
Answers

5014
Views

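get_document_topics and get_term_topics in gensim

The ldamodel in gensim has two methods: get_document_topics and get_term_topics.

Although they are used in this gensim tutorial notebook, I do not fully understand how to interpret the output of get_term_topics, so I created the self-contained code below to show what I mean: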

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# build the corpus, dict and train the model
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, 
                                 random_state=0, chunksize=2, passes=10)

# show the topics
topics …

python gensim topic-modeling

tkj*_*kja

2017-04-22

8
Votes

1
Answers

10k
Views

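Scikit multi-class classification metrics, classification report

I am using scikit-learn 0.15.2 for a multi-class problem. I was getting a lot of DeprecationWarnings when following examples such as "scikit 0.14 multi label metrics", until I started using MultiLabelBinarizer:

"DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."

However, I cannot find a way to get the classification report (precision, recall, f-measure) as I could before, as shown here: scikit 0.14 multi label metrics

I tried to use inverse_transform as shown below, which gives a classification_report but also brings back the warning that this code will break from 0.17 on.

How can I get metrics for a multi-class classification problem?

Example code: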

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Some simple data:

X_train = np.array([[0,0,0], [0,0,1], [0,1,0], [1,0,0], [1,1,1]])
y_train = [[1], [1], [1,2], [2], [2]]

# Use MultiLabelBinarizer and train a multi-class classifier:

mlb = MultiLabelBinarizer(sparse_output=True)
y_train_mlb = mlb.fit_transform(y_train)

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train_mlb)

# classification_report, here I did not find a way to use y_train_mlb, …

python machine-learning scikits scikit-learn

tkj*_*kja

2017-05-23

6
Votes

1
Answers

1762
Views

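Retain numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

I am parsing XML in Java that contains numeric character entity characters such as (but not limited to) `&#10; &#13; &#60; &#62;` (line feed, carriage return, <, >). While parsing, I append the text content of the nodes to a StringBuffer so that I can write it to a text file later.

However, when I write the string to a file or print it, these unicode characters are resolved or converted into line breaks/whitespace.

How can I retain the original numeric character entity notation when iterating over the nodes of an XML file in Java and storing the text content of the nodes in a string?

Example demo xml file: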

<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">    
    <Field attributeWithChar="A string followed by special symbols &#13;  &#10;" />
</ABCD>

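Example Java code. It loads the XML, iterates over the nodes and collects the text content of each node into a StringBuffer. After the iteration is over, it writes the StringBuffer to the console and also to a file (but without the `&#13; &#10;` symbols).

What would be a way to keep these symbols when storing them to a string? Could you please help me? Thank you.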

public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {   
    DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
    Document document = null;
    DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
    document = documentBuilder.parse(new File("path/to/demo.xml"));
    StringBuilder sb …
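A conforming XML parser resolves numeric character references while parsing, so the parsed value holds the actual carriage return and line feed characters. One way to get the notation back is to re-escape those characters when writing the text out. The sketch below illustrates the idea in Python (used for the examples added to this page) with the standard-library parser; it is not the Java DOM code from the question, and the helper name `reescape_control_chars` and the choice of which characters to re-escape are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The demo document from the question, as bytes so the encoding
# declaration is honored by the parser.
xml_source = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ABCD version="2">\n'
    b'    <Field attributeWithChar="A string followed by special symbols &#13;  &#10;" />\n'
    b'</ABCD>\n'
)


def reescape_control_chars(value):
    """Hypothetical helper: turn CR, LF and TAB back into numeric references."""
    return "".join(f"&#{ord(ch)};" if ch in "\r\n\t" else ch for ch in value)


root = ET.fromstring(xml_source)

# After parsing, the attribute holds real '\r' and '\n' characters.
parsed = root.find("Field").get("attributeWithChar")

# Re-escape them before writing the text to a file or the console.
restored = reescape_control_chars(parsed)
print(restored)
```

The same re-escaping step could be applied to the collected text in Java before it is written out; which characters to re-escape is a decision for the application.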

java xml unicode dom sax

tkj*_*kja

2015-03-19

5
Votes

1
Answers

4011
Views

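scikit-learn FeatureUnion grid search over a subset of features

How can I use FeatureUnion in scikit-learn so that a grid search can optionally treat its parts separately?

The code below works and sets up a FeatureUnion with a TfidfVectorizer for words and a TfidfVectorizer for characters.

When doing the grid search, in addition to testing the defined parameter space, I would also like to test only 'vect__wordvect' with the ngram_range parameter (without the TfidfVectorizer for characters), and also only 'vect__lettervect' with the lowercase parameter set to True and False, with the other TfidfVectorizer disabled.

EDIT: Complete code example based on maxymoo's suggestion.

How can this be done?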

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import fetch_20newsgroups

# setup the featureunion
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])

# setup the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = …

python scikit-learn

tkj*_*kja

2016-05-12

5
Votes

1
Answers

1796
Views

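LabelEncoder order of fit for a pandas df

I am fitting a scikit-learn LabelEncoder on a column of a pandas df.

How is the order determined in which the encountered strings are mapped to integers? Is it deterministic?

More importantly, can I specify this order?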

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(list(le.classes_))
### this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
### this prints [0 2 3 1]

I would like le.classes_ to be ["first", "second", "third", "fourth"] and encoded to then be [0 1 2 3], since this is the order in which the strings appear in the column. Can this be done?
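The output above is consistent with LabelEncoder sorting its classes, which is why 'fourth' comes before 'second'. One way to get first-appearance order instead, sketched here in plain Python without LabelEncoder, is to build the mapping by hand. The names `classes` and `code_of` are illustrative and not part of any library; pandas' `factorize` function also assigns codes in order of appearance and may be an alternative.

```python
values = ["first", "second", "third", "fourth", "second"]

# dict.fromkeys keeps insertion order, so duplicates are dropped while
# the order of first appearance is preserved.
classes = list(dict.fromkeys(values))

# Map every label to its position in that ordered list.
code_of = {label: code for code, label in enumerate(classes)}

encoded = [code_of[value] for value in values]

print(classes)   # ['first', 'second', 'third', 'fourth']
print(encoded)   # [0, 1, 2, 3, 1]
```

Decoding is then a list lookup, for example `classes[2]` gives 'third'.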

python pandas scikit-learn

tkj*_*kja

lucky-day

5
Votes

2
Answers

2315
Views

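Removing/adding nested objects in Elasticsearch

I could not find an example in the Elastic manual on nested objects that shows how to modify a document's fields and nested objects using RESTful commands in Kibana Sense. I am looking for something similar to Solr's atomic updates, which allow updating specific fields of a document.

How can this be achieved with RESTful commands in Kibana Sense? The only related information I could find in the manual is Partial Updates to Documents, but I do not know how to apply it to this use case.

For example, taken directly from the Elastic documentation: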

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group": "fans",
  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]
}

How can I remove an entry from the nested object so that document "1" looks like:

{
"group" : "fans",
"user" : …

json elasticsearch kibana

tkj*_*kja

2016-09-27

5
Votes

1
Answers

4204
Views

spaCy and scikit-learn vectorizer

Based on their example, I wrote a lemma tokenizer for scikit-learn using spaCy, and it works fine standalone:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

However, using it in GridSearchCV gives errors; a self-contained example is below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search …

python nlp scikit-learn spacy

tkj*_*kja

lucky-day

5
Votes

1
Answers

2884
Views

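Regex to match from a start label to an empty line or an end label

How can I use a regex to match the content between a start label and either an empty line or an end label?

For example (regex101 link):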

<START> some text is here. 
more text

unrelated text

<START> even more text. 
text text
<STOP>

It should produce two matches:

<START> some text is here. 
more text

and

<START> even more text. 
text text
<STOP>

The regex I have come up with so far is the following (but it matches the whole text because of the (?s).* part).

<START>((?s).*)(\s\s|<STOP>)
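A possible fix, sketched in Python's `re` module as an assumption since the post does not name a regex flavor: make the quantifier lazy so it stops at the first terminator, consume `<STOP>` when it is the terminator, and use a lookahead for the blank line so the blank line itself is not part of the match. The names `text` and `pattern` are illustrative.

```python
import re

text = (
    "<START> some text is here. \n"
    "more text\n"
    "\n"
    "unrelated text\n"
    "\n"
    "<START> even more text. \n"
    "text text\n"
    "<STOP>\n"
)

# .*? is lazy, so each match ends at the first terminator it reaches:
# either a literal <STOP> (included in the match) or a position that is
# followed by a blank line (checked with a lookahead, not consumed).
pattern = re.compile(r"<START>.*?(?:<STOP>|(?=\n[ \t]*\n))", re.DOTALL)

matches = pattern.findall(text)
for match in matches:
    print(repr(match))
```

For the sample text this yields the two blocks listed above; a `<START>` block that reaches the end of the input without a blank line or `<STOP>` would not match with this pattern.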

regex

tkj*_*kja

lucky-day

3
Votes

1
Answers

4959
Views

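Extract the best pipeline from GridSearchCV for cross_val_predict

How can I extract the best pipeline from a fitted GridSearchCV so that I can pass it on to cross_val_predict?

Passing the fitted GridSearchCV object directly makes cross_val_predict run the whole grid search again; I just want the best pipeline to be evaluated by cross_val_predict.

My self-contained code is below: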

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

# fetch data
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), categories=['comp.graphics', 'rec.sport.baseball', 'sci.med'])
X = newsgroups.data
y = newsgroups.target

# setup and run GridSearchCV
wordvect = TfidfVectorizer(analyzer='word', lowercase=True)
classifier = OneVsRestClassifier(SVC(kernel='linear', class_weight='balanced')) …
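A minimal sketch of one way this could work, assuming a recent scikit-learn where GridSearchCV lives in `sklearn.model_selection`: with the default `refit=True`, a fitted search exposes the winning pipeline as `best_estimator_`, which can be handed to `cross_val_predict` on its own. The synthetic data, the scaler plus logistic regression pipeline and the parameter grid are stand-ins chosen to keep the example small; they are not the 20 newsgroups setup from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for the text data (an assumption).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

# best_estimator_ is the pipeline configured with the best parameters.
best_pipeline = search.best_estimator_

# Only this single pipeline is refit per fold; the grid is not searched again.
predictions = cross_val_predict(best_pipeline, X, y, cv=3)
print(search.best_params_, predictions.shape)
```

Note that evaluating the selected pipeline on the same data used for the search tends to give optimistic scores; nested cross-validation is the usual way around that.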

python machine-learning scikit-learn grid-search

tkj*_*kja

lucky-day

3
Votes

1
Answers

4789
Views

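Why does this inotifywait shell script show up with two PIDs?

I am learning to use inotifywait, specifically by using the script from: https://unix.stackexchange.com/questions/24952/script-to-monitor-folder-for-new-files. What I do not understand is why my script always appears twice in the process list when I use pid x.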

36285 pts/1    S+     0:00 /bin/bash ./observe2.sh /home/user1/testfolder
36286 pts/1    S+     0:00 inotifywait -m /home/user1/testfolder -e create -e moved_to
36287 pts/1    S+     0:00 /bin/bash ./observe2.sh /home/user1/testfolder

To test faster, I changed the linked script so that any folder to observe can be passed via $1, and saved it as observe2.sh:

#!/bin/bash
inotifywait -m $1 -e create -e moved_to |
    while read path action file; do
        echo "The file '$file' appeared in directory '$path' via '$action'"
        # do something with the file
    done

Why does the script process show up twice? Is there a fork somewhere in the process? Can someone explain why there are two processes?

unix linux bash inotifywait

tkj*_*kja

2017-04-13

1
Votes

1
Answers

603
Views

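GridSearchCV scoring and grid_scores_

I am trying to understand how to get the values of the scorer for GridSearchCV. The example code below sets up a small pipeline on text data.

Then it sets up a grid search over different n-grams.

Scoring is done via the f1 measure: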

#setup the pipeline
tfidf_vec = TfidfVectorizer(analyzer='word', min_df=0.05, max_df=0.95)
linearsvc = LinearSVC()
clf = Pipeline([('tfidf_vec', tfidf_vec), ('linearsvc', linearsvc)])

# setup the grid search
parameters = {'tfidf_vec__ngram_range': [(1, 1), (1, 2)]}
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1, scoring='f1')
gs_clf = gs_clf.fit(docs_train, y_train)

Now I can print the scores with:

print(gs_clf.grid_scores_)

[mean: 0.81548, std: 0.01324, params: {'tfidf_vec__ngram_range': (1, 1)},
 mean: 0.82143, std: 0.00538, params: {'tfidf_vec__ngram_range': (1, 2)}]

print(gs_clf.grid_scores_[0].cv_validation_scores)

array([ 0.83234714,  0.8       ,  0.81409002])

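From the documentation it is not clear to me:

Is gs_clf.grid_scores_[0].cv_validation_scores an array with the score for each fold, as defined by the scoring parameter (in this case the f1 measure per fold)? If not, what is it then?
If I choose another metric instead …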

python scikit-learn

tkj*_*kja

lucky-day

1
Votes

1
Answers

6003
Views

Tag statistics

python ×8

scikit-learn ×6

java ×2

machine-learning ×2

bash ×1

dom ×1

elasticsearch ×1

gensim ×1

grid-search ×1

inotifywait ×1

java-8 ×1

java-stream ×1

json ×1

kibana ×1

linux ×1

matplotlib ×1

nlp ×1

pandas ×1

plot ×1

regex ×1

sax ×1

scikits ×1

spacy ×1

topic-modeling ×1

unicode ×1

unix ×1

xml ×1
