Tag: feature-extraction

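Thread pool in Python is not as fast as expected

I am a beginner in Python and machine learning. I am trying to reproduce the behavior of CountVectorizer() using multithreading. I am working with the Yelp dataset and running LogisticRegression on it. This is what I have written so far:

Code snippet: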

from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars'] 


y = []
def product_helper(args):
    return featureExtraction(*args)


def featureExtraction(p,t):     
    temp = [0] * len(bag_of_words)
    for word in p.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1

    return temp


# function to be mapped over
def calculateParallel(threads): 
    pool = ThreadPool(threads)
    job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
    l …

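A likely reason the thread pool does not help: multiprocessing.dummy.Pool runs threads, and in CPython the global interpreter lock keeps CPU-bound pure-Python code such as featureExtraction from running in parallel. Separately, bag_of_words.index(word) scans the whole list on every hit. The sketch below is not the asker's code; it assumes bag_of_words is a list of unique words and swaps the list scan for a dictionary lookup. The names build_index and feature_extraction are made up for illustration.

```python
def build_index(bag_of_words):
    """Map each vocabulary word to its column position (assumes unique words)."""
    return {word: i for i, word in enumerate(bag_of_words)}


def feature_extraction(text, word_to_index):
    """Count vocabulary words in one document using constant-time lookups."""
    counts = [0] * len(word_to_index)
    for word in text.split():
        idx = word_to_index.get(word)
        if idx is not None:
            counts[idx] += 1
    return counts
```

With the lookup cost removed, a single-threaded loop over the reviews may already be fast enough. If parallelism is still needed for CPU-bound work, a process pool (multiprocessing.Pool) is the usual candidate rather than a thread pool.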

python multithreading machine-learning feature-extraction

bha*_*udi

2016 11-26

3
votes

1
answer

3216
views

Computing the same features with multiple training windows in Featuretools

Featuretools already supports handling multiple cutoff times: https://docs.featuretools.com/automated_feature_engineering/handling_time.html

In [20]: temporal_cutoffs = ft.make_temporal_cutoffs(cutoffs['customer_id'],
   ....:                                             cutoffs['cutoff_time'],
   ....:                                             window_size='3d',
   ....:                                             num_windows=2)
   ....: 

In [21]: temporal_cutoffs
Out[21]: 
        time  instance_id
0 2011-12-12        13458
1 2011-12-15        13458
2 2012-10-02        13602
3 2012-10-05        13602
4 2012-01-22        15222
5 2012-01-25        15222

In [22]: entityset = ft.demo.load_retail()

In [23]: feature_tensor, feature_defs = ft.dfs(entityset=entityset,
   ....:                                       target_entity='customers',
   ....:                                       cutoff_time=temporal_cutoffs,
   ....:                                       cutoff_time_in_index=True,
   ....:                                       max_features=4)
   ....: 

In [24]: feature_tensor
Out[24]: 
                        MAX(order_products.total)  MIN(order_products.unit_price)  STD(order_products.quantity)  COUNT(order_products)
customer_id time                                                                                                                      
13458.0     2011-12-12                    201.960                          0.3135                     10.053804                    394
            2011-12-15                    201.960                          0.3135                     10.053804                    394 …


python feature-extraction pandas feature-engineering featuretools

Geo*_*ler

2018 10-02

3
votes

1
answer

622
views

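How to approach a text classification problem when multiple features are involved

I am working on a text classification problem in which several text features are used to build a model that predicts a salary range. Please refer to the sample dataset. Most resources and tutorials only cover feature extraction on a single column followed by predicting the target. I understand the process of text preprocessing, feature extraction (CountVectorizer or TF-IDF), and applying an algorithm.

In this problem I have multiple input text features. How should a text classification problem be handled when multiple features are involved? These are the approaches I have already tried, but I am not sure whether they are correct. Please share your input or suggestions.

1) Apply data cleaning to each feature separately, then TF-IDF, then logistic regression. Here I tried to see whether I could classify using just one feature.

2) Apply data cleaning to all columns separately, then apply TF-IDF to each feature, then merge all the feature vectors into a single feature vector. Finally, logistic regression.

3) Apply data cleaning to all columns separately and merge all the cleaned columns into one feature, "merged_text". Then apply TF-IDF to this merged text, followed by logistic regression.

All three approaches gave me roughly 35-40% accuracy in cross-validation and on the test set. I was expecting at least 60% accuracy on the held-out test set.

Also, I don't understand how to use "company_name" and "experience" together with the text data. company_name has more than 2,000 unique values. Please provide input or pointers on how to handle numeric data in a text classification problem.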
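Approach 2, extended with company_name and experience, can be sketched with scikit-learn's ColumnTransformer. This is only an illustration under stated assumptions: the text column names "job_description" and "job_title" and the toy rows are hypothetical, since the original dataset is not shown. One-hot encoding for company_name and scaling for experience are one reasonable choice, not the only one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; "job_description" and "job_title" are hypothetical column names.
df = pd.DataFrame({
    "job_description": ["build ml models", "design web pages",
                        "train deep models", "write css and html"],
    "job_title": ["data scientist", "web designer",
                  "ml engineer", "frontend developer"],
    "company_name": ["acme", "globex", "acme", "initech"],
    "experience": [5.0, 2.0, 7.0, 1.0],
})
salary_range = ["high", "low", "high", "low"]

# Each column gets its own transformer; the outputs are stacked side by side.
# Text vectorizers take a single column name, the others take a list of columns.
preprocess = ColumnTransformer([
    ("desc", TfidfVectorizer(), "job_description"),
    ("title", TfidfVectorizer(), "job_title"),
    ("company", OneHotEncoder(handle_unknown="ignore"), ["company_name"]),
    ("exp", StandardScaler(), ["experience"]),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(df, salary_range)
predictions = model.predict(df)
```

Wrapping everything in one Pipeline means the vectorizers are refit inside each cross-validation fold, which avoids leaking vocabulary from the validation data.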

python nlp feature-extraction text-classification

Che*_*mbi

lucky-day

3
votes

1
answer

2264
views

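How to implement BRISK with Python and OpenCV to detect features?

I want to implement BRISK in Python with OpenCV for feature detection and description in UAV (drone) images.

Since BRISK is also a descriptor, I want to use its descriptors to match two images.

How can I do this?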

python opencv feature-extraction feature-detection mser

Dai*_*ary

2021 01-24

3
votes

1
answer

6775
views

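Feature vector representation of documents

I am building a document classifier to categorize documents.

So the first step is to represent each document as a "feature vector" for training purposes.

After some research, I found that I can use either the Bag of Words approach or the N-gram approach to represent a document as a vector.

The text in each document (scanned PDFs and images) is retrieved using OCR, so some words contain errors. I have no prior knowledge of the language used in these documents, so I can't use stemming.

As far as I understand, I have to use the n-gram approach. Or are there other ways to represent a document?

I would also appreciate it if someone could link me to an N-gram guide so I can get a clearer picture of how it works.

Thanks in advance.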
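To make the n-gram idea concrete, here is a small sketch of character n-grams, which need no language-specific tools and tend to degrade gracefully when OCR corrupts a few characters of a word. The function names char_ngrams and to_vector are hypothetical, and the whitespace and lowercase normalization is an assumption, not something the question specifies.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed, ordered vocabulary of n-grams."""
    return [counts.get(gram, 0) for gram in vocabulary]
```

For example, char_ngrams("abab", 2) counts "ab" twice and "ba" once. A misread character only changes the few n-grams that overlap it, so the rest of the vector still matches the clean word.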

algorithm machine-learning feature-extraction document-classification

TeF*_*eFa

lucky-day

2
votes

1
answer

1499
views

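Does it make sense to use both CountVectorizer and TfidfVectorizer as feature vectors for KMeans text clustering?

I am trying to build feature vectors from my CSV file, which contains about 1,000 reviews. One of my feature vectors is tf-idf, computed with scikit-learn's TfidfVectorizer. Does it make sense to also use counts as a feature vector, or is there a better feature vector I should use?

If I end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit both of them into my KMeans model (specifically the km.fit() part)? Currently I can only fit the tf-idf feature vectors to the model.

Here is my code: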

vectorizer=TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)

#count_vectorizer=CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized=count_vectorizer.fit_transform(sentence_list)

km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)

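Mechanically, two sparse feature matrices can be fed to km.fit() by stacking them column-wise with scipy.sparse.hstack. The sketch below shows only that mechanism; the sample sentences are made up. Whether it helps is a separate question: tf-idf is a reweighting of the same counts, so the combined matrix is largely redundant and the unnormalized counts may dominate the distances.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-in for the reviews loaded from the CSV file.
sentence_list = [
    "great food and friendly staff",
    "terrible service and cold food",
    "friendly staff and great coffee",
    "cold coffee and terrible staff",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentence_list)
counts = CountVectorizer(stop_words="english").fit_transform(sentence_list)

# Stack the two matrices side by side: same rows (documents), more columns.
combined = hstack([tfidf, counts]).tocsr()

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
km.fit(combined)
```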

python machine-learning feature-extraction scipy scikit-learn

jxn*_*jxn

2014 12-17

2
votes

1
answer

3241
views

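Grayscale segmentation / feature extraction / blob detection?

I am trying to find a starting point, but I can't seem to find the right answer. I would really appreciate your guidance. I also don't know the correct terminology, hence the title.

I took a picture of a bag with a black background behind it.
I want to extract the bag, similar to this.
If possible, find the center, like this.

Basically, I want to be able to extract the blob of pixels and then find its center point.

I know these are two different questions, but I figured that if someone can do the latter, they can do the first. I am using MATLAB, but I want to write my own code without using image processing functions like edge(). What methods or algorithms can I use? Any papers or links would be great (: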
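With a dark, uniform background, one simple starting point is a global intensity threshold followed by the mean of the foreground pixel coordinates. The sketch below uses Python with NumPy rather than MATLAB, and the function name blob_centroid and the fixed threshold are assumptions for illustration. The same two steps translate directly to MATLAB array operations without any toolbox functions.

```python
import numpy as np


def blob_centroid(gray, threshold):
    """Threshold a grayscale image and return the mask and the (row, col) centroid."""
    mask = gray > threshold            # foreground = pixels brighter than the background
    rows, cols = np.nonzero(mask)      # coordinates of all foreground pixels
    if rows.size == 0:
        return mask, None              # nothing above the threshold
    return mask, (rows.mean(), cols.mean())
```

This treats every bright pixel as part of one blob. If the image contains noise or several objects, a connected-component labeling step would be needed before taking the centroid.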

matlab image-processing feature-extraction feature-detection image-segmentation

Bad*_*mer

lucky-day

2
votes

1
answer

10k
views

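What is the difference between feature engineering and feature extraction?

I am struggling to find the difference between these two concepts. As far as I understand, both refer to transforming raw data into more comprehensive features that describe the problem at hand. Are they the same thing? If not, could you provide examples of both?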

machine-learning data-mining feature-extraction

Stu*_*SQL

2016 08-28

2
votes

1
answer

3705
views

How to count vowels and consonants (uppercase and lowercase) in a pandas DataFrame?

This is my data

No  Body
1   DaTa, Analytics
2   StackOver.


This is my expected output

No  Body                 Vowels   Consonant  
1   DaTa, Analytics.     5        8        
2   StackOver.           3        6

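One possible approach is Series.str.count with case-insensitive character classes. This is a sketch that assumes plain ASCII letters and treats "y" as a consonant, which is what the expected output above implies.

```python
import re

import pandas as pd

df = pd.DataFrame({"No": [1, 2], "Body": ["DaTa, Analytics", "StackOver."]})

# Count regex matches per row; punctuation and spaces match neither class.
df["Vowels"] = df["Body"].str.count(r"[aeiou]", flags=re.IGNORECASE)
df["Consonant"] = df["Body"].str.count(r"[b-df-hj-np-tv-z]", flags=re.IGNORECASE)
```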

python regex text feature-extraction pandas

Nab*_*zir

2018 07-17

2
votes

1
answer

502
views

Why doesn't feature extraction on text return all possible feature names?

Here is a code snippet from the book Natural Language Processing with PyTorch:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']
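one_hot_vectorizer = CountVectorizer()
one_hot_vectorizer.fit(corpus)  # the vocabulary only exists after fitting
vocab = one_hot_vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn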


The value of vocab:

vocab = ['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']


Why is 'a' missing from the extracted feature names? If it was automatically excluded as too common a word, why wasn't 'an' excluded for the same reason? How can I make .get_feature_names() filter other words as well?

python nlp feature-extraction scikit-learn pytorch

use*_*115

2019 03-04

2
votes

1
answer

92
views

Tag statistics

feature-extraction ×10

python ×7

machine-learning ×4

feature-detection ×2

nlp ×2

pandas ×2

scikit-learn ×2

algorithm ×1

data-mining ×1

document-classification ×1

feature-engineering ×1

featuretools ×1

image-processing ×1

image-segmentation ×1

matlab ×1

mser ×1

multithreading ×1

opencv ×1

pytorch ×1

regex ×1

scipy ×1

text ×1

text-classification ×1
