Tag: feature-extraction

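How do I get feature names when using HashingVectorizer in Python?

I want to build a 2D binary array (n_samples, n_features), where each sample is a text string and each feature is a word (unigram).

The problem is that there are 350000 samples and 40000 features, but I only have 4 GB of RAM.

I get a memory error when using CountVectorizer. Is there another way to do this, such as mini-batches?
If I use HashingVectorizer, how do I get the feature names, i.e. which column corresponds to which feature? HashingVectorizer has no get_feature_names() method.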
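
One possible direction, sketched under assumptions: HashingVectorizer is stateless, so batches can be transformed independently and stacked as sparse matrices. It stores no vocabulary, so column names can only be recovered by hashing the known tokens again. The toy `docs`, `batch_size`, and `n_features=2**10` below are illustrative values, not taken from the question.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Toy stand-in for the 350000 text samples (assumption for illustration).
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

# binary=True with norm=None and alternate_sign=False gives plain 0/1 entries.
vec = HashingVectorizer(n_features=2**10, binary=True, norm=None,
                        alternate_sign=False)

# No fit step is needed, so each mini-batch can be transformed on its own.
batch_size = 2
X = vstack(
    [vec.transform(docs[i:i + batch_size]) for i in range(0, len(docs), batch_size)],
    format="csr",
)

# Rebuild a column -> tokens map by hashing every token seen in the corpus.
# Several tokens may share a column because hashing can collide.
analyzer = vec.build_analyzer()
vocab = sorted({tok for doc in docs for tok in analyzer(doc)})
col_to_tokens = {}
for tok in vocab:
    col = int(vec.transform([tok]).indices[0])
    col_to_tokens.setdefault(col, []).append(tok)

print(X.shape, col_to_tokens)
```

The matrix stays sparse throughout, which is what keeps memory use low; the mapping pass needs a second sweep over the text to collect the tokens.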

feature-extraction scikit-learn

deb*_*hya

2014 04-04

5
votes

1
answer

3192
views

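R - Using SVD to get a matrix with a reduced number of features

I'm using the svd package with R, and I can reduce the dimensionality of my matrix by replacing the lowest singular values with 0. But when I recompose the matrix I still have the same number of features, and I can't find how to effectively delete the most useless features of the source matrix in order to reduce its number of columns.

For example, here is what I'm doing for now:

This is my source matrix A: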


If I do:

s = svd(A)
s$d[3:4] = 0  # Replacement of the 2 smallest singular values by 0
A' = s$u %*% diag(s$d)  %*% t(s$v)


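I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components", and is an approximation of A (containing a little less information, maybe less noise, etc.):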

      [,1]     [,2]      [,3]     [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 …
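
A minimal sketch of the distinction, in Python/NumPy rather than R and on a random stand-in matrix (the question's actual A is not reproduced here): zeroing singular values gives a rank-k approximation with the original shape, whereas projecting onto the top k right singular vectors gives a matrix with only k columns. The names `A_approx` and `A_reduced` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))  # stand-in for the 4x4 source matrix (assumption)

U, d, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of components to keep

# Rank-k reconstruction: same 4x4 shape as A, as observed in the question.
A_approx = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

# Reduced representation: 4 rows, only k columns (scores on the top-k directions).
A_reduced = U[:, :k] * d[:k]  # equals A @ Vt[:k, :].T

print(A_approx.shape, A_reduced.shape)
```

The new columns are linear combinations of the original features, not a subset of them; choosing which original columns to drop is a separate feature-selection problem.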


r feature-extraction svd dimensionality-reduction matrix-factorization

Cly*_*deX

2014 05-23

5
votes

1
answer

2741
views

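Feature detectors and descriptors for low-resolution images

I am working with sequences of low-resolution (VGA), JPEG-compressed images for visual navigation on a mobile robot. At the moment I use SURF to detect keypoints and extract descriptors from the images, and FLANN to track them. I get 4000-5000 features per image, and usually 350-450 matches per pair of consecutive images, before applying RANSAC (which usually reduces the number of matches by 20%).

I am trying to increase the number (and quality) of the matches. I have tried two other detectors: SIFT and ORB. SIFT noticeably increases the number of features (35% more tracked features overall), but it is much slower. ORB extracts roughly as many features as SURF, but the matching performance is much poorer (about 100 matches in the best cases). My OpenCV implementation of ORB is: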

cv::ORB orb = cv::ORB(10000, 1.2f, 8, 31);
orb(frame->img, cv::Mat(), im_keypoints, frame->descriptors);
frame->descriptors.convertTo(frame->descriptors, CV_32F); //so that is the same type as m_dists


Then, when matching:

cv::Mat m_indices(descriptors1.rows, 2, CV_32S);
cv::Mat m_dists(descriptors1.rows, 2, CV_32F);
cv::flann::Index flann_index(descriptors2, cv::flann::KDTreeIndexParams(6));
flann_index.knnSearch(descriptors1, m_indices, m_dists, 2, cv::flann::SearchParams(64) );


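What is the best feature detector and extractor when working with low-resolution, noisy images? Should I change any FLANN parameters depending on the feature detector used?

Edit:

I am posting some pictures of a fairly easy sequence to track. The pictures are as I give them to the feature detector methods. They have been preprocessed to remove some noise (by means of cv::bilateralFilter()).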
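
One thing worth checking, offered as a hedged observation: ORB descriptors are binary strings that are normally compared with Hamming distance, so converting them to CV_32F and searching a KD-tree treats them as Euclidean vectors, which may account for part of the drop in matches. In OpenCV the usual alternatives are a brute-force matcher with a Hamming norm or FLANN's LSH index. The NumPy sketch below only illustrates Hamming matching with a ratio test on synthetic 32-byte descriptors; `hamming_ratio_matches` is a hypothetical helper, not an OpenCV function.

```python
import numpy as np

def hamming_ratio_matches(desc1, desc2, ratio=0.8):
    """Match binary descriptors (uint8 rows) by Hamming distance + ratio test."""
    xor = desc1[:, None, :] ^ desc2[None, :, :]        # differing bits, per pair
    dist = np.unpackbits(xor, axis=2).sum(axis=2)      # (n1, n2) Hamming distances
    order = np.argsort(dist, axis=1)[:, :2]            # two nearest neighbours
    rows = np.arange(len(desc1))
    best = dist[rows, order[:, 0]]
    second = dist[rows, order[:, 1]]
    keep = best < ratio * second                       # keep unambiguous matches only
    return [(int(i), int(order[i, 0])) for i in np.nonzero(keep)[0]]

# Synthetic data: 20 random 256-bit descriptors, and 5 copies with one bit flipped.
rng = np.random.default_rng(0)
desc2 = rng.integers(0, 256, size=(20, 32), dtype=np.uint8)
desc1 = desc2[:5].copy()
desc1[:, 0] ^= 1

matches = hamming_ratio_matches(desc1, desc2)
print(matches)
```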


opencv image feature-extraction feature-detection feature-tracking

cap*_*ain

2019 09-18

5
votes

1
answer

3501
views

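Why do SURF feature points have floating-point coordinates?

I have just written C++ OpenCV 2.4.7 code that uses SurfFeatureDetector to extract feature points in stereo images. It works fine, but I was confused when I realized that the point coordinates are floating-point; for example, it finds the coordinates [283.23 123.424] as one of the features in the left image.

Here is the part of the code that simply extracts the features: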

int minHessian = 400;

SurfFeatureDetector detector(minHessian);
vector<KeyPoint> featuresLeft, featuresRight;

detector.detect(leftImg, featuresLeft);
detector.detect(rightImg, featuresRight);


Can anyone tell me how this happens? Do the built-in functions involve any interpolation?
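
Non-integer coordinates are what sub-pixel refinement produces: detectors in the SIFT/SURF family fit a quadratic to the detector response around a discrete extremum and report the interpolated peak. The sketch below shows that idea in one dimension only. It is an illustration, not OpenCV's actual code, and `subpixel_offset` and `response` are hypothetical helpers.

```python
def subpixel_offset(f_left, f_center, f_right):
    """Offset of a parabola's extremum from the centre sample, in pixels."""
    denom = f_left - 2.0 * f_center + f_right
    if denom == 0.0:
        return 0.0  # flat neighbourhood: nothing to refine
    return 0.5 * (f_left - f_right) / denom

def response(x, peak=0.3):
    """Toy detector response whose true maximum lies between pixels."""
    return -(x - peak) ** 2

# Sample the response at the integer pixel and its two neighbours.
offset = subpixel_offset(response(-1), response(0), response(1))
refined_x = 283 + offset  # integer pixel position plus fractional offset
print(offset, refined_x)
```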

opencv image-processing feature-extraction feature-detection

ami*_*unt

2015 06-21

5
votes

0
answers

668
views

Extracting product attributes/features from text

I have been assigned a task to extract features/attributes from product descriptions.

Levi Strauss slim fit jeans
Big shopping bag in pink and gold


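I need to be able to extract attributes such as "jeans" and "slim fit", or "shopping bag", "pink" and "gold". The list of product descriptions is not limited to clothing; it can be basically anything.

I am not sure how to approach this problem. I tried a named entity recognizer solution and a POS tagging solution. The NER implementation fails to recognize any of the tokens, and most tokens come out as NNP (proper noun) in the POS solution, which doesn't help me much. I need a way to distinguish brand names from the product's features (for example, whether it is a T-shirt, its color, or its design such as round neck or V-neck).

I did implement a KMeans solution that clusters similar products together, but again, that is not the result I am looking for.

I am just looking for someone to point me in the right direction.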

nlp named-entity-recognition feature-extraction named-entity-extraction

elr*_*ric

2016 03-16

5
votes

1
answer

1982
views

ValueError: Shape must be rank 1 but is rank 0 for 'ROIAlign/Crop' (op: 'CropAndResize') with input shapes: [2,360,475,3], [1,4], [], [2]

I am trying to supply all the inputs to this function but I get the error below, and I am not sure what the empty [] is. There are 2 RGB images. The original code is from https://github.com/CharlesShang/FastMaskRCNN/blob/master/libs/layers/crop.py.

Traceback (most recent call last):
  File "croptest.py", line 77, in <module>
    crop(img, boxes, batch_inds,1,7,7,'ROIAlign')
  File "croptest.py", line 64, in crop
    name='Crop')
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/ops/gen_image_ops.py", line 166, in crop_and_resize
    name=name)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2632, in create_op
    set_shapes_for_outputs(ret)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1911, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1861, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 595, in call_cpp_shape_fn
    require_shape_fn)
  File "/home/ubuntu/Desktop/WK/my_project/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 659, in …
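
A hedged reading of the message: the four shapes line up with the four inputs of CropAndResize, namely the image batch [2,360,475,3], the boxes [1,4], the box indices [], and the crop size [2]. An empty shape [] denotes a rank-0 (scalar) tensor, whereas the op expects a rank-1 tensor with one index per box. The NumPy sketch below only illustrates the rank difference; the variable names are hypothetical and it does not call TensorFlow.

```python
import numpy as np

boxes = np.array([[0.0, 0.0, 1.0, 1.0]], dtype=np.float32)  # shape (1, 4): one box

scalar_ind = np.asarray(0, dtype=np.int32)    # shape (): rank 0, printed as []
vector_ind = np.asarray([0], dtype=np.int32)  # shape (1,): rank 1, one index per box

# reshape(-1) turns a scalar into a one-element vector.
fixed_ind = scalar_ind.reshape(-1)

print(scalar_ind.shape, vector_ind.shape, fixed_ind.shape)
```

Passing the batch index as a one-element list (or reshaping it) so that its length equals the number of boxes would be the corresponding change in the calling code, assuming the scalar is indeed the `batch_inds` argument.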


python roi feature-extraction tensorflow

Go *_*t 2

2017 12-11

5
votes

1
answer

10k
views

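Bag of Words (BOW) vs N-gram (sklearn CountVectorizer) - text document classification

As far as I know, in the Bag of Words approach the features are a set of words and their frequency counts in a document. N-grams on the other hand, for example unigrams, do exactly the same but without taking the frequency of occurrence of a word into account.

I want to use sklearn and CountVectorizer to implement both the BOW and the n-gram approach.

For BOW, my code looks like this: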

CountVectorizer(ngram_range=(1, 1), max_features=3000)


Is it enough to set the 'binary' parameter to True to perform n-gram feature selection?

CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)


What are the advantages of n-grams over the BOW approach?
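
A small sketch of how the two CountVectorizer parameters behave, on two toy sentences chosen for illustration: `binary` switches between counts and presence/absence, while `ngram_range` controls how many consecutive words form a feature. The two settings are independent of each other.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]  # toy corpus (assumption)

# Unigram counts: the classic bag of words.
bow = CountVectorizer(ngram_range=(1, 1))
X_bow = bow.fit_transform(docs)

# Same unigram features, but each entry only records presence (0/1).
binary = CountVectorizer(ngram_range=(1, 1), binary=True)
X_bin = binary.fit_transform(docs)

# Unigrams plus bigrams: word order within pairs becomes visible.
bigram = CountVectorizer(ngram_range=(1, 2))
X_bi = bigram.fit_transform(docs)

print(X_bow[0, bow.vocabulary_["the"]])     # count of "the" in the first document
print(X_bin[0, binary.vocabulary_["the"]])  # presence flag for the same word
print(sorted(bigram.vocabulary_))
```

So `binary=True` with `ngram_range=(1, 1)` still yields unigram features; it only drops the frequency information.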

python feature-extraction n-gram feature-selection scikit-learn

Tal*_*kus

lucky-day

5
votes

1
answer

4488
views

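Normalization needed before SelectKBest in Python

I need to select some features from a dataset for a regression task. But the numerical values come from different ranges.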

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_boston(return_X_y=True)
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)


To improve the performance of the regression model, do I need to normalize X before the SelectKBest method?
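
A quick way to check this empirically, sketched on synthetic data rather than the Boston dataset: `f_regression` scores each feature through its correlation with the target, and correlation does not change when a column is shifted or rescaled, so the selection should come out the same with and without standardization. The column scale factors below are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
# Put the columns on very different scales (assumption for illustration).
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 0.1, 0.01])

raw = SelectKBest(f_regression, k=2).fit(X, y)
scaled = SelectKBest(f_regression, k=2).fit(StandardScaler().fit_transform(X), y)

print(raw.get_support())
print(scaled.get_support())
```

This only concerns the selection step with `f_regression`; whether the downstream regressor benefits from scaling is a separate question, and other score functions may behave differently.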

python feature-extraction

use*_*352

lucky-day

5
votes

1
answer

853
views

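How to extract a feature vector from a single image in PyTorch?

I am trying to understand more about computer vision models, and I am trying to explore how they work. To understand better how to interpret feature vectors, I am trying to use PyTorch to extract a feature vector. Below is my code, which I have pieced together from various places.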

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image



img=Image.open("Documents/01235.png")

# Load the pretrained model
model = models.resnet18(pretrained=True)

# Use the model object to select the desired layer
layer = model._modules.get('avgpool')

# Set model to evaluation mode
model.eval()

transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
def get_vector(image_name):
    # Load the image with Pillow library
    img = Image.open("Documents/Documents/Driven Data …
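
A common pattern for this, sketched on a tiny stand-in network so that it runs without downloading weights: register a forward hook on the layer of interest, run one forward pass, and read the captured output. `TinyNet` and the random input are assumptions for illustration; with ResNet-18 the same hook would be attached to its `avgpool` layer.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Small stand-in for a pretrained CNN (hypothetical, for illustration)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.avgpool(x)
        return self.fc(torch.flatten(x, 1))

model = TinyNet().eval()
captured = {}

def save_output(module, inputs, output):
    # Store the pooled activations, flattened to (batch, channels).
    captured["feat"] = output.detach().flatten(1)

handle = model.avgpool.register_forward_hook(save_output)
with torch.no_grad():
    logits = model(torch.rand(1, 3, 224, 224))  # random tensor in place of an image
handle.remove()  # detach the hook once the features are captured

feature_vector = captured["feat"]
print(feature_vector.shape)
```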


python feature-extraction computer-vision pytorch

use*_*903

lucky-day

5
votes

1
answer

2195
views

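Transformer for mixed data types

I am having trouble applying different transformers to columns of different types (text vs. numeric) at the same time and concatenating those transformers into a single transformer for later use.

I tried to follow the steps in the Column Transformer with Mixed Types documentation, which explains how to do this for a mix of categorical and numeric data, but it does not seem to work with text data.

TL;DR

How do you create a storable transformer that follows different pipelines for text and numeric data?

Data download and preparation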

# imports
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True) …
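
One arrangement that may work, sketched on a tiny hand-made frame instead of the Titanic download: text vectorizers expect a 1-D sequence of strings, so each text column is passed to ColumnTransformer as a single column name (a string, not a list) with its own TfidfVectorizer, while the numeric columns are passed as a list. The toy rows and the step names (`num`, `tfidf_0`, ...) are illustrative assumptions.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Titanic columns (assumption for illustration).
X = pd.DataFrame({
    "age": [22.0, None, 35.0],
    "fare": [7.25, 71.3, 8.05],
    "name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "home.dest": ["", "New York, NY", "London"],
})
numeric_features = ["age", "fare"]
text_features = ["name", "home.dest"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# A list selects a 2-D block; a bare string selects a 1-D column for the vectorizer.
preprocessor = ColumnTransformer(
    [("num", numeric_pipe, numeric_features)]
    + [(f"tfidf_{i}", TfidfVectorizer(), col) for i, col in enumerate(text_features)]
)

Xt = preprocessor.fit_transform(X)

# The fitted transformer is a single object, so it can be pickled for later use.
restored = pickle.loads(pickle.dumps(preprocessor))
print(Xt.shape, restored.transform(X).shape)
```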


python feature-extraction scikit-learn

kil*_*out

lucky-day

5
votes

1
answer

176
views