How to prepare training data for image classification

lma*_*aki 2 python machine-learning pandas scikit-learn data-science

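I am new to machine learning and have run into some problems with image classification. I am trying to tell cats from dogs using a simple classifier, K-nearest neighbors.

My code so far: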

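import os  # needed for os.path.join and os.listdir below

import cv2  # needed for cv2.imread and cv2.resize below
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt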

%matplotlib inline

DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
CATEGORIES = ['Dog', 'Cat']

IMG_SIZE = 30
data = []
categories = []

for category in CATEGORIES:
    path = os.path.join(DATADIR, category) 
    categ_id = CATEGORIES.index(category)
    for img in os.listdir(path):
        try:
            img_array = cv2.imread(os.path.join(path,img), 0)
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            data.append(new_array)
            categories.append(categ_id)
        except Exception as e:
            # print(e)
            pass

print(data[0])


s1 = pd.Series(data)
s2 = pd.Series(categories)
frame = {'Img array': s1, 'category': s2}
df = pd.DataFrame(frame) 


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

Here I get an error when I try to fit the data:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-76-9d98d7b11202> in <module>
      2 from sklearn.neighbors import KNeighborsClassifier
      3 
----> 4 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
      5 
      6 print(X_train)

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2094         raise TypeError("Invalid parameters passed: %s" % str(options))
   2095 
-> 2096     arrays = indexable(*arrays)
   2097 
   2098     n_samples = _num_samples(arrays[0])

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    228         else:
    229             result.append(np.array(X))
--> 230     check_consistent_length(*result)
    231     return result
    232 

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    203     if len(uniques) > 1:
    204         raise ValueError("Found input variables with inconsistent numbers of"
--> 205                          " samples: %r" % [int(l) for l in lengths])
    206 
    207 

ValueError: Found input variables with inconsistent numbers of samples: [24946, 22451400]

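How do I prepare the training data correctly? By the way, I do not want to use deep learning yet; that will be my next step.

Any help here would be appreciated.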

Raj*_*oon 5

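If you are not using deep learning for image classification, you have to prepare the data in a form suitable for supervised classification.

Steps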

1) Resize all images to the same size. You can loop over the images, resize each one, and save it.

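2) Get the pixel vector of each image and build a data set. For example, if your cat images are in the "Cat" folder and your dog images are in the "Dog" folder, iterate over all the images in each folder and read their pixel values. At the same time, label the data as "cat" (cat=1) and "non-cat" (non-cat=0).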

import os
import imageio
import pandas as pd

# Assumes every image already has the same size (step 1);
# otherwise the flattened vectors would have different lengths.
catimages = os.listdir("Cat")
dogimages = os.listdir("Dog")
catVec = []
dogVec = []

# Flatten each cat image into a 1-D pixel vector; cats get label 1.
for fname in catimages:
    img = imageio.imread(os.path.join("Cat", fname))
    catVec.append(img.flatten())
catdf = pd.DataFrame(catVec)
catdf.insert(loc=0, column="label", value=1)

# Do the same for the dog images; dogs get label 0.
for fname in dogimages:
    img = imageio.imread(os.path.join("Dog", fname))
    dogVec.append(img.flatten())
dogdf = pd.DataFrame(dogVec)
dogdf.insert(loc=0, column="label", value=0)

3) Concatenate catdf and dogdf and shuffle the DataFrame.

data = pd.concat([catdf,dogdf])      
data = data.sample(frac=1)

Now you have a data set with image labels.

4) Split the data set into training and test sets and fit the model.
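A minimal sketch of step 4, using `train_test_split` and `KNeighborsClassifier` as in the question. The DataFrame here is synthetic random data, assumed to have the same layout as the one built above (a "label" column followed by one column per pixel), so the printed accuracy means nothing. The point is the shapes: `X` needs one row per image and `y` one label per image. The numbers in the question's error are consistent with `X` having been flattened into a single long array, since 24946 * 900 = 22451400.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the shuffled DataFrame from step 3:
# a "label" column followed by 900 pixel columns (30 x 30 images).
rng = np.random.default_rng(0)
n_samples, n_pixels = 200, 900
data = pd.DataFrame(rng.integers(0, 256, size=(n_samples, n_pixels)))
data.insert(loc=0, column="label", value=rng.integers(0, 2, size=n_samples))

# Features: one row per image. Target: one label per image.
X = data.drop(columns="label").to_numpy()
y = data["label"].to_numpy()
print(X.shape, y.shape)  # (200, 900) (200,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)  # accuracy on the held-out images
print(score)
```

If the pixel vectors are kept in a plain Python list instead of a DataFrame, `np.stack(vectors)` or `np.array(vectors).reshape(len(vectors), -1)` gives the same two-dimensional layout.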