Converting a CSV file to a LIBSVM-compatible data file using Python

use*_*649 5 python java csv libsvm

I am working on a project with libsvm and am preparing my data for use with the library. How can I convert a CSV file into LIBSVM-compatible data?

CSV file: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/iris.csv

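From the FAQ:

How do I convert other data formats to LIBSVM format?

It depends on your data format. A simple way is to use libsvmwrite in the libsvm matlab/octave interface. Take a CSV (comma-separated values) file from the UCI machine learning repository as an example. We download SPECTF.train. The labels are in the first column. The following steps produce a file in libsvm format.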

matlab> SPECTF = csvread('SPECTF.train'); % read a csv file
matlab> labels = SPECTF(:, 1); % labels from the 1st column
matlab> features = SPECTF(:, 2:end); 
matlab> features_sparse = sparse(features); % features must be in a sparse matrix
matlab> libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);
The transformed data are stored in SPECTFlibsvm.train.
Alternatively, you can use convert.c to convert CSV format to libsvm format.

But I don't want to use MATLAB; I use Python.

I also found this solution using Java.

Can anyone recommend a way to solve this problem?

eme*_*eth 7

You can use csv2libsvm.py to convert the CSV file to libsvm data:

python csv2libsvm.py iris.csv libsvm.data 4 True

Here 4 is the target index (the column that holds the label), and True means the CSV file has a header row.

The result is libsvm.data:

0 1:5.1 2:3.5 3:1.4 4:0.2
0 1:4.9 2:3.0 3:1.4 4:0.2
0 1:4.7 2:3.2 3:1.3 4:0.2
0 1:4.6 2:3.1 3:1.5 4:0.2
...

iris.csv

150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
...
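To make the mapping between the two listings above concrete, here is a minimal sketch of the same row-by-row conversion using only the standard library. The helper name `csv_to_libsvm_lines` and its parameters are hypothetical and are not part of csv2libsvm.py; this sketch also writes every value, including zeros, which a real converter may choose to omit.

```python
import csv
import io


def csv_to_libsvm_lines(csv_text, label_index, has_header):
    """Yield one libsvm-format line per CSV row (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader)  # drop the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Remove the label column; the remaining columns are the features.
        label = row.pop(label_index)
        # Feature indices are 1-based in libsvm format.
        features = ["%d:%s" % (i, value) for i, value in enumerate(row, start=1)]
        yield " ".join([label] + features)


sample = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
"""

for line in csv_to_libsvm_lines(sample, label_index=4, has_header=True):
    print(line)
# prints:
# 0 1:5.1 2:3.5 3:1.4 4:0.2
# 0 1:4.9 2:3.0 3:1.4 4:0.2
```

The label column is popped before the features are numbered, which is why the four measurements end up with indices 1 to 4 even though the label sits at column index 4.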


Mem*_*min 5

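csv2libsvm.py does not work with Python 3, and it does not support label targets (string targets) either, so I modified it slightly. It should now work with Python 3 and with label targets as well. I am very new to Python, so my code may not follow best practices, but I hope it is good enough to help someone.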

#!/usr/bin/env python

"""
Convert CSV file to libsvm format. Works only with numeric variables.
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.
"""

import sys
import csv


def construct_line(label, line, labels_dict):
    new_line = []
    try:
        # Numeric label: keep it as is, normalizing zero to "0".
        if float(label) == 0.0:
            label = "0"
        new_line.append(label)
    except ValueError:
        # String label: assign each distinct label the next integer id.
        if label not in labels_dict:
            labels_dict[label] = str(len(labels_dict))
        new_line.append(labels_dict[label])

    for i, item in enumerate(line):
        # Empty and zero-valued features are omitted (the format is sparse).
        if item == '' or float(item) == 0.0:
            continue
        elif item == 'NaN':
            item = "0.0"
        new_line.append("%s:%s" % (i + 1, item))
    return " ".join(new_line) + "\n"

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file + ".out"

try:
    label_index = int(sys.argv[3])
except IndexError:
    label_index = 0

try:
    # Any non-empty argument (e.g. 1 or True) turns header skipping on.
    skip_headers = sys.argv[4]
except IndexError:
    skip_headers = 0

with open(input_file, 'rt', newline='') as i, open(output_file, 'wb') as o:
    reader = csv.reader(i)

    if skip_headers:
        headers = next(reader)

    labels_dict = {}
    for line in reader:
        if label_index == -1:
            label = '1'
        else:
            label = line.pop(label_index)

        new_line = construct_line(label, line, labels_dict)
        o.write(new_line.encode('utf-8'))
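After converting, it can help to sanity-check a few lines of the output before handing the file to libsvm. The sketch below is only an illustration: `parse_libsvm_line` is a hypothetical helper, not part of libsvm or csv2libsvm.py. It splits a line into its label and an index-to-value mapping, and rejects lines whose feature indices are not in ascending order, which the libsvm format expects.

```python
def parse_libsvm_line(line):
    """Split one libsvm-format line into (label, {index: value})."""
    parts = line.split()
    label = parts[0]
    features = {}
    previous_index = None
    for token in parts[1:]:
        # Each feature token has the form "index:value".
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if previous_index is not None and index <= previous_index:
            raise ValueError("feature indices must be ascending: %r" % line)
        features[index] = float(value_text)
        previous_index = index
    return label, features


label, features = parse_libsvm_line("0 1:5.1 2:3.5 3:1.4 4:0.2")
print(label, features)
```

Running it over the first few lines of the generated file is usually enough to catch a wrong label index or a header row that was not skipped.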