Loading a custom dataset (similar to the 20 newsgroups set) in scikit-learn for text document classification

Fly*_*ura 7 python nlp machine-learning dataset scikit-learn

I am trying to run this scikit example code on my custom dataset of TED Talks. Each directory is a topic and contains text files, each holding the description of one TED Talk.

This is what my dataset's tree structure looks like. As you can see, each directory is a topic, with the description text files beneath it.

Topics/
|-- Activism
|   |-- 1149.txt
|   |-- 1444.txt
|   |-- 157.txt
|   |-- 1616.txt
|   |-- 1706.txt
|   |-- 1718.txt
|-- Adventure
|   |-- 1036.txt
|   |-- 1777.txt
|   |-- 2930.txt
|   |-- 2968.txt
|   |-- 3027.txt
|   |-- 3290.txt
|-- Advertising
|   |-- 3673.txt
|   |-- 3685.txt
|   |-- 6567.txt
|   `-- 6925.txt
|-- Africa
|   |-- 1045.txt
|   |-- 1072.txt
|   |-- 1103.txt
|   |-- 1112.txt
|-- Aging
|   |-- 1848.txt
|   |-- 2495.txt
|   |-- 2782.txt
|-- Agriculture
|   |-- 3469.txt
|   |-- 4140.txt
|   |-- 4733.txt
|   |-- 4939.txt

I arranged my dataset this way so that it resembles the 20 newsgroups dataset, whose tree structure looks like this:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119

|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

In the original code (lines 98-124), this is how the training and test data are loaded directly from scikit:

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Because this dataset ships with scikit, its labels are built in. In my case, I know how to load the dataset (line 84):

dataset = load_files('./TED_dataset/Topics/')

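I don't know what to do after that. I would like to know how to split this data into training and test sets, and how to produce these from my dataset along with their labels: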

data_train.data,  data_test.data 

In short, I just want to load my dataset and run it through this code without errors. I have uploaded the dataset here for anyone who wants to look at it.

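I have already looked at this question, which briefly discusses loading train and test sets. I would also like to know how to get data_train.target_names from my dataset.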

Edit:

I tried to get train and test sets like this, but it returns an error:

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

The updated code is here.

Rol*_*Max 10

I think you are looking for something like this:

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.
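The "train your model and validate" step above could look roughly like the following. This is only a minimal sketch under stated assumptions: the toy corpus and temporary directory are hypothetical stand-ins for `./Topics`, the files are assumed to be UTF-8, and `TfidfVectorizer` with `RidgeClassifier` is just one reasonable choice of features and model, not necessarily what the linked example settles on.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for ./Topics: one folder per topic.
toy_corpus = {
    "Activism": [
        "protest march for civil rights",
        "grassroots campaign for rights",
        "activists organize a peaceful protest",
        "community campaign and protest",
    ],
    "Agriculture": [
        "farmers grow crops in rich soil",
        "sustainable farming and crops",
        "soil health improves crop yield",
        "irrigation helps farmers grow food",
    ],
}
root = Path(tempfile.mkdtemp()) / "Topics"
for topic, texts in toy_corpus.items():
    (root / topic).mkdir(parents=True)
    for i, text in enumerate(texts):
        (root / topic / f"{i}.txt").write_text(text, encoding="utf-8")

# Passing encoding makes bunch.data a list of str rather than bytes
# (assumption: the files are UTF-8 encoded).
bunch = load_files(str(root), encoding="utf-8")

# Stratify so that every topic appears in both the train and test split.
X_train, X_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target,
    test_size=0.25, stratify=bunch.target, random_state=42,
)

# Fit the vectorizer on the training documents only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = RidgeClassifier().fit(X_train_vec, y_train)
predicted = clf.predict(X_test_vec)

# Map predicted integer labels back to topic (folder) names.
print([bunch.target_names[i] for i in predicted])
```

With many topics and only a handful of files per topic, as in the tree shown in the question, a stratified split may fail for topics that have a single file, so dropping `stratify` or merging very small topics may be necessary.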

Note that bunch.target is an array of integers; each one is an index into the category names stored in bunch.target_names.

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
 'Human growth has strained the Earth\'s resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\'s many overlapping ecosystems.If Earth is a self-regulating system, it\'s clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
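If the goal is to keep the rest of the example script unchanged, one option is to wrap the two splits in objects that expose the same attributes the script reads (`data`, `target`, `target_names`). The sketch below is an assumption-laden illustration: `split_bunch` is a hypothetical helper, not part of scikit-learn, and the small in-memory `dataset` stands in for the result of `load_files('./TED_dataset/Topics/')`. It also sidesteps the error from the question's edit, since `train_test_split` is given the document list and label array rather than the `Bunch` itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import Bunch


def split_bunch(bunch, test_size=0.2, random_state=42):
    """Hypothetical helper: split a load_files-style Bunch into train/test Bunches."""
    docs_train, docs_test, y_train, y_test = train_test_split(
        bunch.data, bunch.target,
        test_size=test_size, random_state=random_state,
    )
    names = list(bunch.target_names)
    data_train = Bunch(data=docs_train, target=y_train, target_names=names)
    data_test = Bunch(data=docs_test, target=y_test, target_names=names)
    return data_train, data_test


# Hypothetical in-memory stand-in for load_files('./TED_dataset/Topics/').
dataset = Bunch(
    data=["talk a", "talk b", "talk c", "talk d", "talk e"],
    target=np.array([0, 1, 0, 1, 0]),
    target_names=["Activism", "Adventure"],
)

data_train, data_test = split_bunch(dataset)

# The same names the original script uses after fetch_20newsgroups.
y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names
print(len(data_train.data), len(data_test.data), categories)
```

Both splits share the same `target_names` list, so an integer label means the same topic in the training and test sets.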