Split a dictionary whose values are lists of lists into training and test sets in Python

Jar*_*red 1 python dictionary split list

Suppose I have a dictionary with two keys, spam and ham, holding spam and ham (legitimate) texts or emails, like this:

data = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'], 
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'], 
        ['sexy', 'girls', 'for', 'youuuuuu']], 
    'ham': [['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'], 
        ['Madden', 'MUT', 'time', 'boys']]}

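I want to split the dictionary into a training set and a test set (starting with an 80/20 train/test split). The split should ignore the keys, so the training set is simply 80% of all messages and the test set is the remaining 20%. In this small example there are 5 messages in total (3 spam and 2 ham). I have looked around for a solution but haven't found anything that handles this kind of case.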

Ale*_*ile 5

Use the aptly named sklearn.model_selection.train_test_split():

from sklearn.model_selection import train_test_split

data = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
            ['Madden', 'MUT', 'time', 'boys']]}

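# Flatten the dict into (words, label) pairs so the split ignores the keys.
all_messages = [(words, k) for k, v in data.items() for words in v]

# Hold out 20% of all messages for testing; the rest is the training set.
train, test = train_test_split(all_messages, test_size=0.2)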
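For intuition, the same flatten-then-split idea can be sketched with only the standard library. This is a rough illustration, not how scikit-learn implements it; the helper name split_messages and its seed parameter are made up for this example, and the test size is rounded up so that 20% of 5 messages gives 1 test message.

```python
import math
import random

data = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
            ['Madden', 'MUT', 'time', 'boys']]}


def split_messages(data, test_size=0.2, seed=None):
    """Hypothetical helper: flatten {label: [messages]} and return (train, test)."""
    # One (words, label) pair per message, regardless of which key it came from.
    pairs = [(words, label) for label, msgs in data.items() for words in msgs]
    # A private Random instance keeps the shuffle repeatable when seed is given.
    random.Random(seed).shuffle(pairs)
    # Round the test size up so a small dataset still gets a test message.
    n_test = math.ceil(len(pairs) * test_size)
    return pairs[n_test:], pairs[:n_test]


train, test = split_messages(data, test_size=0.2, seed=0)
print(len(train), len(test))
```

Passing a fixed seed makes the split repeatable between runs; leaving it as None gives a different split each time.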

You can, and probably should, use something more robust, such as pandas:

import pandas as pd
from sklearn.model_selection import train_test_split

data_dict = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
            ['Madden', 'MUT', 'time', 'boys']]}

df = pd.DataFrame(data=((k, words) for k, v in data_dict.items() for words in v))

print(df)

train, test = train_test_split(df, test_size=0.2)

print(train)
print(test)

Output (the split is random, so which rows land in each set will vary between runs):

      0                                                  1
0  spam               [hi, what's, going, on, sexy, thing]
1  spam           [1-800, call, girls, if, you're, lonely]
2  spam                       [sexy, girls, for, youuuuuu]
3   ham  [hey, hey, I, got, your, message,, I'll, be, h...
4   ham                          [Madden, MUT, time, boys]

      0                                                  1
1  spam           [1-800, call, girls, if, you're, lonely]
2  spam                       [sexy, girls, for, youuuuuu]
0  spam               [hi, what's, going, on, sexy, thing]
3   ham  [hey, hey, I, got, your, message,, I'll, be, h...

     0                          1
4  ham  [Madden, MUT, time, boys]
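If the two subsets are needed back in the original dictionary shape, the rows can be regrouped by label. The sketch below is an assumption about what might be wanted next, not part of the answer above; group_by_label is a hypothetical helper, and the pairs are copied by hand from the example split printed above so that the snippet runs without pandas.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Hypothetical helper: rebuild a {label: [messages]} dict from (label, words) pairs."""
    grouped = defaultdict(list)
    for label, words in pairs:
        grouped[label].append(words)
    return dict(grouped)


# Hand-copied from the example split shown above (rows 1, 2, 0, 3 and row 4).
train_pairs = [
    ('spam', ['1-800', 'call', 'girls', 'if', "you're", 'lonely']),
    ('spam', ['sexy', 'girls', 'for', 'youuuuuu']),
    ('spam', ['hi', "what's", 'going', 'on', 'sexy', 'thing']),
    ('ham', ['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!']),
]
test_pairs = [('ham', ['Madden', 'MUT', 'time', 'boys'])]

train_dict = group_by_label(train_pairs)
test_dict = group_by_label(test_pairs)
print(sorted(train_dict), sorted(test_dict))
```

With the DataFrames from the answer, the pairs could instead come from train.itertuples(index=False, name=None), which yields one plain (label, words) tuple per row.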