Python - Categorize items in a list based on where they occur in a dictionary of lists

Dar*_*ech 3 python dictionary list

I have a dataset like this (simplified):

foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']

I have a list of items that I want to categorize:

items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

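I'd like to split the entries of the items list into smaller lists based on where they occur in foods_dict. I think these sub-lists should actually be sets, because I don't want any duplicates in them.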

My first pass at the code looks like this:

fruits = set()
veggies = set()
others = set()
for item in items:
    if item in foods_dict.get('fruit'):
        fruits.add(item)
    elif item in foods_dict.get('veg'):
        veggies.add(item)
    else:
        others.add(item)

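But this seems inefficient and unnecessarily verbose to me. My question is: how can I improve this code? I'm guessing a list comprehension could be useful here, but I'm not sure how to apply one given the number of lists.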

Any help is much appreciated.

Bak*_*riu 5

For an efficient solution, you want to avoid explicit loops as much as possible:

items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies

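This is almost certainly faster than using explicit loops. In particular, a check like item in foods_dict['fruit'] is time-consuming if the list of fruits is long, because membership testing on a list scans it element by element.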
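The question also mentions being unsure how to handle the number of lists. As a minimal sketch, the same set operations can be wrapped in a helper that works for any number of categories. The function name categorize and its return shape are assumptions made for illustration; they are not part of the original answer.

```python
def categorize(items, categories):
    """Split items into one set per category, plus a set of leftovers.

    Hypothetical helper: `categories` maps a category name to a list
    of members, like foods_dict in the question.
    """
    remaining = set(items)  # deduplicate the input once
    result = {}
    for name, members in categories.items():
        # Set intersection replaces the per-item scan of a list.
        result[name] = remaining & set(members)
    # Whatever no category claimed is left over.
    others = remaining.difference(*result.values())
    return result, others


foods_dict = {
    'fruit': ['apple', 'orange', 'plum'],
    'veg': ['cabbage', 'potato', 'carrot'],
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

grouped, others = categorize(items, foods_dict)
```

Note that, as with the set-based snippet above, an item listed under two categories would end up in both result sets, whereas the if/elif loop in the question assigns it only to the first match.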


A simple benchmark of the solutions posted so far:

In [5]: %%timeit
   ...: items2 = set(items)
   ...: fruits = set(foods_dict['fruit']) & items2
   ...: veggies = set(foods_dict['veg']) & items2
   ...: others = items2 - fruits - veggies
   ...: 
1000000 loops, best of 3: 1.75 us per loop

In [6]: %%timeit
   ...: fruits = set()
   ...: veggies = set()
   ...: others = set()
   ...: for item in items:
   ...:     if item in foods_dict.get('fruit'):
   ...:         fruits.add(item)
   ...:     elif item in foods_dict.get('veg'):
   ...:         veggies.add(item)
   ...:     else:
   ...:         others.add(item)
   ...: 
100000 loops, best of 3: 2.57 us per loop

In [7]: %%timeit
   ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
   ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
   ...: others = set(items) - veggies - fruits
   ...: 
100000 loops, best of 3: 3.34 us per loop

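Of course, before choosing one you should run some tests with your real inputs. I don't know how many elements your problem involves, and the timings may change a lot with bigger inputs. Anyway, my experience tells me that, at least in CPython, explicit loops tend to be slower than using only built-in operations.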
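For testing with your own data outside IPython, the standard timeit module can play the role of %%timeit. The following is only a sketch: the function names with_loop and with_sets are made up for illustration, the sample data is the small example from the question, and the lambda wrapper adds a little call overhead to every measurement.

```python
import timeit

foods_dict = {
    'fruit': ['apple', 'orange', 'plum'],
    'veg': ['cabbage', 'potato', 'carrot'],
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']


def with_loop(items, foods_dict):
    # The explicit loop from the question.
    fruits, veggies, others = set(), set(), set()
    for item in items:
        if item in foods_dict['fruit']:
            fruits.add(item)
        elif item in foods_dict['veg']:
            veggies.add(item)
        else:
            others.add(item)
    return fruits, veggies, others


def with_sets(items, foods_dict):
    # The set-based version from this answer.
    items2 = set(items)
    fruits = set(foods_dict['fruit']) & items2
    veggies = set(foods_dict['veg']) & items2
    return fruits, veggies, items2 - fruits - veggies


# Make sure both variants agree before comparing their speed.
assert with_loop(items, foods_dict) == with_sets(items, foods_dict)

number = 1000
for func in (with_loop, with_sets):
    # Take the best of three runs, as %%timeit reports.
    best = min(timeit.repeat(lambda: func(items, foods_dict),
                             number=number, repeat=3))
    print(f"{func.__name__}: {best / number * 1e6:.2f} us per call")
```

Replace foods_dict and items with a sample of your real data; the absolute numbers depend on the machine and Python version.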


Edit2: An example with bigger inputs:

In [9]: foods_dict = {}
   ...: foods_dict['fruit'] = list(range(0, 10000, 2))
   ...: foods_dict['veg'] = list(range(1, 10000, 2))

In [10]: items = list(range(5, 10000, 13))  #some odd some even

In [11]: %%timeit
    ...: fruits = set()
    ...: veggies = set()
    ...: others = set()
    ...: for item in items:
    ...:     if item in foods_dict.get('fruit'):
    ...:         fruits.add(item)
    ...:     elif item in foods_dict.get('veg'):
    ...:         veggies.add(item)
    ...:     else:
    ...:         others.add(item)
    ...: 
10 loops, best of 3: 68.8 ms per loop

In [12]: %%timeit
    ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
    ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
    ...: others = set(items) - veggies - fruits
    ...: 
10 loops, best of 3: 99.9 ms per loop

In [13]: %%timeit
    ...: items2 = set(items)
    ...: fruits = set(foods_dict['fruit']) & items2
    ...: veggies = set(foods_dict['veg']) & items2
    ...: others = items2 - fruits - veggies
    ...: 
1000 loops, best of 3: 445 us per loop

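As you can see, using only built-ins is far faster than the explicit loops on this input: 445 us versus 68.8 ms, which is roughly 150 times faster.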
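Much of the loop's cost on this larger input likely comes from the in checks scanning 5000-element lists. If a single explicit pass is still preferred, for example to keep first-match semantics under control, one option is to build a reverse lookup once and then do dictionary lookups. This is a sketch that was not part of the benchmark above, and the names category_of and grouped are assumptions, so time it on real data before relying on it.

```python
foods_dict = {
    'fruit': list(range(0, 10000, 2)),
    'veg': list(range(1, 10000, 2)),
}
items = list(range(5, 10000, 13))  # some odd, some even

# Build the reverse index once: member -> category name.
# If a member appears in several categories, the last one wins.
category_of = {
    member: name
    for name, members in foods_dict.items()
    for member in members
}

grouped = {name: set() for name in foods_dict}
others = set()
for item in items:
    name = category_of.get(item)  # dictionary lookup, no list scan
    if name is None:
        others.add(item)
    else:
        grouped[name].add(item)
```

The result should match the set-intersection version whenever the categories do not overlap.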