Grouping items by string pattern in Python

Question

Bru*_*SXS 2 python regex iteration grouping

Suppose I have this list:

list1=["House of Mine (1293) Item 21",
       "House of Mine (1292) Item 24",
       "The yard (1000) Item 1 ",
       "The yard (1000) Item 2 ",
       "The yard (1000) Item 4 "]


I want to put items into the same group (here, a list inside a list) whenever their substring up to (XXXX) is the same.

So, in this case, I expect to get:

[["House of Mine (1293) Item 21",
  "House of Mine (1292) Item 24"],

 ["The yard (1000) Item 1 ",
  "The yard (1000) Item 2 ",
  "The yard (1000) Item 4 "]]


The following code is what I managed to write, but it does not work:

def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)

    return group


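I have read other threads on Stack Overflow and the regular expressions page http://www.diveintopython3.net/regular-expressions.html

I know the answer is in there, but I am having trouble understanding some of its concepts.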

Answer 1

Jan*_*sky 7

Set up the list to be grouped:

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]


Define a function to use as the key for sorting and grouping the items (this time using the number in the parentheses):

>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'
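One thing worth knowing about this key: text.split("(")[1] raises IndexError when a string contains no "(". Below is a minimal sketch of a more forgiving variant, assuming an empty string is an acceptable key for such items. The name number_key is made up for this example and is not part of the answer.

```python
def number_key(text):
    """Return the text between the first "(" and the next ")", or "" if absent."""
    # str.partition never raises: when the separator is missing it returns
    # (text, "", ""), so the chained call simply yields an empty string.
    after_paren = text.partition("(")[2]
    return after_paren.partition(")")[0]


print(number_key("House of Mine (1293) Item 21"))  # 1293
print(repr(number_key("no parentheses here")))     # ''
```

Whether an empty key is the right fallback depends on the data; raising an error may be preferable if every item is supposed to carry a number.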


Sort the list (in place here):

>>> list1.sort(key=keyf)  # as Adam Smith noted, a plain alphabetical list1.sort() is also good enough here
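The sort matters because groupby only merges neighbouring items whose keys are equal; it does not collect equal keys from across the whole list. A small sketch of the difference, using made-up sample data (items and prefix are names chosen for this illustration only):

```python
from itertools import groupby

items = ["b (2)", "a (1)", "b (3)", "a (4)"]  # hypothetical data, deliberately interleaved
prefix = lambda text: text.split("(")[0]      # text before the first "("

# Without sorting, "a" and "b" alternate, so every item becomes its own group.
unsorted_groups = [list(g) for _, g in groupby(items, key=prefix)]

# After sorting, equal keys sit next to each other and are merged.
sorted_groups = [list(g) for _, g in groupby(sorted(items), key=prefix)]

print(unsorted_groups)  # [['b (2)'], ['a (1)'], ['b (3)'], ['a (4)']]
print(sorted_groups)    # [['a (1)', 'a (4)'], ['b (2)', 'b (3)']]
```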


Import groupby from itertools:

>>> from itertools import groupby


Check the concept:

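>>> for gr, items in groupby(list1, key=keyf):
...     print("gr %s" % gr)
...     print("items %s" % list(items))
...
gr 1000
items ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']
gr 1292
items ['House of Mine (1292) Item 24']
gr 1293
items ['House of Mine (1293) Item 21']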
>>> list1
['The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ',
 'House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21']


Note that we have to call list on items, because items is only an iterator over the members of the group.
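To see why the list call has to happen inside the loop, here is a small sketch with made-up data (data and first_char are illustrative names). The group iterators share the underlying input with groupby, so once groupby moves on to the next group, the previous group can no longer be read.

```python
from itertools import groupby

data = ["a1", "a2", "b1", "b2"]  # hypothetical sample data
first_char = lambda s: s[0]

# Materialize each group while iterating: this is the safe pattern.
good = [(key, list(members)) for key, members in groupby(data, key=first_char)]

# Storing the group iterators and reading them later loses data, because
# advancing groupby to the next group invalidates the previous one.
stored = list(groupby(data, key=first_char))
late_first_group = list(stored[0][1])

print(good)              # [('a', ['a1', 'a2']), ('b', ['b1', 'b2'])]
print(late_first_group)  # []
```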

Now use a list comprehension:

>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 '],
 ['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21']]


We are done.

If you want to group by all the text before the first "(", the only change is:

>>> keyf = lambda text: text.split("(")[0]


Short version answering the OP

>>> from itertools import groupby
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
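The output requested in the question keeps "(1293) Item 21" ahead of "(1292) Item 24", that is, the original order inside each group, whereas a plain sorted(list1) reorders them alphabetically. A possible sketch that keeps the input order within groups, relying on the fact that Python's sort is stable: sort by the grouping key only, so items with equal keys are never swapped.

```python
from itertools import groupby

list1 = ["House of Mine (1293) Item 21",
         "House of Mine (1292) Item 24",
         "The yard (1000) Item 1 ",
         "The yard (1000) Item 2 ",
         "The yard (1000) Item 4 "]

keyf = lambda text: text.split("(")[0]  # text before the first "("

# sorted() is stable, so items that share a key stay in their original order.
groups = [list(items) for _, items in groupby(sorted(list1, key=keyf), key=keyf)]

for group in groups:
    print(group)
```

The groups themselves still come out ordered by key, not by first appearance in the input.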


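Variation using `re.findall`

The solution above assumes that "(" is the delimiter and ignores the requirement that four digits follow it. That requirement can be handled with re.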

>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '
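The (?=...) part of the pattern is a lookahead: it requires a parenthesized four-digit number to follow the match, without making that number part of the match. A short sketch comparing the pattern with and without the lookahead (the variable names are illustrative):

```python
import re

text = "House of Mine (1293) Item 21"

# With the lookahead, "(1293)" must follow the match but is not part of it.
with_lookahead = re.findall(r".+(?=\(\d{4}\))", text)

# Without the lookahead, the parenthesized number is consumed and returned too.
without_lookahead = re.findall(r".+\(\d{4}\)", text)

print(with_lookahead)     # ['House of Mine ']
print(without_lookahead)  # ['House of Mine (1293)']
```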


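But it raises IndexError: list index out of range if the text does not contain the expected pattern (we are trying to access the item at index 0 of an empty list).

>>> text = "nothing here"
>>> keyf(text)
IndexError: list index out of range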


We can survive this with a simple trick: append the original text to the result so that the list always contains something:

>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'
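If the list-concatenation trick feels too terse, roughly the same behaviour can be spelled out with re.search and an explicit check for a missing match. This is only a sketch; PREFIX and prefix_key are names invented for this example.

```python
import re

# Same pattern as above, compiled once for reuse.
PREFIX = re.compile(r".+(?=\(\d{4}\))")


def prefix_key(text):
    """Return the text before a "(dddd)" marker, or the whole text if none is found."""
    match = PREFIX.search(text)
    return match.group(0) if match else text


print(repr(prefix_key("House of Mine (1293) Item 21")))  # 'House of Mine '
print(repr(prefix_key("nothing here")))                  # 'nothing here'
```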


Final solution using re

>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
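As an alternative to sorting plus groupby (a different technique from the one in this answer), the items can be collected into a dictionary keyed by the prefix. A minimal sketch, assuming Python 3.7 or later so that dictionaries keep insertion order; group_by_prefix is a hypothetical helper name.

```python
import re
from collections import defaultdict


def group_by_prefix(strings):
    """Group strings by the text that precedes a "(dddd)" marker."""
    groups = defaultdict(list)
    for text in strings:
        match = re.search(r".+(?=\(\d{4}\))", text)
        key = match.group(0) if match else text  # fall back to the whole text
        groups[key].append(text)
    # Groups appear in order of first occurrence; items keep their input order.
    return list(groups.values())


list1 = ["House of Mine (1293) Item 21",
         "House of Mine (1292) Item 24",
         "The yard (1000) Item 1 ",
         "The yard (1000) Item 2 ",
         "The yard (1000) Item 4 "]

result = group_by_prefix(list1)
for group in result:
    print(group)
```

This needs no sort and works even when items with the same prefix are not adjacent in the input.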


Archived: 11 years, 6 months ago
Views: 4709
Last recorded: 11 years, 6 months ago