Bru*_*SXS 2 python regex iteration grouping
Suppose I have this list:
list1=["House of Mine (1293) Item 21",
"House of Mine (1292) Item 24",
"The yard (1000) Item 1 ",
"The yard (1000) Item 2 ",
"The yard (1000) Item 4 "]
If the substring up to the (XXXX) part is the same, I want to add each such item to one group (in this case, a list within a list).
So, in this case, I would like to get:
[["House of Mine (1293) Item 21",
"House of Mine (1292) Item 24"],
["The yard (1000) Item 1 ",
"The yard (1000) Item 2 ",
"The yard (1000) Item 4 "]]
The following code is what I was able to come up with, but it does not work:
def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)
    return group
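I have read another thread on Stack Overflow and the regular expressions page at http://www.diveintopython3.net/regular-expressions.html
I know the answer is in there, but I am having a hard time understanding some of its concepts.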
Set up the list to be grouped:
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
Define a key function for grouping the items (this time using the number in the parentheses):
>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'
Sort the list (in place here):
>>> list1.sort() #As Adam Smith noted, alphabetical sort is good enough
Get groupby from itertools:
>>> from itertools import groupby
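The sort matters because groupby only merges neighbouring items whose keys are equal; equal keys that are not adjacent end up in separate groups. A minimal sketch of that behaviour, using made-up sample strings and a helper named key_number (both are illustrative and not part of the question):

```python
from itertools import groupby

# Hypothetical sample data, deliberately left unsorted
items = ["b (2000) x", "a (1000) y", "b (2000) z"]

def key_number(text):
    # Same idea as keyf: the text between the first "(" and the next ")"
    return text.split("(")[1].split(")")[0]

# Without sorting, the two "2000" items are not adjacent, so they are split up
unsorted_groups = [list(g) for _, g in groupby(items, key=key_number)]

# After sorting by the same key, equal keys sit next to each other
ordered = sorted(items, key=key_number)
sorted_groups = [list(g) for _, g in groupby(ordered, key=key_number)]

print(unsorted_groups)  # three groups of one item each
print(sorted_groups)    # two groups
```

In the question's data a plain alphabetical sort already puts equal keys side by side, which is why list1.sort() without a key is enough there.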
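Check the concept:
>>> for gr, items in groupby(list1, key=keyf):
...     print("gr", gr)
...     print("items", list(items))
...
gr 1292
items ['House of Mine (1292) Item 24']
gr 1293
items ['House of Mine (1293) Item 21']
gr 1000
items ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']
>>> list1
['House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21',
 'The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ']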
Note that we have to call list on items, because items is an iterator over the members of the group, not a list.
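To illustrate why the list call is needed: each group shares the underlying iterator with groupby, so a group stops yielding items once groupby has moved on to the next one. A small sketch with made-up data (the names words and first_letter are only for illustration):

```python
from itertools import groupby

words = ["apple", "avocado", "banana"]  # hypothetical data, already sorted

def first_letter(word):
    return word[0]

grouped = groupby(words, key=first_letter)
key_a, group_a = next(grouped)
key_b, group_b = next(grouped)  # advancing groupby invalidates group_a

stale = list(group_a)  # the "a" group can no longer be read
fresh = list(group_b)  # the current group is still readable

print(key_a, stale)
print(key_b, fresh)
```

Materialising each group with list(items) while it is the current one, as done in the loop above, avoids this.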
Now use a list comprehension:
>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
And we are done.
If you want to group by all the text before the first "(", the only change is:
>>> keyf = lambda text: text.split("(")[0]
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ']]
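Using re.findall
The solution above assumes that "(" is the delimiter and ignores the requirement that it be followed by four digits. Such a task can be handled with re: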
>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '
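The difference between the two key functions only shows up when the text contains a "(" that is not followed by four digits. A short sketch comparing them on a made-up title (the sample string and the names key_split and key_regex are assumptions for illustration):

```python
import re

def key_split(text):
    # Everything before the first "(" regardless of what follows it
    return text.split("(")[0]

def key_regex(text):
    # Everything before a "(dddd)" group; the lookahead does not consume it
    return re.findall(r".+(?=\(\d{4}\))", text)[0]

sample = "House (old) of Mine (1293) Item 21"  # hypothetical title

split_result = key_split(sample)  # 'House '
regex_result = key_regex(sample)  # 'House (old) of Mine '

print(repr(split_result))
print(repr(regex_result))
```

Because .+ is greedy, a text with several four-digit groups would be cut at the last one rather than the first.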
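However, keyf raises IndexError: list index out of range if the text does not contain the expected pattern (we are then trying to access index 0 of an empty list):
>>> text = "nothing here"
>>> keyf(text)
Traceback (most recent call last):
  ...
IndexError: list index out of range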
We can use a simple trick to survive this: append the original text to the result so that the list is never empty:
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'
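If the list-concatenation trick feels too terse, the same fallback can be spelled out with re.search and an explicit check. This is only an equivalent sketch, not the answer's own code; the names PREFIX_RE and prefix_key are made up here:

```python
import re

# Same pattern as in keyf, compiled once
PREFIX_RE = re.compile(r".+(?=\(\d{4}\))")

def prefix_key(text):
    """Return the text before a "(dddd)" group, or the whole text if none."""
    match = PREFIX_RE.search(text)
    return match.group(0) if match else text

with_number = prefix_key("House of Mine (1293) Item 21")
without_number = prefix_key("nothing here")

print(repr(with_number))
print(repr(without_number))
```

re.search returns the first match, which is the same item that findall(...)[0] picks, so both versions should produce the same keys.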
Final solution using re
>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ']]
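Note that sorting changes the order of the items compared with the output asked for in the question (1292 now comes before 1293). If the original order should be kept, one possible alternative, which is not the groupby approach shown above, is to collect the items in a dict keyed by the prefix. A sketch assuming Python 3.7+ (where dicts keep insertion order); group_in_order and PREFIX_RE are names invented for this example:

```python
import re

PREFIX_RE = re.compile(r".+(?=\(\d{4}\))")

def group_in_order(items):
    """Group items by the text before "(dddd)", keeping first-seen order."""
    groups = {}
    for text in items:
        match = PREFIX_RE.search(text)
        key = match.group(0) if match else text  # fall back to the whole text
        groups.setdefault(key, []).append(text)
    return list(groups.values())

list1 = ["House of Mine (1293) Item 21",
         "House of Mine (1292) Item 24",
         "The yard (1000) Item 1 ",
         "The yard (1000) Item 2 ",
         "The yard (1000) Item 4 "]

result = group_in_order(list1)
print(result)
```

This also groups items with the same prefix even when they are not adjacent in the input, so no sort is needed.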