vas*_*sor 13 python string filtering list python-2.7
我在过滤字符串列表时遇到了一些麻烦.我在这里发现了类似的问题 但不是我需要的.
输入列表是:
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
Run Code Online (Sandbox Code Playgroud)
而预期的结果是
['ab', 'xc', 'sdfdg']
Run Code Online (Sandbox Code Playgroud)
结果中项目的顺序并不重要
过滤器功能必须快,因为列表的大小很大
我目前的解决方案是
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
print list(set(l))
Run Code Online (Sandbox Code Playgroud)
编辑
在使用大输入数据进行多次测试后,列出1500000个字符串,我的最佳解决方案是:
def filter(l):
if l==[]:
return []
l2=[]
l2.append(l[0])
llen = len(l)
k=0
itter = 0
while k<llen:
addkelem = ''
j=0
l2len = len(l2)
while j<l2len:
if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
l2[j]=l[k]
l.remove(l[k])
llen-=1
j-=1
addkelem = ''
continue
if (l[k].startswith(l2[j])):
addkelem = ''
break
elif(l[k] not in l2):
addkelem = l[k]
j+=1
if addkelem != '':
l2.append(addkelem)
addkelem = ''
k+=1
return l2
Run Code Online (Sandbox Code Playgroud)
执行时间约为213秒
样本输入数据 - 每行是列表中的字符串
Ben*_*ard 12
此算法在我的计算机上以0.97秒完成任务,作者提交的输入文件(154MB):
l.sort()
last_str = l[0]
filtered = [last_str]
app = filtered.append
for str in l:
if not str.startswith(last_str):
last_str = str
app(str)
# Commented because of the massive amount of data to print.
# print filtered
Run Code Online (Sandbox Code Playgroud)
算法很简单:首先按字典顺序对列表进行排序,然后搜索第一个字符串,该字符串不以列表的第一个为前缀,然后搜索下一个没有前缀的前一个字符串,等等.
如果列表已经排序(您的示例文件似乎已经排序),则可以删除该l.sort()行,这将导致时间和内存中的O(n)复杂性.
您可以按首字母对项目进行分组,只搜索子列表,除非字符串至少包含相同的第一个字母,否则任何字符串都不能以子字符串开头:
from collections import defaultdict
def find(l):
d = defaultdict(list)
# group by first letter
for ele in l:
d[ele[0]].append(ele)
for val in d.values():
for v in val:
# check each substring in the sublist
if not any(v.startswith(s) and v != s for s in val):
yield v
print(list(find(l)))
['sdfdg', 'xc', 'ab']
Run Code Online (Sandbox Code Playgroud)
这正确地过滤了单词,正如您从下面的代码中看到的那样,reduce函数不会,'tool'不应该在输出中:
In [56]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew"]
In [57]: reduce(r,l)
Out[57]: ['tool', 'ab', 'too', 'xc', 'sdfdg', 'xrew']
In [58]: list(find(l))
Out[58]: ['sdfdg', 'too', 'xc', 'xrew', 'ab']
Run Code Online (Sandbox Code Playgroud)
它也有效地做到了:
In [59]: l = ["".join(sample(ascii_lowercase, randint(2,25))) for _ in range(5000)]
In [60]: timeit reduce(r,l)
1 loops, best of 3: 2.12 s per loop
In [61]: timeit list(find(l))
1 loops, best of 3: 203 ms per loop
In [66]: %%timeit
..... result = []
....: for element in lst:
....: is_prefixed = False
....: for possible_prefix in lst:
....: if element is not possible_prefix and element.startswith(possible_prefix):
....: is_prefixed = True
....: break
....: if not is_prefixed:
....: result.append(element)
....:
1 loops, best of 3: 4.39 s per loop
In [92]: timeit list(my_filter(l))
1 loops, best of 3: 2.94 s per loop
Run Code Online (Sandbox Code Playgroud)
如果你知道最小字符串长度总是> 1你可以进一步优化,如果最小长度字符串是2一个单词必须至少有前两个字母的共同点:
def find(l):
d = defaultdict(list)
# find shortest length string to use as key length
mn = len(min(l, key=len))
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
for v in val:
if not any(v.startswith(s) and v != s for s in val):
yield v
In [84]: timeit list(find(l))
100 loops, best of 3: 14.6 ms per loop
Run Code Online (Sandbox Code Playgroud)
最后,如果你有欺骗,你可能想要从列表中筛选出重复的单词,但你需要保持它们进行比较:
from collections import defaultdict,Counter
def find(l):
d = defaultdict(list)
mn = len(min(l, key=len))
cn = Counter(l)
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
for v in val:
if not any(v.startswith(s) and v != s for s in val):
# make sure v is not a dupe
if cn[v] == 1:
yield v
Run Code Online (Sandbox Code Playgroud)
因此,如果速度很重要,使用上述代码的某些变体的实现将比您的天真方法快得多.内存中还存储了更多数据,因此您也应该考虑到这一点.
为了节省内存,我们可以为每个val/sublist创建一个计数器,这样我们一次只能存储一个单词的计数器字典:
def find(l):
d = defaultdict(list)
mn = len(min(l, key=len))
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
# we only need check each grouping of words for dupes
cn = Counter(val)
for v in val:
if not any(v.startswith(s) and v != s for s in val):
if cn[v] == 1:
yield v
Run Code Online (Sandbox Code Playgroud)
创建一个dict每个循环增加5ms所以仍然<20ms为5k字.
如果数据已排序,则reduce方法应该有效:
reduce(r,sorted(l)) # -> ['ab', 'sdfdg', 'too', 'xc', 'xrew']
Run Code Online (Sandbox Code Playgroud)
为了明确行为之间的区别:
In [202]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',
"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew","ab"]
In [203]: list(filter_list(l))
Out[203]: ['ab', 'too', 'xc', 'sdfdg', 'xrew', 'ab']
In [204]: list(find(l))
Out[204]: ['sdfdg', 'too', 'xc', 'xrew', 'ab', 'ab']
In [205]: reduce(r,sorted(l))
Out[205]: ['ab', 'sdfdg', 'too', 'xc', 'xrew']
In [206]: list(find_dupe(l))
Out[206]: ['too', 'xrew', 'xc', 'sdfdg']
In [207]: list(my_filter(l))
Out[207]: ['sdfdg', 'xrew', 'too', 'xc']
In [208]: "ab".startswith("ab")
Out[208]: True
Run Code Online (Sandbox Code Playgroud)
所以ab重复两次,所以使用一个集合或一个字典而不跟踪时间的ab出现将意味着我们认为没有其他元素满足条件ab "ab".startswith(other ) == True,我们可以看到它是不正确的.
您还可以使用itertools.groupby根据最小索引大小进行分组:
def find_dupe(l):
l.sort()
mn = len(min(l, key=len))
for k, val in groupby(l, key=lambda x: x[:mn]):
val = list(val)
for v in val:
cn = Counter(val)
if not any(v.startswith(s) and v != s for s in val) and cn[v] == 1:
yield v
Run Code Online (Sandbox Code Playgroud)
根据您的评论,我们可以调整我的第一个代码,如果您认为"dd".startswith("dd")重复元素不应该是True:
l = ['abbb', 'xc', 'abb', 'abed', 'sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd','sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd']
def find_with_dupe(l):
d = defaultdict(list)
# group by first letter
srt = sorted(set(l))
ind = len(srt[0])
for ele in srt:
d[ele[:ind]].append(ele)
for val in d.values():
for v in val:
# check each substring in the sublist
if not any(v.startswith(s) and v != s for s in val):
yield v
print(list(find_with_dupe(l)))
['abfdsdg', 'abed', 'abb', 'd', 'sdfdg', 'xc']
Run Code Online (Sandbox Code Playgroud)
在随机的文本样本上运行的运行只需要您自己的代码执行的时间的一小部分:
In [15]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [16]: timeit list(find(l))
100 loops, best of 3: 19 ms per loop
In [17]: %%timeit
....: l = open("/home/padraic/Downloads/sample.txt").read().split()
....: for i in range(0, len(l) - 1):
....: for j in range(i + 1, len(l)):
....: if l[j].startswith(l[i]):
....: l[j] = l[i]
....: else:
....: if l[i].startswith(l[j]):
....: l[i] = l[j]
....:
1 loops, best of 3: 4.92 s per loop
Run Code Online (Sandbox Code Playgroud)
两者返回相同的输出:
In [41]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [42]:
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
....:
In [43]:
In [43]: l2 = open("/home/padraic/Downloads/sample.txt").read().split()
In [44]: sorted(set(l)) == sorted(find(l2))
Out[44]: True
Run Code Online (Sandbox Code Playgroud)