Find all occurrences of a substring in Python

nuk*_*ukl 325 python regex string

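Python has string.find() and string.rfind() to get the index of a substring in a string.

I am wondering whether there is something like string.find_all() that can return all found indexes (not only the first from the beginning or the first from the end).

For example: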

string = "test test test test"

print(string.find('test'))   # 0
print(string.rfind('test'))  # 15

# this is the goal (str has no find_all method)
print(string.find_all('test'))  # [0, 5, 10, 15]

mar*_*cog 473

There is no simple built-in string function that does what you are looking for, but you can use the more powerful regular expressions:

import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]

If you want to find overlapping matches, a lookahead will do that:

[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]

If you want a reverse find-all without overlaps, you can combine a positive and a negative lookahead into an expression like this:

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]

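re.finditer returns an iterator, so you can change the [] above to () to get a generator instead of a list, which is more efficient if you only iterate over the results once.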

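  • You want to have a look at regular expressions: https://docs.python.org/2/howto/regex.html. The solution to your question would be: [m.start() for m in re.finditer('te[sx]t', 'text test text test')] (7 upvotes)
  • I would suggest escaping the search string as well, like this: `[m.start() for m in re.finditer(re.escape(search_str), input_str)]` (4 upvotes)
  • What is the time complexity of this approach? (3 upvotes)
  • @PranjalMittal. Upper or lower bound? Best, worst, or average case? (2 upvotes)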
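
The `re.escape` suggestion in the comments matters whenever the search string contains regex metacharacters. Here is a minimal sketch of the difference; `find_all_literal` is a hypothetical helper name chosen for illustration, not part of the answer:

```python
import re

def find_all_literal(haystack, needle):
    """Return start indexes of non-overlapping literal matches of needle."""
    # re.escape makes characters such as '.' match only themselves
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

text = 'a.b axb a.b'
# Unescaped, '.' matches any character, so 'axb' is reported as well
unescaped = [m.start() for m in re.finditer('a.b', text)]
escaped = find_all_literal(text, 'a.b')
print(unescaped)  # [0, 4, 8]
print(escaped)    # [0, 8]
```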

Kar*_*tel 98

>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

Thus, we can build it ourselves:

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]

No temporary strings or regexes required.

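  • To get overlapping matches, replace `start += len(sub)` with `start += 1`. (20 upvotes)
  • I believe your previous comment should be a postscript in your answer. (4 upvotes)
  • To match the behavior of `re.findall`, I would recommend adding `len(sub) or 1` instead of `len(sub)`, otherwise this generator never terminates on an empty substring. (3 upvotes)
  • See also the comment I made. It is an example of overlapping matches. (2 upvotes)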
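
The `len(sub) or 1` suggestion from the comments can be sketched as follows. This is a variant of the answer's generator with only the step changed, under the assumption that an empty substring should match at every position, as it does with `re.finditer`:

```python
def find_all(a_str, sub):
    """Yield the start index of every non-overlapping occurrence of sub."""
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        # len('') is 0, so fall back to a step of 1 to guarantee progress
        start += len(sub) or 1

print(list(find_all('spam spam spam spam', 'spam')))  # [0, 5, 10, 15]
print(list(find_all('abc', '')))                      # [0, 1, 2, 3]
```

The loop ends on the empty substring because `str.find` returns -1 once the start offset exceeds the length of the string.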

thk*_*ala 44

Here is a (very inefficient) way to get all (i.e. even overlapping) matches:

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

  • @thkala A very clever way of doing it without using the re module. Thanks for your answer! (3 upvotes)

Aki*_*oss 22

Again, an old thread, but here is my solution using a generator and plain str.find.

def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)

x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]

returns

[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]

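  • Tested: this is more than twice as fast as the `re.finditer` solution: `310 ns ± 5.35 ns per loop` for the `str.find` solution *vs* `799 ns ± 5.72 ns per loop` for `re.finditer` (on my machine). This confirms something I have noticed before: built-in string methods are often faster than regexes (the same holds for nested `str.replace` vs `re.sub`). (5 upvotes)
  • This looks beautiful! (3 upvotes)
  • The most beautiful solution. Note that it can easily be generalized by introducing an optional argument `overlapping=True` and replacing `i+1` with `i + (1 if overlapping else len(p))`. (2 upvotes)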
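
The generalization proposed in the last comment might look like the sketch below. The `overlapping` parameter comes from that comment, not from the answer itself, and the sketch assumes `p` is non-empty (with `overlapping=False` an empty pattern would give a step of 0 and loop forever):

```python
def findall(p, s, overlapping=True):
    """Yield all positions of the pattern p in the string s."""
    # Assumption: p is non-empty, so the step is always at least 1
    step = 1 if overlapping else len(p)
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i + step)

print(list(findall('aa', 'aaaa')))                     # [0, 1, 2]
print(list(findall('aa', 'aaaa', overlapping=False)))  # [0, 2]
```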

Chi*_*chi 21

You can use re.finditer() for non-overlapping matches.

>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print([(a.start(), a.end()) for a in re.finditer('is', aString)])
[(2, 4), (5, 7), (38, 40), (42, 44)]

But it will not work for overlapping matches:

In [1]: aString="ababa"

In [2]: print([(a.start(), a.end()) for a in re.finditer('aba', aString)])
Output: [(0, 3)]

  • Why make a list out of the iterator? It only slows the process down. (12 upvotes)
  • aString VS astring ;) (2 upvotes)
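
For the overlapping case that plain `re.finditer('aba', ...)` misses, one option is to wrap the pattern in a lookahead with a capturing group. A small sketch (the variable names `a_string` and `spans` are illustrative):

```python
import re

a_string = 'ababa'
# The lookahead itself is zero-width, so the scan advances one character
# at a time; group 1 records the span the pattern would have matched.
spans = [(m.start(1), m.end(1)) for m in re.finditer('(?=(aba))', a_string)]
print(spans)  # [(0, 3), (2, 5)]
```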

Cod*_*all 17

Come, let us recurse together.

def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""

    substring_length = len(substring)    
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found

    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

No need for regular expressions this way.

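  • This code has several problems. Since it will sooner or later process open-ended data, you will run into a `RecursionError` if there are enough occurrences. Another is that it creates two throwaway lists on each iteration just to append one element, which is very suboptimal for a string-finding function that may be called many times. Recursive functions can look elegant and clear, but they should be used with caution. (3 upvotes)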
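
An iterative rewrite avoids both issues raised in the comment. The sketch below keeps the answer's function name and behavior (non-overlapping matches); the `or 1` step for empty substrings is an added assumption:

```python
def locations_of_substring(string, substring):
    """Return a list of non-overlapping locations of a substring."""
    locations = []
    # Assumption: step by at least 1 so an empty substring cannot loop forever
    step = len(substring) or 1
    start = string.find(substring)
    while start != -1:
        locations.append(start)  # appends in place, no throwaway lists
        start = string.find(substring, start + step)
    return locations

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

# Far more matches than the default recursion limit would allow
many = locations_of_substring('ab' * 100000, 'ab')
print(len(many))  # 100000
```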

jst*_*aab 11

If you are just looking for a single character, this will work:

from functools import reduce  # reduce is not a builtin in Python 3

string = "dooobiedoobiedoobie"
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7

Also,

string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4

My hunch is that neither of these (especially #2) is terribly performant.
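
Both snippets above compute a count rather than a list of indexes. If the count is all that is needed, the built-in `str.count` gives the same numbers directly; it counts non-overlapping occurrences. A short sketch (the variable names are illustrative):

```python
string = "dooobiedoobiedoobie"
o_count = string.count('o')
test_count = "test test test test".count("test")
# Occurrences are counted without overlap: 'ttt' contains 'tt' once
tt_count = "ttt".count("tt")
print(o_count, test_count, tt_count)  # 7 4 1
```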


Anonymous 8

This is an old thread, but I got interested and wanted to share my solution.

def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1 #change to k += len(sub) to not search overlapping results
    return result

It should return a list of positions where the substring was found. Please comment if you see an error or room for improvement.


Bru*_*len 8

This works for me, using re.finditer:

import re

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

# find all occurrences of the word 'as' in the above text

find_the_word = re.finditer('as', text)

for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))


Moh*_*ari 7

You can try:

import re
str1 = "This dress looks good; you have good taste in clothes."
substr = "good"
result = [_.start() for _ in re.finditer(substr, str1)]
# result = [17, 32]


Anonymous 5

This thread is a little old, but this worked for me:

numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"

marker = 0
while marker < len(numberString):
    try:
        # index() raises ValueError once there are no further matches
        found = numberString.index(testString, marker)
        print(found)
        marker = found + 1
    except ValueError:
        print("String not found")
        marker = len(numberString)


Har*_*ani 5

You can try:

>>> string = "test test test test"
>>> for index in range(len(string)):
...     if string[index:index + len("test")] == "test":
...         print(index)
...
0
5
10
15