在字符串中查找第一个非重复字符的最佳方法

San*_*ndy 8 python algorithm

找到字符串的第一个非重复字符的最佳空间和时间效率解决方案是aabccbdcbe什么?

这里的答案是d.因此,让我感到震惊的是,它可以通过两种方式完成:

  1. 对于每个索引,我循环i-1次并检查该字符是否再次出现.但这并不高效:此方法的增长为O(N ^ 2),其中N是字符串的长度.
  2. 另一种可能的好方法是,如果我可以形成树或任何其他ds,以便我可以根据权重(出现次数)对字符进行排序.这可能只需要一个长度为N的循环通过字符串形成结构.那只是O(N)+ O(构建树或任何ds的时间).

Aar*_*our 17

这是一个非常简单的O(n)解决方案:

def fn(s):
  order = []
  counts = {}
  for x in s:
    if x in counts:
      counts[x] += 1
    else:
      counts[x] = 1 
      order.append(x)
  for x in order:
    if counts[x] == 1:
      return x
  return None
Run Code Online (Sandbox Code Playgroud)

我们遍历字符串一次.当我们遇到一个新角色时,我们将其存储counts为值1,并将其附加到order.当我们遇到一个我们以前见过的角色时,我们会增加它的价值counts.最后,我们循环order直到找到值为1in 的字符counts并返回它.


Rom*_*nko 7

我认为从字符串中删除重复字符可能会显着减少操作次数.例如:

s = "aabccbdcbe"
while s != "":
    slen0 = len(s)
    ch = s[0]
    s = s.replace(ch, "")
    slen1 = len(s)
    if slen1 == slen0-1:
        print ch
        break;
else:
    print "No answer"
Run Code Online (Sandbox Code Playgroud)


Tyr*_*ave 6

如果字符仅出现一次,则列表理解将按照它们出现的顺序为您提供:

In [61]: s = 'aabccbdcbe'

In [62]: [a for a in s if s.count(a) == 1]
Out[62]: ['d', 'e']
Run Code Online (Sandbox Code Playgroud)

然后只返回第一个条目:

In [63]: [a for a in s if s.count(a) == 1][0]
Out[63]: 'd'
Run Code Online (Sandbox Code Playgroud)

如果您只需要第一个条目,那么生成器也可以正常工作:

In [69]: (a for a in s if s.count(a) == 1).next()
Out[69]: 'd'
Run Code Online (Sandbox Code Playgroud)

  • 这也是O(N²),因为`count`是O(N),虽然我声称你不能做得更好...... (2认同)

eyq*_*uem 5

搜索速度取决于几个因素:

  • 字符串的长度
  • 之前没有一次性出现的字符的位置
  • 此位置后字符串的大小
  • 字符串中出现的不同字符的数量

.

在下面的代码,我首先定义了一个字符串s
的帮助下random.choice(),并命名一批一次性出现的人物unik
两个字符串s1s2我连击:s1 + s2
其中:

  • s1是一个长度的字符串,nwo其中没有任何一次性出现的字符
  • s2是一个长度的字符串,nwi其中有一次出现的字符

.

#### creation of s from s1 and s2 #########

from random import choice

def without(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    for i in xrange(n):
        c = choice(letters)
        if c not in unik:
            yield c

def with_un(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    ecr = []
    for i in xrange(n):
        c = choice(letters)
        #ecr.append('%d %s  len(letters) == %d' % (i,c,len(letters)))
        yield c
        if c in unik:
            letters.remove(c)
    #print '\n'.join(ecr)

unik = 'ekprw'
nwo,nwi = 0,500
s1 = ''.join(c for c in without(unik,nwo))
s2 = ''.join(c for c in with_un(unik,nwi))
s = s1 + s2

if s1:
    print '%-27ss2 : %d chars' % ('s1 : %d chars' % len(s1),len(s2))
    for el in 'ekprw':
        print ('s1.count(%s) == %-12ds2.count(%s) == %d'
               % (el,s1.count(el),el,s2.count(el)))
    others = [c for c in 'abcdefghijklmnopqrstuvwxyz' if c not in unik]
    print 's1.count(others)>1 %s' % all(s1.count(c)>1 for c in others)
else:
    print "s1 == ''     len(s2) == %d" % len(s2)
    for el in 'ekprw':
        print ('   -         s2.count(%s) == %d'
               % (el,s2.count(el)))
print 'len of s  == %d\n' % len(s)
Run Code Online (Sandbox Code Playgroud)

然后是基准测试。
改变数字nwonwi我们看到对速度的影响:

### benchmark of three solutions #################

from time import clock


# Janne Karila
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass
te = clock()
c = OrderedCounter(s)
rjk = (item for item, count in c.iteritems() if count == 1).next()
tf = clock()-te
print 'Janne Karila  %.5f    found: %s' % (tf,rjk)

# eyquem
te = clock()
candidates = set(s)
li = []
for x in s:
    if x in candidates:
        li.append(x)
        candidates.remove(x)
    elif x in li:
        li.remove(x)
rey = li[0]
tf = clock()-te
print 'eyquem        %.5f    found: %s' % (tf,rey)

# TyrantWave
te = clock()
rty = (a for a in s if s.count(a) == 1).next()
tf = clock()-te
print 'TyrantWave    %.5f    found: %s' % (tf,rty)
Run Code Online (Sandbox Code Playgroud)

.

一些结果

对于s1空长度,nwo = 0 和 nwi = 50:

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila  0.00077    found: e
eyquem        0.00013    found: e
TyrantWave    0.00005    found: e
Run Code Online (Sandbox Code Playgroud)

TyrantWave 的解决方案更快,因为在字符串的第一个位置快速找到第一个出现的字符

.

使用 nwo = 300 和 nwi = 50
(以下为 401 个字符,s1因为在 的构造过程中没有保留一次性出现的字符s1,请参阅无 () 函数)

s1 : 245 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 295

Janne Karila  0.00167    found: e
eyquem        0.00030    found: e
TyrantWave    0.00042    found: e
Run Code Online (Sandbox Code Playgroud)

这次 TyrantWave 的解决方案比我的要长,因为它必须计算第一部分中所有字符的出现次数,s也就是说s1其中没有一次性出现的字符(它们在第二部分中s2
但是,为了使用我的解决方案获得更短的时间,nwo需要明显大于nwi

.

nwo = 300 和 nwi = 5000

s1 : 240 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5240

Janne Karila  0.01510    found: p
eyquem        0.00534    found: p
TyrantWave    0.00294    found: p
Run Code Online (Sandbox Code Playgroud)

如果s2提高了 的长度,那么 TyrantWave 的解决方案再次更好。

.

总结你想要的

.

编辑

罗马的好主意!
我在我的基准测试中添加了 Roman 的解决方案,它赢了!

我还做了一些微小的修改来改进他的解决方案。

# Roman Fursenko
srf = s[:]
te = clock()
while srf != "":
    slen0 = len(srf)
    ch = srf[0]
    srf = srf.replace(ch, "")
    slen1 = len(srf)
    if slen1 == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman Fursenko %.6f    found: %s' % (tf,rrf)

# Roman Fursenko improved
srf = s[:]
te = clock()
while not(srf is ""):
    slen0 = len(srf)
    srf = srf.replace(srf[0], "")
    if len(srf) == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman improved %.6f    found: %s' % (tf,rrf)

print '\nindex of %s in the string :  %d' % (rty,s.index(rrf))
Run Code Online (Sandbox Code Playgroud)

.

结果是:

.

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0032538    found: r
eyquem         0.0001249    found: r
TyrantWave     0.0000534    found: r
Roman Fursenko 0.0000299    found: r
Roman improved 0.0000263    found: r

index of r in the string :  1
Run Code Online (Sandbox Code Playgroud)

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 0
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0008183    found: a
eyquem         0.0001285    found: a
TyrantWave     0.0000550    found: a
Roman Fursenko 0.0000433    found: a
Roman improved 0.0000391    found: a

index of a in the string :  4
Run Code Online (Sandbox Code Playgroud)

>

s1 : 240 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 0
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 290

Janne Karila   0.0016390    found: e
eyquem         0.0002956    found: e
TyrantWave     0.0004112    found: e
Roman Fursenko 0.0001428    found: e
Roman improved 0.0001277    found: e

index of e in the string :  242
Run Code Online (Sandbox Code Playgroud)

s1 : 241 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5241

Janne Karila   0.0148231    found: r
eyquem         0.0053283    found: r
TyrantWave     0.0030166    found: r
Roman Fursenko 0.0007414    found: r
Roman improved 0.0007230    found: r

index of r in the string :  250
Run Code Online (Sandbox Code Playgroud)

由于 Roman 的代码,我学到了一些东西:
s.replace()创建一个新字符串,因此我认为这是一种缓慢的方法。
但是,我不知道出于什么原因,这是一种非常快速的方法。

.

编辑 2

Oin 的解决方案是最糟糕的:

# Oin
from operator import itemgetter
seen = set()
only_appear_once = dict()
te = clock()
for i, x in enumerate(s):
  if x in seen and x in only_appear_once:
    only_appear_once.pop(x)
  else:
    seen.add(x)
    only_appear_once[x] = i
  fco = min(only_appear_once.items(),key=itemgetter(1))[0]
tf = clock()-te
print 'Oin            %.7f    found: %s' % (tf,fco)
Run Code Online (Sandbox Code Playgroud)

结果

s1 == ''     len(s2) == 50
Oin            0.0007124    found: e
Janne Karila   0.0008057    found: e
eyquem         0.0001252    found: e
TyrantWave     0.0000712    found: e
Roman Fursenko 0.0000335    found: e
Roman improved 0.0000335    found: e

index of e in the string :  2


s1 : 237 chars             s2 : 50 chars
Oin            0.0029783    found: k
Janne Karila   0.0014714    found: k
eyquem         0.0002889    found: k
TyrantWave     0.0005598    found: k
Roman Fursenko 0.0001458    found: k
Roman improved 0.0001372    found: k

index of k in the string :  246


s1 : 236 chars             s2 : 5000 chars
Oin            0.0801739    found: e
Janne Karila   0.0155715    found: e
eyquem         0.0044623    found: e
TyrantWave     0.0027548    found: e
Roman Fursenko 0.0007255    found: e
Roman improved 0.0007199    found: e

index of e in the string :  244
Run Code Online (Sandbox Code Playgroud)