In Python, how do I check whether a string contains only certain characters?

Question

e70*_*e70 47 python regex search character

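I need to check that a string contains only a..z, 0..9 and . (period), and no other character.

I could iterate over each character and check whether it is a..z, 0..9 or ., but that would be slow.

I am not clear yet on how to do this with a regular expression.

Is the code below correct? Can you suggest a simpler regular expression or a more efficient approach?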

import re

# Valid chars: . a-z 0-9
def check(test_str):
    # http://docs.python.org/library/re.html
    # re.search returns None if no position in the string matches the pattern
    # pattern to search for any character other than . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        # A character other than . a-z 0-9 was found
        print('Invalid : %r' % (test_str,))
    else:
        # No character other than . a-z 0-9 was found
        print('Valid   : %r' % (test_str,))

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : 'abcde.1'
Invalid : 'abcde.1#'
Invalid : 'ABCDE.12'
Invalid : '_-/>"!@#12345abcde<'
'''

Answer 1

Joh*_*kin 59

Here is a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    # True when every character of test_str is in the allowed set
    return set(test_str) <= allowed

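Regarding performance, iteration will probably be the fastest method. A regex has to step through a state machine, and the set-based solution has to build a temporary set. However, the difference is unlikely to matter much. If the performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).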
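The three pure-Python strategies mentioned above can be sketched side by side so they are easy to time on your own data. The function names here are illustrative and not part of the answer, and no timings are claimed: results depend on the Python version and on the length of the input.

```python
import re
import string
import timeit

ALLOWED = frozenset(string.ascii_lowercase + string.digits + '.')
_INVALID = re.compile(r'[^a-z0-9.]')

def check_loop(s):
    # Plain iteration: stop at the first disallowed character.
    for c in s:
        if c not in ALLOWED:
            return False
    return True

def check_set(s):
    # Builds a temporary set of the input's characters, then tests subset.
    return set(s) <= ALLOWED

def check_regex(s):
    # Searches for any single character outside the allowed class.
    return _INVALID.search(s) is None

if __name__ == '__main__':
    sample = 'jsdlfjdsf12324..3432jsdflsdf'
    for fn in (check_loop, check_set, check_regex):
        seconds = timeit.timeit(lambda: fn(sample), number=100000)
        print('%-12s %.3f s' % (fn.__name__, seconds))
```

All three accept the empty string; add an explicit length check if that should be rejected.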

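Here is a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests it performs very well (2 seconds vs. 9 seconds in a benchmark against the regex).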

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension
setup(
    name='stringcheck',
    # build stringcheck.c into an importable extension module
    ext_modules=[Extension('stringcheck', ['stringcheck.c'])],
)

Use as:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False

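I can't say I like seeing solutions downvoted as a reaction to "it's slower than mine/another solution". If it's **wrong**, then downvoting makes sense. But even in "code golf" questions, an answer that isn't the smallest doesn't get downvoted; it just won't collect as many upvotes over time. (5 upvotes)
If a function returns "true" for invalid text, it has failed. An exception is unexpected, but it doesn't let execution continue along the code path meant for a correct string, so it is not a failure. If data is pulled into the program from an external source (e.g. from a file or a database), it is user input and should be checked before use. This includes checking that strings are valid UTF-8 (or whatever encoding is used for storage). (3 upvotes)
@Nadia: Your solution is incorrect. If I wanted fast and wrong results, I'd ask my cat. (2 upvotes)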

Answer 2

Joh*_*hin 35

Final(?) edit

The answer, wrapped up in a function, with an annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>

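Note: there is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True. This is puzzling (perhaps it is the cost of returning a MatchObject instead of None) and may warrant further rummaging.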

==== Earlier text ====

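The [previously] accepted answer could use a few improvements:

(1) The presentation gives the appearance of being the result of an interactive Python session: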

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True

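but match() doesn't return True

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^

(3) It should foster the use of raw strings automatically, unthinkingly, for any re pattern

(4) The backslash in front of the dot/period is redundant

(5) It is slower than the OP's code!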

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop

(6) It can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False

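Points (1), (4) and (6) above can be checked in a few lines. This is a small sketch, assuming Python 3.4 or later so that re.fullmatch() is available; the helper name only_allowed is hypothetical and not part of the answer.

```python
import re

# (1) match() returns a match object or None, never True/False.
m = re.compile(r'[a-z0-9.]+\Z').match('abc.123')
print(type(m).__name__, bool(m))

# (4) Inside a character class an unescaped dot is a literal dot.
assert re.compile(r'[.]').match('x') is None

# (6) `$` also matches just before a trailing newline; `\Z` does not.
dollar = re.compile(r'[a-z0-9.]+$')
strict = re.compile(r'[a-z0-9.]+\Z')
print(bool(dollar.match('1234\n')))   # True
print(bool(strict.match('1234\n')))   # False

# fullmatch() requires the whole string to match, so no anchors are needed.
def only_allowed(s):
    return re.fullmatch(r'[a-z0-9.]+', s) is not None
```

Unlike special_match above, this only_allowed rejects the empty string because of the `+`; use `*` if empty input should pass.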

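+1 Thanks for correcting my answer. I forgot that match only checks for a match at the beginning of the string. Ingenutrix, I think you should select this answer as accepted. (3 upvotes)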

Answer 3

Mar*_*off 29

A simpler approach? A little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it sure is readable.

`ok = dict.fromkeys("0123456789abcdef")` would speed this up without hurting readability. (3 upvotes)
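The idea behind that comment is that `c in ok` scans a string linearly but is a hash lookup on a dict or set. A minimal sketch using a frozenset instead of a dict (my substitution, not the commenter's code; the name is_hex is hypothetical):

```python
OK_STR = "0123456789abcdef"
OK_SET = frozenset(OK_STR)

def is_hex(text, ok=OK_SET):
    # all() short-circuits at the first character that is not allowed.
    return all(c in ok for c in text)

print(is_hex("123456abc"))     # True
print(is_hex("hello world"))   # False
```

With only 16 allowed characters the gain is likely small; it should matter more as the allowed alphabet grows.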

Answer 4

Nad*_*mli 14

Edit: changed the regular expression to exclude A-Z

The regular expression solution is the fastest pure-Python solution so far

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared with the other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings, change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>>bool(reg.match(''))
True

As requested, I am restoring the other part of the answer. Note, however, that the following also accepts the A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT: Using isalnum is much more efficient than the set solution

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

EDIT2: John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

It is still 3 times faster than the set solution

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion, using regular expressions is the best way to solve this problem
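The caveats of the isalnum approach described above can be checked directly. This is a sketch assuming Python 3 str semantics; the two helper names are hypothetical.

```python
def isalnum_check(s):
    # Drop the dots, then require every remaining character to be alphanumeric.
    return s.replace('.', '').isalnum()

def ascii_isalnum_check(s):
    # Non-ASCII characters are encoded as b'?', which is not alphanumeric.
    return s.replace('.', '').encode('ascii', 'replace').isalnum()

print(isalnum_check('test123.3'))        # True
print(isalnum_check('ABC'))              # True: uppercase is accepted
print(isalnum_check('abc\u0661'))        # True: ARABIC-INDIC DIGIT ONE is alphanumeric
print(ascii_isalnum_check('abc\u0661'))  # False
print(ascii_isalnum_check('ABC'))        # True: uppercase is still accepted
print(isalnum_check('...'))              # False: nothing is left after removing the dots
```

Neither variant restricts input to a-z, 0-9 and the period, so they only fit when uppercase letters are acceptable.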

Answer 5

Kin*_*cal 5

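This has already been answered satisfactorily, but for people who come across it after the fact, I have profiled several different ways of doing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations: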

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    return not bool(re.search(r'[^A-F0-9]', input))

The tests, run in Python 3.4.0 on Mac OS X:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

The results were as follows:

         50335004 function calls in 13.428 seconds

   Ordered by: cumulative time, function name
   List reduced from 20 to 8 due to restriction

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

Where:

ncalls: the number of times the function was called
tottime: the total time spent in the given function, excluding time spent in sub-functions
percall: the quotient of tottime divided by ncalls
cumtime: the cumulative time spent in this function and all sub-functions
percall: the quotient of cumtime divided by primitive calls

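The columns we actually care about are cumtime and its percall, as they show the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.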

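If the alternative is compiling the regex on every call, it is faster not to compile it explicitly at all. Compiling once is about 7.5% faster than compiling every time, but only about 2.5% faster than not compiling (calling re.match directly).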
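The three regex call styles compared above can be isolated in a small timeit sketch. The function names are illustrative, and no timings are claimed since they vary by machine and Python version. The re module caches recently compiled patterns, which would explain why the differences are small.

```python
import re
import timeit

PATTERN = r'[A-F0-9]+\Z'
HEX_MATCH = re.compile(PATTERN)

def precompiled(s):
    # Pattern compiled once at import time.
    return bool(HEX_MATCH.match(s))

def compile_each_call(s):
    # re.compile() consults the module's pattern cache, so this is a
    # lookup plus call overhead rather than a full recompilation.
    return bool(re.compile(PATTERN).match(s))

def module_level(s):
    # re.match() uses the same cache internally.
    return bool(re.match(PATTERN, s))

if __name__ == '__main__':
    sample = 'DEADBEEF0123456789' * 50
    for fn in (precompiled, compile_each_call, module_level):
        seconds = timeit.timeit(lambda: fn(sample), number=20000)
        print('%-18s %.4f s' % (fn.__name__, seconds))
```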

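test_set was roughly twice as slow as re_search and about 2.4 times as slow as re_match (0.863 s cumulative vs. about 0.47 s and 0.37 s)

test_not_any was a full order of magnitude slower than test_set

TL;DR: Use re.match or re.search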

Asked:	16 years, 5 months ago
Viewed:	112584 times
Last active:	7 years ago