Sun*_*rma 132 python performance cpython timeit python-internals
I was playing around with timeit and noticed that doing a simple list comprehension over a small string took longer than doing the same operation on a list of small single-character strings. Any explanation? It's almost 1.35 times as long.
>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861
What's happening at a lower level that causes this?
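For reference, the same comparison can be scripted with timeit.repeat, taking the best of several runs (timings will of course vary by machine and Python version):

```python
from timeit import repeat

# Best-of-5 timings for the two comprehensions from the question.
string_time = min(repeat("[x for x in 'abc']", number=1_000_000, repeat=5))
list_time = min(repeat("[x for x in ['a', 'b', 'c']]", number=1_000_000, repeat=5))
print(f"string: {string_time:.3f}s  list: {list_time:.3f}s  "
      f"ratio: {string_time / list_time:.2f}")
```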
Vee*_*rac 194
For Python 2, the actual speed difference is closer to 70% (or more) once much of the overhead is stripped away.
Object creation is not at fault. Neither method creates a new object, because one-character strings are cached.
The difference is unobvious, but is likely due to a greater number of checks on string indexing, with regard to the type and well-formedness. It is also quite likely caused by the need to check what to return.
List indexing is remarkably fast.
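The caching claim is easy to check directly. This relies on a CPython implementation detail (one-character Latin-1 strings are shared singletons), so other interpreters may behave differently:

```python
s = "abc"
t = "xbz"

# Indexing does not allocate: both expressions fetch the same
# cached one-character string object (a CPython detail).
print(s[1] is t[1])  # True on CPython: the cached "b"
print(s[0] is s[0])  # True on CPython: repeated indexing, same object
```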
>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop
>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop
That disagrees with what you found...
You must be using Python 2, then.
>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop
>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop
Let's explain the difference between the versions. I'll examine the compiled code.
For Python 3:
import dis
def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>> 4 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>> 3 LOAD_CONST 2 ('list_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 3 ('a')
#>>> 12 LOAD_CONST 4 ('b')
#>>> 15 LOAD_CONST 5 ('c')
#>>> 18 BUILD_LIST 3
#>>> 21 GET_ITER
#>>> 22 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 25 POP_TOP
#>>> 26 LOAD_CONST 0 (None)
#>>> 29 RETURN_VALUE
def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>> 21 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>> 3 LOAD_CONST 2 ('string_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 3 ('abc')
#>>> 12 GET_ITER
#>>> 13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 16 POP_TOP
#>>> 17 LOAD_CONST 0 (None)
#>>> 20 RETURN_VALUE
You can see here that the list variant is likely to be slower due to the building of the list each time.
This is the
9 LOAD_CONST 3 ('a')
12 LOAD_CONST 4 ('b')
15 LOAD_CONST 5 ('c')
18 BUILD_LIST 3
part. The string variant only has
9 LOAD_CONST 3 ('abc')
You can check that this does seem to make a difference:
def string_iterate():
    [item for item in ("a", "b", "c")]

dis.dis(string_iterate)
#>>> 35 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>> 3 LOAD_CONST 2 ('string_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 6 (('a', 'b', 'c'))
#>>> 12 GET_ITER
#>>> 13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 16 POP_TOP
#>>> 17 LOAD_CONST 0 (None)
#>>> 20 RETURN_VALUE
This produces only
9 LOAD_CONST 6 (('a', 'b', 'c'))
since tuples are immutable. Test:
>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop
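The constant-folding of the tuple can also be confirmed without reading the disassembly, by inspecting the compiled code object's constants (the exact contents of co_consts vary between CPython versions, but the folded tuple should be there):

```python
# CPython folds a tuple of constants into a single constant at compile
# time, so no per-element build step runs at execution time.
code = compile('[x for x in ("a", "b", "c")]', "<test>", "eval")
print(("a", "b", "c") in code.co_consts)  # True on CPython
```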
Great, back up to speed.
For Python 2:
def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>> 2 0 BUILD_LIST 0
#>>> 3 LOAD_CONST 1 ('a')
#>>> 6 LOAD_CONST 2 ('b')
#>>> 9 LOAD_CONST 3 ('c')
#>>> 12 BUILD_LIST 3
#>>> 15 GET_ITER
#>>> >> 16 FOR_ITER 12 (to 31)
#>>> 19 STORE_FAST 0 (item)
#>>> 22 LOAD_FAST 0 (item)
#>>> 25 LIST_APPEND 2
#>>> 28 JUMP_ABSOLUTE 16
#>>> >> 31 POP_TOP
#>>> 32 LOAD_CONST 0 (None)
#>>> 35 RETURN_VALUE
def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>> 2 0 BUILD_LIST 0
#>>> 3 LOAD_CONST 1 ('abc')
#>>> 6 GET_ITER
#>>> >> 7 FOR_ITER 12 (to 22)
#>>> 10 STORE_FAST 0 (item)
#>>> 13 LOAD_FAST 0 (item)
#>>> 16 LIST_APPEND 2
#>>> 19 JUMP_ABSOLUTE 7
#>>> >> 22 POP_TOP
#>>> 23 LOAD_CONST 0 (None)
#>>> 26 RETURN_VALUE
Strangely, we have the same building of the list, yet it's still faster. Python 2 is acting strangely fast.
Let's remove the comprehensions and re-time. The _ = is to prevent the statement from being optimised out.
>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop
>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop
We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense, as Python 3 changed comprehensions to have safer scoping.
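The scoping change is easy to observe: in Python 2 the comprehension's loop variable leaked into the enclosing scope, while in Python 3 the comprehension runs in its own function scope. A quick Python 3 sketch:

```python
# Python 3: the loop variable is local to the comprehension.
[x for x in "abc"]
try:
    x
    print("x leaked (Python 2 behaviour)")
except NameError:
    print("x did not leak (Python 3 behaviour)")
```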
Well, now let's improve the benchmark (I'm just removing overhead that isn't iteration). This removes the building of the iterable by pre-assigning it:
>>> python3 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop
>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop
>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop
We can check if calling iter is the overhead:
>>> python3 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop
>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop
>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop
No. No it is not. The difference is too small, especially for Python 3.
So let's remove yet more unwanted overhead... by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop
>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop
This hasn't actually changed much, but it's helped a little.
So remove the comprehension. It's overhead that's not part of the question:
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop
>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop
That's more like it! We can get slightly faster still by using deque to iterate. It's basically the same, but it's faster:
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
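The deque(iterable, maxlen=0) trick used above is the standard way to exhaust an iterator at C speed while discarding every item (it's the "consume" recipe from the itertools documentation):

```python
from collections import deque

def consume(iterator):
    """Run an iterator to exhaustion, discarding all items."""
    deque(iterator, maxlen=0)

it = iter("abc")
consume(it)
print(next(it, "exhausted"))  # "exhausted": nothing left to yield
```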
What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:
bytes
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)'  # :(
1000 loops, best of 3: 571 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 394 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 757 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
Here you see Python 3 actually faster than Python 2.
unicode
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join( chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 800 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [ chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 394 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 1.07 msec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 469 usec per loop
Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).
In fact, this unicode-bytes difference is very small, which is impressive.
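One caveat when comparing these benchmarks: in Python 3, iterating a bytes object yields ints rather than one-character strings, so the bytes and str loops are not producing the same kind of objects:

```python
# Python 3: bytes iteration yields small ints (which are also cached),
# while str iteration yields one-character strings.
print(list(b"abc"))  # [97, 98, 99]
print(list("abc"))   # ['a', 'b', 'c']
```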
So let's analyse this one case, seeing as it's fast and convenient for me:
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
We can actually rule out Tim Peters' ten-times-upvoted answer!
>>> foo = iterable[123]
>>> iterable[36] is foo
True
But this is worth mentioning: indexing costs. The difference will likely be in the indexing, so remove the iteration and just index:
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop
The difference seems small, but at least half of the cost is overhead:
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop
so the speed difference is sufficient to decide to blame it. I think.
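The baseline subtraction done above can be made explicit in code. This is a rough sketch (timeit noise easily swamps numbers this small, so treat the result as an estimate only):

```python
import timeit

setup = "iterable = [chr(i % 128) for i in range(100000)]"
n = 10_000_000

indexed = timeit.timeit("iterable[123]", setup=setup, number=n)
baseline = timeit.timeit("iterable; 123", setup=setup, number=n)

# Subtract the name-lookup/loop overhead to estimate raw indexing cost.
print(f"net list-indexing cost: {(indexed - baseline) / n * 1e9:.2f} ns/op")
```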
So why is indexing a list so much faster?
Well, I'll come back to you on that, but my guess is that it's down to the check for interned strings (or cached characters, if that's a separate mechanism). That will be less fast than optimal. But I'll go check the source (although I'm not comfortable in C...) :).
So here's the source:
static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 ch;
    PyObject *res;

    if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
        PyErr_BadArgument();
        return NULL;
    }
    if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
        PyErr_SetString(PyExc_IndexError, "string index out of range");
        return NULL;
    }
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    ch = PyUnicode_READ(kind, data, index);
    if (ch < 256)
        return get_latin1_char(ch);

    res = PyUnicode_New(1, ch);
    if (res == NULL)
        return NULL;
    kind = PyUnicode_KIND(res);
    data = PyUnicode_DATA(res);
    PyUnicode_WRITE(kind, data, 0, ch);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}
Walking from the top, we'll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is
ch = PyUnicode_READ(kind, data, index);
but we'd hope that is fast, as we're reading from a contiguous C array by indexing it. The result, ch, will be less than 256, so we'll return the cached character in get_latin1_char(ch).
So we'll run (dropping the first checks)
kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);
Where
#define PyUnicode_KIND(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject *)(op))->state.kind)
(which is boring because the asserts get ignored in debug [so I can check that they're fast] and ((PyASCIIObject *)(op))->state.kind is (I think) an indirection and a C-level cast);
#define PyUnicode_DATA(op) \
(assert(PyUnicode_Check(op)), \
PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
_PyUnicode_NONCOMPACT_DATA(op))
(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),
#define PyUnicode_READ(kind, data, index) \
((Py_UCS4) \
((kind) == PyUnicode_1BYTE_KIND ? \
((const Py_UCS1 *)(data))[(index)] : \
((kind) == PyUnicode_2BYTE_KIND ? \
((const Py_UCS2 *)(data))[(index)] : \
((const Py_UCS4 *)(data))[(index)] \
) \
))
(which involves indexes but really isn't slow at all) and
static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}
Which confirms my suspicion that:
This is cached:
PyObject *unicode = unicode_latin1[ch];
This should be fast. The if (!unicode) is not run, so it's literally equivalent in this case to
PyObject *unicode = unicode_latin1[ch];
Py_INCREF(unicode);
return unicode;
Honestly, after testing that the asserts are fast (by disabling them [I think it works on the C-level asserts...]), the only plausibly-slow parts are:
PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)
Which are:
#define PyUnicode_IS_COMPACT(op) \
(((PyASCIIObject*)(op))->state.compact)
(fast, as before),
#define _PyUnicode_COMPACT_DATA(op) \
(PyUnicode_IS_ASCII(op) ? \
((void*)((PyASCIIObject*)(op) + 1)) : \
((void*)((PyCompactUnicodeObject*)(op) + 1)))
(fast if the macro IS_ASCII is fast), and
#define _PyUnicode_NONCOMPACT_DATA(op) \
(assert(((PyUnicodeObject*)(op))->data.any), \
((((PyUnicodeObject *)(op))->data.any)))
(also fast as it's an assert plus an indirection plus a cast).
So we're down (the rabbit hole) to:
PyUnicode_IS_ASCII
which is
#define PyUnicode_IS_ASCII(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject*)op)->state.ascii)
Hmm... that seems fast too...
Well, OK, but let's compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        if (indexerr == NULL) {
            indexerr = PyUnicode_FromString(
                "list index out of range");
            if (indexerr == NULL)
                return NULL;
        }
        PyErr_SetObject(PyExc_IndexError, indexerr);
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}
We can see that on non-error cases this is just going to run:
PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]
Where PyList_Check is
#define PyList_Check(op) \
PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)
(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like... yeah. Damn. They put Skeet to shame.
#define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
#define PyType_FastSubclass(t,f) PyType_HasFeature(t,f)
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f) ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f) (((t)->tp_flags & (f)) != 0)
#endif
So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case... ???
Then there's the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we're done.
So there are definitely fewer checks for lists, and the small speed differences certainly imply that it could be relevant.
I think in general there's just more type-checking and indirection (->) for Unicode. It seems I'm missing a point, but what?
Tim*_*ers 31
When you loop over most container objects (lists, tuples, dicts, ...), the iterator delivers the objects that are in the container.
But when you loop over a string, a new object has to be created for each character delivered - a string is not a "container" in the same sense a list is. The individual characters in a string don't exist as distinct objects before iteration creates those objects.
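A quick sketch of the distinction (note that, as the longer answer above details, CPython caches one-character strings below 256, so the "new object" is often fetched from a cache rather than freshly allocated):

```python
lst = [object(), object()]
# A list's iterator hands back the very objects stored in the list.
assert next(iter(lst)) is lst[0]

s = "abc"
# A string's iterator hands back one-character string objects, which
# are distinct from the string itself (in CPython, characters below
# 256 come from a cache rather than being freshly allocated).
first = next(iter(s))
assert first == "a" and first is not s
print("lists yield contained objects; strings yield 1-char strings")
```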