Efficient way to convert a string to a ctypes.c_ubyte array in Python

Question

I have a 20-byte string that I would like to convert to a ctypes.c_ubyte array for bit-field manipulation.

 import ctypes
 str_bytes = '01234567890123456789'
 byte_arr = bytearray(str_bytes)
 raw_bytes = (ctypes.c_ubyte*20)(*(byte_arr))


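Is there a way to avoid the deep copy from str to bytearray that is made only for the sake of the cast?

Alternatively, is it possible to convert a string to a bytearray without a deep copy (using a technique such as memoryview)?

I am using Python 2.7.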
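For readers trying the memoryview idea today, here is a minimal sketch. It assumes Python 3, where the byte string is a `bytes` object, so it is not the 2.7 code from the question. It shows that a memoryview over immutable data is zero-copy but read-only, that `from_buffer` refuses a read-only source, and that `from_buffer` on a `bytearray` shares storage instead of copying.

```python
import ctypes

data = b'01234567890123456789'   # Python 3 bytes, standing in for the 2.7 str

# A memoryview is a zero-copy, read-only window onto the immutable bytes.
view = memoryview(data)
is_readonly = view.readonly
first_byte = view[0]             # integer code of the first character

# from_buffer() shares memory, so it insists on a writable buffer.
try:
    (ctypes.c_ubyte * len(data)).from_buffer(data)
    rejected = False
except TypeError:
    rejected = True

# With a bytearray, the ctypes array and the buffer share the same storage.
buf = bytearray(data)            # the only copy made in this sketch
shared = (ctypes.c_ubyte * len(buf)).from_buffer(buf)
shared[0] = ord('a')             # the write is visible through buf as well
```

So a true zero-copy ctypes array needs a mutable source such as a bytearray; starting from an immutable string, at least one copy seems unavoidable if the result has to be writable.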

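Performance results:

Using the suggestions from eryksun and Brian Larsen, here are benchmarks run in a VirtualBox VM with Ubuntu 12.04 and Python 2.7. Each method is timed over 1,000,000 iterations.

method1 uses the code from my original post
method2 uses ctypes from_buffer_copy
method3 uses ctypes cast/POINTER
method4 uses numpy

Results:

method1 takes 3.87 seconds
method2 takes 0.42 seconds
method3 takes 1.44 seconds
method4 takes 8.79 seconds

Code: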

import ctypes
import time
import numpy

str_bytes = '01234567890123456789'

def method1():
    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        byte_arr = bytearray(str_bytes)
        result = (ctypes.c_ubyte*20)(*(byte_arr))

    t1 = time.clock()
    print(t1-t0)

    return result

def method2():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

    t1 = time.clock()
    print(t1-t0)

    return result

def method3():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = ctypes.cast(str_bytes, ctypes.POINTER(ctypes.c_ubyte * 20))[0]

    t1 = time.clock()
    print(t1-t0)

    return result

def method4():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        arr = numpy.asarray(str_bytes)
        result = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))

    t1 = time.clock()
    print(t1-t0)

    return result

print(method1())
print(method2())
print(method3())
print(method4())


Answer 1

eryksun 8

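I don't think that works the way you think it does. bytearray creates a copy of the string. The interpreter then unpacks the bytearray sequence into a starargs tuple and merges it into another new tuple that holds the other args (even though there are none in this case). Finally, the c_ubyte array initializer loops over the args tuple to set the elements of the c_ubyte array. That is a lot of work, and a lot of copying, just to initialize the array.

Instead you can use the from_buffer_copy method, assuming the string is a byte string with the buffer interface (not unicode):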

import ctypes    
str_bytes = '01234567890123456789'
raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

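As a rough Python 3 counterpart of the call above (an assumption on my part, since the original targets 2.7): text has to be encoded to `bytes` first, and the array type can be sized from the data. The names `text`, `data`, `arr_t`, `head` and `oversized_ok` are only illustrative.

```python
import ctypes

text = '01234567890123456789'
data = text.encode('ascii')          # from_buffer_copy needs bytes, not str

arr_t = ctypes.c_ubyte * len(data)   # size the array type from the data
raw_bytes = arr_t.from_buffer_copy(data)

# The array owns its own storage, so writing to it leaves `data` untouched.
raw_bytes[0] = ord('a')
head = bytes(raw_bytes[:4])          # first four bytes of the modified copy

# A source buffer smaller than the array type is rejected with ValueError.
try:
    (ctypes.c_ubyte * 32).from_buffer_copy(data)
    oversized_ok = True
except ValueError:
    oversized_ok = False
```

Because the array is an independent copy, it can be modified freely for bit-field work without touching the original string.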

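This still has to copy the string, but it does so only once, and far more efficiently. As was noted in the comments, a Python string is immutable and may be interned or used as a dict key. Its immutability should be respected, even though ctypes lets you violate it in practice: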

>>> from ctypes import *
>>> s = '01234567890123456789'
>>> b = cast(s, POINTER(c_ubyte * 20))[0]
>>> b[0] = 97
>>> s
'a1234567890123456789'


Edit

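I need to stress that I do not recommend using ctypes to modify an immutable CPython string. If you must, then at the very least check sys.getrefcount beforehand to make sure the reference count is 2 or less (the call itself adds 1). Otherwise you will eventually be surprised by string interning for names (e.g. "sys") and for code object constants. Python is free to reuse immutable objects. If you step outside the language to mutate an "immutable" object, you have broken the contract.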
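To make the interning point concrete, here is a small sketch (Python 3 assumed; `name`, `same_object` and `count` are illustrative names). It only inspects objects and mutates nothing. The exact reference count is a CPython implementation detail and differs between versions, so the sketch does not state a value for it.

```python
import sys

# Build the string at run time so it is not simply a compile-time constant.
name = sys.intern(''.join(['sy', 's']))

# Interning returns the one shared object that other code using the
# name "sys" also refers to.
same_object = name is sys.intern('sys')

# The count includes the temporary reference made by the call itself.
# For an interned name it is usually far above 2, which is the signal
# that an in-place modification would leak into unrelated code.
count = sys.getrefcount(name)
```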

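For example, if you modify a string that has already been hashed, the cached hash is no longer correct for the contents. That breaks its use as a dict key. Neither another string with the new contents nor one with the original contents will match the key in the dict. The former has a different hash, and the latter has a different value. The only way to reach the dict item is then through the mutated string, which carries the incorrect hash. Continuing the previous example: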

>>> s
'a1234567890123456789'
>>> d = {s: 1}
>>> d[s]
1

>>> d['a1234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1234567890123456789'

>>> d['01234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '01234567890123456789'

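The same failure can be reproduced without touching a real string. The sketch below uses `CachedKey`, a hypothetical class (not part of Python or ctypes) that mimics a str by caching its hash on first use while keeping its content mutable.

```python
class CachedKey:
    """Hypothetical stand-in for a str: content plus a hash cached on first use."""

    def __init__(self, content):
        self.content = bytearray(content)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(bytes(self.content))
        return self._hash

    def __eq__(self, other):
        return isinstance(other, CachedKey) and self.content == other.content


key = CachedKey(b'0123')
table = {key: 1}             # inserting the key caches the hash of b'0123'
key.content[0] = ord('a')    # mutate behind the dict's back; the hash is now stale

found_by_mutated = table.get(key)               # same object, stale hash still finds the slot
found_by_new = table.get(CachedKey(b'a123'))    # new contents: hash does not match
found_by_old = table.get(CachedKey(b'0123'))    # old contents: hash matches, equality fails
```

Only the lookup through the mutated object succeeds; the other two return None, mirroring the two KeyErrors above.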

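Now consider the mess if the key is an interned string that is reused in dozens of places.

For performance analysis it is typical to use the timeit module. Prior to 3.3, timeit.default_timer varies by platform. On POSIX systems it is time.time, and on Windows it is time.clock.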

import timeit

setup = r'''
import ctypes, numpy
str_bytes = '01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
  'arr_t(*bytearray(str_bytes))',
  'arr_t.from_buffer_copy(str_bytes)',
  'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
  'numpy.asarray(str_bytes).ctypes.data_as('
      'ctypes.POINTER(arr_t))[0]',
]

test = lambda m: min(timeit.repeat(m, setup))


>>> tabs = [test(m) for m in methods]
>>> trel = [t / tabs[0] for t in tabs]
>>> trel
[1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]

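For anyone rerunning this comparison on Python 3, a reduced sketch follows. It assumes `bytes` input, leaves out the numpy variant to stay dependency-free, and uses a small `number` so it finishes quickly. `best_time` is a hypothetical helper name. From 3.3 onward timeit.default_timer is time.perf_counter on every platform. Timings depend on the machine, so no figures are quoted.

```python
import timeit

setup = r'''
import ctypes
data = b'01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
    'arr_t(*bytearray(data))',
    'arr_t.from_buffer_copy(data)',
    'ctypes.cast(data, ctypes.POINTER(arr_t))[0]',
]

def best_time(stmt, number=10000, repeat=3):
    """Return the fastest of `repeat` runs, each executing stmt `number` times."""
    return min(timeit.repeat(stmt, setup, repeat=repeat, number=number))

tabs = [best_time(m) for m in methods]
trel = [t / tabs[0] for t in tabs]   # times relative to the first method
```

Taking the minimum of several repeats, as the answer does, reduces the influence of background load on the measurement.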

This is very useful! Thanks (2 upvotes)
