I have a sqlite database full of a huge number of URLs, and it takes up a huge amount of disk space; accessing it causes many disk seeks and is slow. The average URL path length is 97 bytes (the hostnames repeat a lot, so I moved them out into a foreign-key table). Is there any good way of compressing them? Most compression algorithms work well on large documents, not on "documents" that average less than 100 bytes, but even a 20% reduction would be very useful. Are there any compression algorithms that would work? It doesn't have to be anything standard.
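For reference, a minimal sketch of the kind of split described above (all table and column names here are assumptions, not taken from the question): hostnames are deduplicated into their own table, and each URL row keeps only the path plus a foreign key to its host.

import sqlite3

conn = sqlite3.connect("urls.db")  # hypothetical database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS hosts (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS urls (
    id      INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL REFERENCES hosts(id),
    path    TEXT NOT NULL  -- about 97 bytes on average; this is the part worth compressing
);
""")
conn.commit()
conn.close()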
Use a compression algorithm, but with a shared dictionary.
I have done something like this before, using the LZC/LZW algorithm as used by the Unix compress command.
The trick to getting good compression with short strings is to use a dictionary made up of a standard sample of the URLs you are compressing.
You should easily get 20%.
Edit: LZC is a variant of LZW. You only need LZW, because all you need is a static dictionary; LZC adds support for resetting the dictionary/table once it fills up.
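The same shared-dictionary idea also works with DEFLATE instead of LZW: Python 3.3+ exposes zlib's preset-dictionary support directly through the zdict argument of compressobj and decompressobj. A minimal sketch, where the dictionary string is just a made-up sample of common URL fragments:

import zlib

# Seed the dictionary with substrings that occur often in your data;
# zlib prefers the most frequent material near the END of the dictionary.
url_dict = b"/images//static/css/?page=&id=/wp-content/uploads//article/2009/10/index.html"

def compress_url(path):
    c = zlib.compressobj(level=zlib.Z_BEST_COMPRESSION, zdict=url_dict)
    return c.compress(path) + c.flush()

def decompress_url(blob):
    # the decompressor must be given the exact same dictionary bytes
    d = zlib.decompressobj(zdict=url_dict)
    return d.decompress(blob) + d.flush()

packed = compress_url(b"/article/2009/11/some-long-post-title.html")
assert decompress_url(packed) == b"/article/2009/11/some-long-post-title.html"

The dictionary has to be available at decompression time, so it needs to be stored (or hard-coded) alongside the database.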
I have tried this with the following strategy. It uses a shared dictionary, but works around the fact that (older versions of) Python's zlib don't give you access to the dictionary itself.
First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them, and throw away the output.
Then, compress every small string with a copy of the trained compressor, and decompress them with a copy of the trained decompressor.
Here is my Python code (with apologies for the ugly testing):
import zlib

class Trained_short_string_compressor(object):
    def __init__(self,
                 training_set,
                 bits = -zlib.MAX_WBITS,
                 compression = zlib.Z_DEFAULT_COMPRESSION,
                 scheme = zlib.DEFLATED):
        # Use a negative number of bits, so the checksum is not included.
        compressor = zlib.compressobj(compression,scheme,bits)
        decompressor = zlib.decompressobj(bits)

        junk_offset = 0
        for line in training_set:
            junk_offset += len(line)
            # run the training line through the compressor and decompressor
            junk_offset -= len(decompressor.decompress(compressor.compress(line)))

        # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and
        # not flushing wastes space.
        junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))

        self.junk_offset = junk_offset
        self.compressor = compressor
        self.decompressor = decompressor

    def compress(self,s):
        # compress with a copy, so the trained state is never consumed
        compressor = self.compressor.copy()
        return compressor.compress(s)+compressor.flush()

    def decompress(self,s):
        # likewise decompress with a copy, then drop any leftover training output
        decompressor = self.decompressor.copy()
        return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]
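For reference, a tiny usage sketch (the sample strings below are made up): train on a handful of URL paths, then round-trip a new one with copies of the trained objects.

sample = [b'/article/2009/10/some-post', b'/images/header.png', b'/index.html']
c = Trained_short_string_compressor(sample)
packed = c.compress(b'/article/2009/11/another-post')
assert c.decompress(packed) == b'/article/2009/11/another-post'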
Testing it on a bunch of 10,000 short (50 to 300 character) unicode strings, it saved over 30%. Compressing and decompressing them also took about 6 seconds (compared to about 2 seconds for simple zlib compression/decompression). On the other hand, simple zlib compression only saved about 5%, not 30%.
import gzip

def test_compress_small_strings():
    # fname is a placeholder: a path to a gzipped file of sample lines (one URL per line)
    lines = [l for l in gzip.open(fname)]
    compressor = Trained_short_string_compressor(lines[:500])

    import time
    t = time.time()
    s = 0.0
    sc = 0.
    for i in range(10000):
        line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc += len(cl)
        assert line == ucl

    print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',sc/s

    print 'now, compare it to a naive compression'
    t = time.time()
    sc = 0.  # reset the compressed-size counter for the naive run
    for i in range(10000):
        line = lines[1000+i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
        cl = cr.compress(line)+cr.flush()
        ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl

    print 'naive zlib compressed',i,'small strings in',time.time()-t,'with a ratio of',sc/s
Note that it is not persistent if you delete it. If you wanted persistence, you would have to remember the training set.
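One way to do that (the table name below is just an assumption) is to keep the raw training lines in their own table and rebuild the trained objects from them on startup:

import sqlite3

def save_training_set(conn, lines):
    # lines: the same byte strings that were used to train the compressor
    conn.execute("CREATE TABLE IF NOT EXISTS training_sample (line BLOB)")
    conn.executemany("INSERT INTO training_sample (line) VALUES (?)",
                     [(l,) for l in lines])
    conn.commit()

def load_trained_compressor(conn):
    lines = [row[0] for row in conn.execute("SELECT line FROM training_sample")]
    return Trained_short_string_compressor(lines)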