如何压缩URL参数

Jan*_*rdt 55 javascript browser compression url single-page-application

假设我有一个单页应用程序,它使用第三方API作为内容.应用程序的逻辑仅在浏览器中,并且没有我可以写入的后端.

为了允许深度链接到应用程序的状态,我使用pushState来跟踪确定应用程序状态的一些变量(注意Ubersicht的公共版本还没有这样做).在这种情况下repos,labels,milestonesusername,show_open(布尔)和with_comments(布尔)和without_comments(布尔).URL格式是?label=label_1,label_2,label_3&repos=repo_1….值通常是嫌疑人,大致[a-zA-Z][a-zA-Z0-9_-]或任何布尔指标.

到现在为止还挺好.现在,因为查询字符串可能有点长而且不实用,并且我希望能够传递类似http://espy.github.io/ubersicht/?state=SOMOPAQUETOKENTHATLOSSLESSLYDECOMPRESSESINTOTHEORIGINALVALUES#hoodiehq的URL ,越短越好.

我的第一次尝试是使用一些类似zlib的算法(https://github.com/imaya/zlib.js)和@flipzagging指向antirez/smaz(https // github.com/antirez/smaz)听起来更适合短字符串(JavaScript版本在https://github.com/personalcomputer/smaz.js).

由于=&没有具体处理https://github.com/personalcomputer/smaz.js/blob/master/lib/smaz.js#L9,我们也许能够调整的东西有一点点.

此外,还有一个选项可以对固定表中的值进行编码,例如,参数的顺序是预定义的,我们需要跟踪的是实际值.例如,可能在smaz压缩之前a=hamster&b=cat变成7hamster3cat(长度+字符)或仓鼠|猫(值+ |).

还有什么我应该找的吗?

kur*_*eko 47

一种工作解决方案,将各种好的(或者我认为)想法放在一起

我这样做很有趣,主要是因为它让我有机会在PHP中实现一个霍夫曼编码器,我找不到令人满意的现有实现.

但是,如果您计划探索类似的路径,这可能会节省您一些时间.

Burrows-Wheeler +前移+霍夫曼变换

我不太确定BWT最适合你的输入.
这不是常规文本,因此重复出现的模式可能不会像源代码或普通英语那样频繁出现.

此外,动态霍夫曼代码必须与编码数据一起传递,对于非常短的输入字符串,这将严重损害压缩增益.

我可能错了,在这种情况下,我很乐意看到有人证明我是.

无论如何,我决定尝试另一种方法.

General principle

1) define a structure for your URL parameters and strip the constant part

for instance, starting from:

repos=aaa,bbb,ccc&
labels=ddd,eee,fff&
milestones=ggg,hhh,iii&
username=kkk&
show_open=0&
show_closed=1&
show_commented=1&
show_uncommented=0
Run Code Online (Sandbox Code Playgroud)

extract:

aaa,bbb,ccc|ddd,eee,fff|ggg,hhh,iii|kkk|0110
Run Code Online (Sandbox Code Playgroud)

where , and | act as string and/or field terminators, while boolean values don't need any.

2) define a static repartition of symbols based on the expected average input and derive a static Huffman code

Since transmitting a dynamic table would take more space than your initial string, I think the only way to achhieve any compression at all is to have a static huffman table.

However, you can use the structure of your data to your advantage to compute reasonable probabilities.

You can start with the repartition of letters in English or other languages and throw in a certain percentage of numbers and other punctuation signs.

使用动态霍夫曼编码进行测试,我看到压缩率为30%到50%.

这意味着使用静态表可能会产生.6压缩因子(将数据的长度减少1/3),而不是更多.

3)将这个二进制霍夫曼代码转换成URI可以处理的东西

该列表中有70个常规ASCII 7位字符

!'()*-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Run Code Online (Sandbox Code Playgroud)

会给你一个大约30%的扩展因子,几乎不比base64编码好.

30%的扩张将破坏静态霍夫曼压缩的收益,所以这几乎不是一个选择!

但是,由于您控制编码客户端和服务器端,因此您可以使用任何不是URI保留字符的内容.

An interesting possiblity would be to complete the above set up to 256 with whatever unicode glyphs, which would allow to encode your binary data with the same number of URI-compliant characters, thus replacing a painful and slow bunch of long integer divisions with a lightning fast table lookup.

Structure description

The codec is meant to be used both client and server side, so it is essential that server and clients share a common data structure definition.

Since the interface is likely to evolve, it seems wise to store a version number for upward compatibility.

The interface definition will use a very minimalistic description language, like so:

v   1               // version number (between 0 and 63)
a   en              // alphabet used (English)
o   10              // 10% of digits and other punctuation characters
f   1               // 1% of uncompressed "foreign" characters
s 15:3 repos        // list of expeced 3 strings of average length 15
s 10:3 labels
s 8:3  milestones
s 10   username     // single string of average length 10
b      show_open    // boolean value
b      show_closed
b      show_commented
b      show_uncommented
Run Code Online (Sandbox Code Playgroud)

Each language supported will have a frequency table for all its used letters

digits and other computerish symbols like -, . or _ will have a global frequency, regardless of languages

separators (, and |) frequencies will be computed according to the number of lists and fields present in the structure.

All other "foreign" characters will be escaped with a specific code and encoded as plain UTF-8.

Implementation

The bidirectional conversion path is as follows:

list of fields <-> UTF-8 data stream <-> huffman codes <-> URI

Here is the main codec

include ('class.huffman.codec.php');
class IRI_prm_codec
{
    // available characters for IRI translation
    static private $translator = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ??????????????????????????????????????????????????????????????????????????????????Œœ????????????Šš??????????????????????Ÿ????Žž???????";

    const VERSION_LEN = 6; // version number between 0 and 63

    // ========================================================================
    // constructs an encoder
    // ========================================================================
    public function __construct ($config)
    {
        $num_record_terminators = 0;
        $num_record_separators = 0;
        $num_text_sym = 0;

        // parse config file
        $lines = file($config, FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($lines as $line)
        {
            list ($code, $val) = preg_split('/\s+/', $line, 2);
            switch ($code)
            {
            case 'v': $this->version = intval($val); break;
            case 'a': $alphabet = $val; break;
            case 'o': $percent_others = $val; break;
            case 'f': $percent_foreign = $val; break;
            case 'b':
                $this->type[$val] = 'b';
                break;
            case 's':
                list ($val, $field) = preg_split('/\s+/u', $val, 2);
                @list ($len,$num) = explode (':', $val);
                if (!$num) $num=1;
                $this->type[$field] = 's';
                $num_record_terminators++;
                $num_record_separators+=$num-1;
                $num_text_sym += $num*$len;
                break;

            default: throw new Exception ("Invalid config parameter $code");
            }
        }

        // compute symbol frequencies           
        $total = $num_record_terminators + $num_record_separators + $num_text_sym + 1;

        $num_chars = $num_text_sym * (100-($percent_others+$percent_foreign))/100;
        $num_sym = $num_text_sym * $percent_others/100;
        $num_foreign = $num_text_sym * $percent_foreign/100;

        $this->get_frequencies ($alphabet, $num_chars/$total);
        $this->set_frequencies (" .-_0123456789", $num_sym/$total);
        $this->set_frequencies ("|", $num_record_terminators/$total);
        $this->set_frequencies (",", $num_record_separators/$total);
        $this->set_frequencies ("\1", $num_foreign/$total);
        $this->set_frequencies ("\0", 1/$total);

        // create Huffman codec
        $this->huffman = new Huffman_codec();
        $this->huffman->make_code ($this->frequency);
    }

    // ------------------------------------------------------------------------
    // grab letter frequencies for a given language
    // ------------------------------------------------------------------------
    private function get_frequencies ($lang, $coef)
    {
        $coef /= 100;
        $frequs = file("$lang.dat", FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($frequs as $line)
        {
            $vals = explode (" ", $line);
            $this->frequency[$vals[0]] = floatval ($vals[1]) * $coef;
        }
    }

    // ------------------------------------------------------------------------
    // set a given frequency for a group of symbols
    // ------------------------------------------------------------------------
    private function set_frequencies ($symbols, $coef)
    {
        $coef /= strlen ($symbols);
        for ($i = 0 ; $i != strlen($symbols) ; $i++) $this->frequency[$symbols[$i]] = $coef;
    }

    // ========================================================================
    // encodes a parameter block
    // ========================================================================
    public function encode($input)
    {
        // get back input values
        $bools = '';
        foreach (get_object_vars($input) as $prop => $val)
        {
            if (!isset ($this->type[$prop])) throw new Exception ("unknown property $prop");
            switch ($this->type[$prop])
            {
            case 'b': $bools .= $val ? '1' : '0'; break;
            case 's': $strings[] = $val; break;
            default: throw new Exception ("Uh oh... type ".$this->type[$prop]." not handled ?!?");
            }
        }

        // set version number and boolean values in front
        $prefix = sprintf ("%0".self::VERSION_LEN."b$bools", $this->version);

        // pass strings to our Huffman encoder
        $strings = implode ("|", $strings);
        $huff = $this->huffman->encode ($strings, $prefix, "UTF-8");

        // translate into IRI characters
        mb_internal_encoding("UTF-8");
        $res = '';
        for ($i = 0 ; $i != strlen($huff) ; $i++) $res .= mb_substr (self::$translator, ord($huff[$i]), 1);

        // done
        return $res;
    }

    // ========================================================================
    // decodes an IRI string into a lambda object
    // ========================================================================
    public function decode($input)
    {
        // convert IRI characters to binary
        mb_internal_encoding("UTF-8");
        $raw = '';
        $len = mb_strlen ($input);
        for ($i = 0 ; $i != $len ; $i++)
        {
            $c = mb_substr ($input, 0, 1);
            $input = mb_substr ($input, 1);
            $raw .= chr(mb_strpos (self::$translator, $c));
        }

        $this->bin = '';        

        // check version
        $version = $this->read_bits ($raw, self::VERSION_LEN);
        if ($version != $this->version) throw new Exception ("Version mismatch: expected {$this->version}, found $version");

        // read booleans
        foreach ($this->type as $field => $type)
            if ($type == 'b')
                $res->$field = $this->read_bits ($raw, 1) != 0;

        // decode strings
        $strings = explode ('|', $this->huffman->decode ($raw, $this->bin));
        $i = 0;
        foreach ($this->type as $field => $type) 
            if ($type == 's')
                $res->$field = $strings[$i++];

        // done
        return $res;
    }

    // ------------------------------------------------------------------------
    // reads raw bit blocks from a binary string
    // ------------------------------------------------------------------------
    private function read_bits (&$raw, $len)
    {
        while (strlen($this->bin) < $len)
        {
            if ($raw == '') throw new Exception ("premature end of input"); 
            $this->bin .= sprintf ("%08b", ord($raw[0]));
            $raw = substr($raw, 1);
        }
        $res = bindec (substr($this->bin, 0, $len));
        $this->bin = substr ($this->bin, $len);
        return $res;
    }
}
Run Code Online (Sandbox Code Playgroud)

The underlying Huffman codec

include ('class.huffman.dict.php');

class Huffman_codec
{
    public  $dict = null;

    // ========================================================================
    // encodes a string in a given string encoding (default: UTF-8)
    // ========================================================================
    public function encode($input, $prefix='', $encoding="UTF-8")
    {
        mb_internal_encoding($encoding);
        $bin = $prefix;
        $res = '';
        $input .= "\0";
        $len = mb_strlen ($input);
        while ($len--)
        {
            // get next input character
            $c = mb_substr ($input, 0, 1);
            $input = substr($input, strlen($c)); // avoid playing Schlemiel the painter

            // check for foreign characters
            if (isset($this->dict->code[$c]))
            {
                // output huffman code
                $bin .= $this->dict->code[$c];
            }
            else // foreign character
            {
                // escape sequence
                $lc = strlen($c);
                $bin .= $this->dict->code["\1"] 
                     . sprintf("%02b", $lc-1); // character length (1 to 4)

                // output plain character
                for ($i=0 ; $i != $lc ; $i++) $bin .= sprintf("%08b", ord($c[$i]));
            }

            // convert code to binary
            while (strlen($bin) >= 8)
            {
                $res .= chr(bindec(substr ($bin, 0, 8)));
                $bin = substr($bin, 8);
            }
        }

        // output last byte if needed
        if (strlen($bin) > 0)
        {
            $bin .= str_repeat ('0', 8-strlen($bin));
            $res .= chr(bindec($bin));
        }

        // done
        return $res;
    }

    // ========================================================================
    // decodes a string (will be in the string encoding used during encoding)
    // ========================================================================
    public function decode($input, $prefix='')
    {
        $bin = $prefix;
        $res = '';
        $len = strlen($input);
        for ($i=0 ;;)
        {
            $c = $this->dict->symbol($bin);

            switch ((string)$c)
            {
            case "\0": // end of input
                break 2;

            case "\1": // plain character

                // get char byte size
                if (strlen($bin) < 2)
                {
                    if ($i == $len) throw new Exception ("incomplete escape sequence"); 
                    $bin .= sprintf ("%08b", ord($input[$i++]));
                }
                $lc = 1 + bindec(substr($bin,0,2));
                $bin = substr($bin,2);
                // get char bytes
                while ($lc--)
                {
                    if ($i == $len) throw new Exception ("incomplete escape sequence"); 
                    $bin .= sprintf ("%08b", ord($input[$i++]));
                    $res .= chr(bindec(substr($bin, 0, 8)));
                    $bin = substr ($bin, 8);
                }
                break;

            case null: // not enough bits do decode further

                // get more input
                if ($i == $len) throw new Exception ("no end of input mark found"); 
                $bin .= sprintf ("%08b", ord($input[$i++]));
                break;

            default:  // huffman encoded

                $res .= $c;
                break;          
            }
        }

        if (bindec ($bin) != 0) throw new Exception ("trailing bits in input");
        return $res;
    }

    // ========================================================================
    // builds a huffman code from an input string or frequency table
    // ========================================================================
    public function make_code ($input, $encoding="UTF-8")
    {
        if (is_string ($input))
        {
            // make dynamic table from the input message
            mb_internal_encoding($encoding);
            $frequency = array();
            while ($input != '')
            {
                $c = mb_substr ($input, 0, 1);
                $input = mb_substr ($input, 1);
                if (isset ($frequency[$c])) $frequency[$c]++; else $frequency[$c]=1;
            }
            $this->dict = new Huffman_dict ($frequency);
        }
        else // assume $input is an array of symbol-indexed frequencies
        {
            $this->dict = new Huffman_dict ($input);
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

And the huffman dictionary

class Huffman_dict
{
    public  $code = array();

    // ========================================================================
    // constructs a dictionnary from an array of frequencies indexed by symbols
    // ========================================================================
    public function __construct ($frequency = array())
    {
        // add terminator and escape symbols
        if (!isset ($frequency["\0"])) $frequency["\0"] = 1e-100;
        if (!isset ($frequency["\1"])) $frequency["\1"] = 1e-100;

        // sort symbols by increasing frequencies
        asort ($frequency);

        // create an initial array of (frequency, symbol) pairs
        foreach ($frequency as $symbol => $frequence) $occurences[] = array ($frequence, $symbol);

        while (count($occurences) > 1)
        {
            $leaf1 = array_shift($occurences);
            $leaf2 = array_shift($occurences);
            $occurences[] = array($leaf1[0] + $leaf2[0], array($leaf1, $leaf2));
            sort($occurences);
        }
        $this->tree = $this->build($occurences[0], '');

    }

    // -----------------------------------------------------------
    // recursive build of lookup tree and symbol[code] table
    // -----------------------------------------------------------
    private function build ($node, $prefix)
    {
        if (is_array($node[1]))
        {
            return array (
                '0' => $this->build ($node[1][0], $prefix.'0'),
                '1' => $this->build ($node[1][1], $prefix.'1'));
        }
        else
        {
            $this->code[$node[1]] = $prefix;
            return $node[1];
        }
    }

    // ===========================================================
    // extracts a symbol from a code stream
    // if found     : updates code stream and returns symbol
    // if not found : returns null and leave stream intact
    // ===========================================================
    public function symbol(&$code_stream)
    {
        list ($symbol, $code) = $this->get_symbol ($this->tree, $code_stream);
        if ($symbol !== null) $code_stream = $code;
        return $symbol;
    }

    // -----------------------------------------------------------
    // recursive search for a symbol from an huffman code
    // -----------------------------------------------------------
    private function get_symbol ($node, $code)
    {
        if (is_array($node))
        {
            if ($code == '') return null;
            return $this->get_symbol ($node[$code[0]], substr($code, 1));
        }
        return array ($node, $code);
    }
}
Run Code Online (Sandbox Code Playgroud)

Example

include ('class.iriprm.codec.php');

$iri = new IRI_prm_codec ("config.txt");
foreach (array (
    'repos' => "discussion,documentation,hoodie-cli",
    'labels' => "enhancement,release-0.3.0,starter",
    'milestones' => "1.0.0,1.1.0,v0.7",
    'username' => "mklappstuhl",
    'show_open' => false,
    'show_closed' => true,
    'show_commented' => true,
    'show_uncommented' => false
) as $prop => $val) $iri_prm->$prop = $val;

$encoded = $iri->encode ($iri_prm);
echo "encoded as $encoded\n";
$decoded = $iri->decode ($encoded);
var_dump($decoded);
Run Code Online (Sandbox Code Playgroud)

output:

encoded as 5???Ê?CO??????m?ZEÇŽÉ?óšüÿj??ÅìÇ?O?ä?Ï?í?É?QmìFOyä??qæŠ??Í?Æ??Ë?

object(stdClass)#7 (8) {
  ["show_open"]=>
  bool(false)
  ["show_closed"]=>
  bool(true)
  ["show_commented"]=>
  bool(true)
  ["show_uncommented"]=>
  bool(false)
  ["repos"]=>
  string(35) "discussion,documentation,hoodie-cli"
  ["labels"]=>
  string(33) "enhancement,release-0.3.0,starter"
  ["milestones"]=>
  string(16) "1.0.0,1.1.0,v0.7"
  ["username"]=>
  string(11) "mklappstuhl"
}
Run Code Online (Sandbox Code Playgroud)

In that example, the input got packed into 64 unicode characters, for an input length of about 100, yielding a 1/3 reduction.

An equivalent string:

discussion,documentation,hoodie-cli|enhancement,release-0.3.0,starter|
1.0.0,1.1.0,v0.7|mklappstuhl|0110
Run Code Online (Sandbox Code Playgroud)

Would be compressed by a dynamic Huffman table to 59 characters. Not much of a difference.

No doubt smart data reordering would reduce that, but then you would need to pass the dynamic table along...

Chinese to the rescue?

Drawing on ttepasse's idea, one could take advantage of the huge number of Asian characters to find a range of 0x4000 (12 bits) contiguous values, to code 3 bytes into 2 CJK characters, like so:

    // translate into IRI characters
    $res = '';
    $len = strlen ($huff);
    for ($i = 0 ; $i != $len ; $i++)
    {
        $byte = ord($huff[$i]);
        $quartet[2*$i  ] = $byte >> 4;
        $quartet[2*$i+1] = $byte &0xF;
    }
    $len *= 2;
    while ($len%3 != 0) $quartet[$len++] = 0;
    $len /= 3;
    for ($i = 0 ; $i != $len ; $i++)
    {
        $utf16 = 0x4E00 // CJK page base, enough range for 2**12 (0x4000) values
               + ($quartet[3*$i+0] << 8)
               + ($quartet[3*$i+1] << 4)
               + ($quartet[3*$i+2] << 0);
        $c = chr ($utf16 >> 8) . chr ($utf16 & 0xFF);
        $res .= $c;
    }
    $res = mb_convert_encoding ($res, "UTF-8", "UTF-16");
Run Code Online (Sandbox Code Playgroud)

and back:

    // convert IRI characters to binary
    $input = mb_convert_encoding ($input, "UTF-16", "UTF-8");
    $len = strlen ($input)/2;
    for ($i = 0 ; $i != $len ; $i++)
    {
        $val = (ord($input[2*$i  ]) << 8) + ord ($input[2*$i+1]) - 0x4E00;
        $quartet[3*$i+0] = ($val >> 8) &0xF;
        $quartet[3*$i+1] = ($val >> 4) &0xF;
        $quartet[3*$i+2] = ($val >> 0) &0xF;
    }
    $len *= 3;
    while ($len %2) $quartet[$len++] = 0;
    $len /= 2;
    $raw = '';
    for ($i = 0 ; $i != $len ; $i++)
    {
        $raw .= chr (($quartet[2*$i+0] << 4) + $quartet[2*$i+1]);
    }
Run Code Online (Sandbox Code Playgroud)

The previous output of 64 Latin chars

5???Ê?CO??????m?ZEÇŽÉ?óšüÿj??ÅìÇ?O?ä?Ï?í?É?QmìFOyä??qæŠ??Í?Æ??Ë?
Run Code Online (Sandbox Code Playgroud)

would "shrink" to 42 Asian characters:

??????????????????????????????????????????
Run Code Online (Sandbox Code Playgroud)

However, as you can see, the sheer bulk of your average ideogram makes the string actually longer (pixel-wise), so even if the idea was promising, the outcome is rather disappointing.

Picking thinner glyphs

On the other hand, you can try to pick "thin" characters as a base for URI encoding. For instance:

?????????iï??????????l???????????????????????????????;???;???i???
Run Code Online (Sandbox Code Playgroud)

instead of

?5???Ê?CO??????m?ZEÇŽÉ?óšüÿj??ÅìÇ?O?ä?Ï?í?É?QmìFOyä??qæŠ??Í?Æ??Ë??
Run Code Online (Sandbox Code Playgroud)

That will shrink the length by half with proportional fonts, including in a browser address bar.

My best candidate set of 256 "thin" glyphs so far:

??????????????????'???????|¦??????????????????????????????????????,. ???????????????????????????????????????????????????????????????????ª??/:;\ijltìíîï?????????????????????????????????????????????????????????????????????????????????????????????????????????
Run Code Online (Sandbox Code Playgroud)

Conclusion

This implementation should be ported to JavaScript to allow client-server exchange.
You should also provide a way to share the structure and Huffman codes with the clients.

It is not difficult and rather fun to do, but that means even more work :).

The Huffman gain in term of characters is around 30%.

Of course these characters are multibyte for the most part, but if you aim for the shortest URI it does not matter.
Except for the booleans that can easily be packed to 1 bit, those pesky strings seem rather reluctant to be compressed.
It might be possible to better tune the frequencies, but I doubt you will get above 50% compression rate.

On the other hand, picking thin glyphs does actually more to shrink the string.

So all in all the combination of both might indeed achieve something, though it's a lot of work for a modest result.

  • 你的帖子也值得赏金 - 我会回来的. (3认同)

小智 17

就像你自己提出的那样,我会首先摆脱所有没有携带任何信息的角色,因为它们是"格式"的一部分.

例如,将"labels = open,ssl,cypher&repository = 275643&username = ryanbrg&milestones =&with_comment = yes"改为"open,ssl,cyper | 275643 | ryanbrg || yes".

然后使用具有固定概率向量的霍夫曼编码(导致从字符到可变长度位串的固定映射 - 最可能的字符映射到较短的位串,较不可能的字符映射到较长的位串).

您甚至可以为不同的参数使用不同的概率向量.例如,在参数"labels"中,字母字符的概率很高,但在"repository"参数中,数字字符的概率最高.如果你这样做,你应该考虑分隔符"|" 前面参数的一部分.

最后将长位串(这是字符所映射到的所有位串的串联)转换成可以通过base64url编码放入URL的内容.

如果您可以向我发送一组有代表性的参数列表,我可以通过Huffmann编码器运行它们以查看它们的压缩程度.

概率向量(或等效地从字符到位串的映射)应该被编码为发送到浏览器的Javascript函数中的常量数组.

当然,您可以更进一步 - 例如 - 尝试获取可能的标签列表及其概率.然后你可以用霍夫曼编码将整个标签映射到位串.这将为您提供更好的压缩,但是您将为那些新的标签(例如,回退到单个字符编码)进行额外的工作,当然还有映射(如上所述 - 是Javascript函数中的常量数组) )会更大.


tte*_*sse 13

我有一个狡猾的计划!(和一杯杜松子酒补品)

您似乎并不关心字节流的长度,而是关注结果字形的长度,例如显示给用户的字符串.

浏览器非常适合将IRI转换为底层[URI] [2],同时仍在地址栏中显示IRI.IRI拥有更多可能的角色,而你的可能角色则相当有限.

这意味着您可以将字符(aa,ab,ac,...,zz和特殊字符)的bigrams编码为完整unicode频谱的一个字符.假设你有80个可能的ASCII字符:两个字符的可能组合数是6400.在Unicodes指定的字符中很容易找到,例如在汉语统一的CJK频谱中:

aa  ?  ?
ab  ?  ?
ac  ?  ?
ad  ?  ?
…
Run Code Online (Sandbox Code Playgroud)

我选择了CJK,因为如果目标字符在unicode中分配并且在主浏览器和操作系统上分配了字形,那么这只是(略微)合理.出于这个原因,私人使用区域已经出局,使用三元组(其可能的组合可能使用所有Unicodes 1114112可能的代码点)的更高效版本已经出局.

回顾一下:基础字节仍然存在 - 并且 - 给定UTF-8编码 - 可能更长,但用户看到的显示字符串和副本缩短了50%.

好的,好的,原因,为什么这个解决方案是疯了:

  • IRI并不完美.与现代浏览器相比,许多较小的工具存在问题.

  • 该算法显然需要更多的工作.你需要一个将bigrams映射到目标字符和后面的函数.并且最好在算术上避免在内存中使用大哈希表.

  • 应该检查目标字符是否已分配,以及它们是否是简单的字符,而不是花哨的unicodian事物,例如组合字符或在Unicode规范化中丢失的东西.此外,如果目标区域是带有字形的指定字符的连续跨度.

  • 浏览器有时会对IRI持谨慎态度.有充分理由,鉴于IDN的同形异义攻击.在地址栏中是否可以使用所有这些非ASCII字符?

  • 而最大的问题是:人们在用他们不知道的脚本中记住字符时出了名.他们在尝试(重新)键入这些字符时更糟糕.并且copy'n'paste可能会在许多不同的点击中出错.URL缩短器使用Base64甚至更小的字母表是有原因的.

......说到:那将是我的解决方案.将缩短链接的工作卸载到用户或通过其API集成goo.gl或bit.ly.


jgb*_*jgb 10

为什么不使用协议缓冲区

协议缓冲区是一种灵活,高效,自动化的机制,用于序列化结构化数据 - 想想XML,但更小,更快,更简单.您可以定义数据的结构化结构,然后使用特殊生成的源代码轻松地将结构化数据写入和读取各种数据流,并使用各种语言.您甚至可以更新数据结构,而不会破坏根据"旧"格式编译的已部署程序.

ProtoBuf.js将对象转换为协议缓冲消息和副节点.

以下对象转换为: CgFhCgFiCgFjEgFkEgFlEgFmGgFnGgFoGgFpIgNqZ2I=

{
    repos : ['a', 'b', 'c'],
    labels: ['d', 'e', 'f'],
    milestones : ['g', 'h', 'i'],
    username : 'jgb'
}
Run Code Online (Sandbox Code Playgroud)

以下示例使用require.js构建.尝试一下这个jsfiddle.

require.config({
    paths : {
        'Math/Long'  : '//rawgithub.com/dcodeIO/Long.js/master/Long.min',
        'ByteBuffer' : '//rawgithub.com/dcodeIO/ByteBuffer.js/master/ByteBuffer.min',
        'ProtoBuf'   : '//rawgithub.com/dcodeIO/ProtoBuf.js/master/ProtoBuf.min'
    }
})

require(['message'], function(message) {
    var data = {
        repos : ['a', 'b', 'c'],
        labels: ['d', 'e', 'f'],
        milestones : ['g', 'h', 'i'],
        username : 'jgb'
    }

    var request = new message.arguments(data);

    // Convert request data to base64
    var base64String = request.toBase64();
    console.log(base64String);

    // Convert base64 back
    var decodedRequest = message.arguments.decode64(base64String);
    console.log(decodedRequest);
});

// Protobuf message definition
// Message definition could also be stored in a .proto definition file
// See: https://github.com/dcodeIO/ProtoBuf.js/wiki
define('message', ['ProtoBuf'], function(ProtoBuf) {
    var proto = {
        package : 'message',
        messages : [
            {
                name : 'arguments',
                fields : [
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'repos',
                        id : 1
                    },
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'labels',
                        id : 2
                    },
                    {
                        rule : 'repeated',
                        type : 'string',
                        name : 'milestones',
                        id : 3
                    },
                    {
                        rule : 'required',
                        type : 'string',
                        name : 'username',
                        id : 4
                    },
                    {
                        rule : 'optional',
                        type : 'bool',
                        name : 'with_comments',
                        id : 5
                    },
                    {
                        rule : 'optional',
                        type : 'bool',
                        name : 'without_comments',
                        id : 6
                    }
                ],
            }
        ]
    };

    return ProtoBuf.loadJson(proto).build('message')
});
Run Code Online (Sandbox Code Playgroud)

  • Bounty因在Java中实现了真正的实现而获得了奖励,该实现似乎在缩短数据并将其返回。 (3认同)
  • jsfiddle 坏了。获取 https://rawgit.com/dcodeIO/protobuf.js/master/dist/protobuf-light.min.js 的 404 (3认同)

tho*_*chs 9

小提示:无论parseIntNumber#toString支持基数参数.尝试使用36的基数来编码URL中的数字(或索引到列表中).


w00*_*00t 7

更新:我发布了一个包含更多优化的 NPM 包,请参阅https://www.npmjs.com/package/@yaska-eu/jsurl2

还有一些提示:

  • Base64 编码为a..zA..Z0..9+/=未编码的 URI 字符a..zA..Z0..9-_.~. 所以Base64编码的结果只需要调剂+/=-_.,它不会扩大的URI。
  • 您可以保留一个键名数组,以便可以用第一个字符作为数组中的偏移量来表示对象,例如{foo:3,bar:{g:'hi'}}成为a3,b{c'hi'}给定的键数组['foo','bar','g']

有趣的图书馆:

  • JSUrl专门对 JSON 进行编码,因此即使它使用的字符多于 RFC 中指定的字符,也可以将其放入 URL 中而无需更改。{"name":"John Doe","age":42,"children":["Mary","Bill"]}变成~(name~'John*20Doe~age~42~children~(~'Mary~'Bill))并带有['name','age','children']可能是的密钥字典~(0~'John*20Doe~1~42~2~(~'Mary~'Bill)),因此从 101 个字节的 URI 编码到 38 个。
    • 占用空间小,速度快,压缩合理。
  • lz-string使用基于 LZW 的算法将字符串压缩为 UTF16 以存储在 localStorage 中。它还具有compressToEncodedURIComponent()生成 URI 安全输出的功能。
    • 仍然只有几 KB 的代码,相当快,压缩得很好/很好。

所以基本上我建议选择这两个库之一并考虑解决问题。


Ana*_*ran 6

这个问题有两个主要方面:编码和压缩。

通用压缩似乎不适用于小字符串。由于浏览器不提供任何 api 来压缩字符串,因此您还需要加载源代码,这可能是巨大的。

使用有效的编码可以保存大量字符。我写了一个名为? 处理编码和解码部分。这个想法是指定尽可能多的关于 url 参数的结构和域的信息作为规范。然后可以使用该规范来驱动编码和解码。例如,布尔值可以仅使用一位编码,整数可以转换为不同的 base(64) 从而减少所需的字符数,对象键不需要编码,因为它可以从规范中推断出来,枚举可以使用编码记录2 (numberOfAllowedValues) 位。

  • 我喜欢这个实现!在我们的项目中尝试过,似乎运行良好。现在只需要让它与角度路由一起工作:) (2认同)