Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Ben*_*oyt 28 python unicode url utf-8

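I'm wondering what the best way is (or whether the standard library offers a simple way) to convert a URL with Unicode characters in the domain name and path into the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded, as per RFC 3986.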

I get a URL in UTF-8 from the user. So if they've typed in http://➡.ws/♥, I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is an ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
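
As a quick check of the target form, the two encodings can be tried separately. This is only a minimal sketch in Python 3 (an assumption; the code in this thread is Python 2), using the built-in "idna" codec and urllib.parse.quote:

```python
from urllib.parse import quote

# Host labels go through Python's built-in "idna" codec.
host = "\u27a1.ws".encode("idna").decode("ascii")

# Path characters are UTF-8 encoded, then each byte is percent-escaped.
path = quote("/\u2665")

print(host)  # xn--hgi.ws
print(path)  # /%E2%99%A5
```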

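What I do at the moment is split the URL into parts with a regex, then manually IDNA-encode the domain, and separately encode the path and the query string with different urllib.quote() calls.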

# url is UTF-8 here, eg: url = u'http://➡.ws/♥'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E2%99%A5'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?

Mar*_*rot 45

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
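
The code above targets Python 2 (urlparse, urllib.quote, unicode). A rough Python 3 port might look like the sketch below. It assumes the input is already a str, and, like the original, it does not handle IPv6 literals in the host; the function name is simply reused for illustration.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def fixurl(url):
    """Python 3 sketch of the same approach; takes and returns str."""
    parsed = urlsplit(url)

    # Split the netloc into optional user info and host[:port].
    userpass, at, hostport = parsed.netloc.rpartition("@")
    user, colon1, pass_ = userpass.partition(":")
    host, colon2, port = hostport.partition(":")

    # IDNA-encode only the host name; percent-encode the user info.
    host = host.encode("idna").decode("ascii")
    user = quote(user)
    pass_ = quote(pass_)

    # Quote each path segment separately so an escaped slash stays escaped.
    path = "/".join(
        quote(unquote(seg), safe="") for seg in parsed.path.split("/")
    )
    query = quote(unquote(parsed.query), safe="=&?/")
    fragment = quote(unquote(parsed.fragment))

    netloc = "".join((user, colon1, pass_, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))

print(fixurl("http://\u27a1.ws/\u2665"))
print(fixurl("http://\u27a1.ws/\u2665/%2F"))
print(fixurl("http://\u00c5sa:abc123@\u27a1.ws:81/admin"))
```

With these inputs the sketch should print the same first three lines as the Python 2 output above.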

Edits:

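  • Fixed the case of already-quoted characters in the string.
  • Changed urlparse/urlunparse to urlsplit/urlunsplit.
  • Don't encode user and port information together with the hostname. (Thanks Jehiah)
  • When "@" is missing, don't treat the host/port as user/pass! (Thanks hupf)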

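  • The problem is that '/' is considered a path separator, while '%2F' is not. If I just unquote the string, the two become the same string. Maybe it would be best to never unquote the path, and to encode every existing '%' as '%25'..? (2 upvotes)
  • netloc != domain, so you should first parse the domain out of `user:pass@domain:port`, then convert it to IDNA (2 upvotes)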
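
The '/' versus '%2F' point from the comments can be seen in a small comparison. This is an illustrative Python 3 sketch (the sample path is made up), contrasting whole-path unquoting, per-segment unquoting, and never unquoting at all:

```python
from urllib.parse import quote, unquote

raw = "/a%2Fb/c"

# Unquoting the whole path erases the difference between "/" and "%2F".
whole = quote(unquote(raw))

# Unquoting each segment on its own keeps the escaped slash escaped.
per_segment = "/".join(quote(unquote(seg), safe="") for seg in raw.split("/"))

# Never unquoting turns every existing "%" into "%25".
no_unquote = quote(raw)

print(whole)        # /a/b/c
print(per_segment)  # /a%2Fb/c
print(no_unquote)   # /a%252Fb/c
```

Which behaviour is right depends on whether the input is expected to arrive already escaped.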

小智 5

The code given by MizardX isn't 100% correct. This example won't work:

example.com/folder/?page=2

Check out django.utils.encoding.iri_to_uri() to convert a Unicode URL to an ASCII URL.

http://docs.djangoproject.com/en/dev/ref/unicode/
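
Without Django, a similar effect for the path and query can be approximated with the standard library alone. The sketch below is Python 3; escape_non_ascii and the SAFE set are hypothetical names chosen for illustration, not Django's API. It leaves reserved characters and existing '%' escapes untouched and does not IDNA-encode the host.

```python
from urllib.parse import quote

# Characters left as-is: reserved characters plus "%" and "~", so URL
# structure and existing escapes survive. This set is an assumption.
SAFE = "/#%[]=:;$&()+,!?*@'~"

def escape_non_ascii(iri):
    """Percent-encode non-ASCII (and unsafe ASCII) characters as UTF-8 bytes."""
    return quote(iri, safe=SAFE)

print(escape_non_ascii("example.com/folder/?page=2"))  # example.com/folder/?page=2
print(escape_non_ascii("/\u2665/%2F"))                 # /%E2%99%A5/%2F
```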