Ben*_*oyt 28 python unicode url utf-8
我想知道什么是最好的方法 - 或者如果标准库有一个简单的方法 - 将域名和路径中的Unicode字符转换为等效的ASCII URL,使用域编码为IDNA和路径% -encoded,根据RFC 3986.
我从用户那里得到一个UTF-8的URL.因此,如果他们输入http://?.ws/?我'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'的Python.我想要的是ASCII版本:'http://xn--hgi.ws/%E2%99%A5'.
我现在所做的是通过正则表达式将URL拆分为多个部分,然后手动对域进行IDNA编码,并使用不同的urllib.quote()调用单独编码路径和查询字符串.
# url is UTF-8 here, eg: url = u'http://?.ws/?'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
raise BadURLException(url)
protocol, domain, port, path, query = match.groups()
try:
domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
return '' # bad UTF-8 chars in domain
domain = domain.encode('idna')
if port is None:
port = ''
path = urllib.quote(path)
if query is None:
query = ''
else:
query = urllib.quote(query, safe='=&?/')
url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'
Run Code Online (Sandbox Code Playgroud)
它是否正确?有更好的建议吗?是否有简单的标准库函数来执行此操作?
Mar*_*rot 45
import urlparse, urllib
def fixurl(url):
# turn string into unicode
if not isinstance(url,unicode):
url = url.decode('utf8')
# parse it
parsed = urlparse.urlsplit(url)
# divide the netloc further
userpass,at,hostport = parsed.netloc.rpartition('@')
user,colon1,pass_ = userpass.partition(':')
host,colon2,port = hostport.partition(':')
# encode each component
scheme = parsed.scheme.encode('utf8')
user = urllib.quote(user.encode('utf8'))
colon1 = colon1.encode('utf8')
pass_ = urllib.quote(pass_.encode('utf8'))
at = at.encode('utf8')
host = host.encode('idna')
colon2 = colon2.encode('utf8')
port = port.encode('utf8')
path = '/'.join( # could be encoded slashes!
urllib.quote(urllib.unquote(pce).encode('utf8'),'')
for pce in parsed.path.split('/')
)
query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))
# put it back together
netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
return urlparse.urlunsplit((scheme,netloc,path,query,fragment))
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@?.ws:81/admin')
print fixurl(u'http://?.ws/admin')
Run Code Online (Sandbox Code Playgroud)
http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
urlparse/ urlunparse到urlsplit/ urlunsplit.小智 5
MizardX给出的代码不是100%正确。这个例子行不通:
example.com/folder/?page=2
签出django.utils.encoding.iri_to_uri()将Unicode URL转换为ASCII URL。
http://docs.djangoproject.com/en/dev/ref/unicode/
| 归档时间: |
|
| 查看次数: |
22829 次 |
| 最近记录: |