Mar*_*rko 10 python string url
For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so that I can do this:
print SubAddr
>> lol1
Dan*_*man 17
urlparse.urlparse splits a URL into scheme, location, port, and so on. You can then split the hostname on "." to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
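On Python 3 the same function lives in urllib.parse. A minimal sketch of the approach above, assuming the first label of the hostname is the subdomain you want (the helper name first_label is made up for illustration):

```python
from urllib.parse import urlparse


def first_label(address):
    """Return the first dot-separated label of the URL's hostname."""
    # .hostname drops the port and lowercases the host; it is None
    # when the URL has no network location (e.g. a missing scheme).
    hostname = urlparse(address).hostname
    if hostname is None:
        raise ValueError("no hostname in %r" % address)
    return hostname.split(".")[0]


sub_addr = first_label("http://lol1.domain.com:8888/some/page")
print(sub_addr)  # lol1
```

Note that this returns "sub" for sub.lol1.domain.com and "domain" for a bare domain.com, so it only fits hosts with exactly one subdomain label.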
小智 12
The tldextract package makes this task very easy, and if you need any further information you can use urlparse as suggested:
>> import tldextract
>> import urlparse
>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract handles multi-level subdomains correctly: sub.lol1 comes back as a single subdomain value.
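To see why the multi-level case matters without installing anything: splitting on the first dot keeps only the leftmost label. The sketch below uses only the Python 3 standard library and assumes you already know the registered domain. strip_registered_domain is a hypothetical helper, not part of tldextract or urlparse.

```python
from urllib.parse import urlparse


def strip_registered_domain(url, registered_domain):
    """Return everything left of registered_domain in the URL's hostname.

    registered_domain (e.g. "domain.com") is supplied by the caller; this
    sketch does not consult any suffix list.
    """
    hostname = urlparse(url).hostname or ""
    if hostname == registered_domain:
        return ""  # no subdomain at all
    suffix = "." + registered_domain
    if not hostname.endswith(suffix):
        raise ValueError("%r is not under %r" % (hostname, registered_domain))
    return hostname[: -len(suffix)]


url = "http://sub.lol1.domain.com:8888/some/page"
print(urlparse(url).hostname.split(".")[0])        # sub
print(strip_registered_domain(url, "domain.com"))  # sub.lol1
```

This only works when the registered domain is known in advance; finding it for arbitrary hosts is what a suffix list is for.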
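A modified version of this excellent answer: How to extract the top-level domain name (TLD) from a URL
You will need the list of effective TLDs, which the code below reads from effective_tld_names.dat.txt.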
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc
        candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"] + lastIElements[1:])  # *.co.uk, *.uk, *
        exceptionCandidate = "!" + candidate
        # match tlds:
        if exceptionCandidate in tlds:
            # An exception rule marks a registrable name, so the TLD is the
            # rule minus its first label. Return DomainParts here as well,
            # so the caller always gets the same type back.
            cut = len(urlElements) + i + 1
            return DomainParts(urlElements[:cut], '.'.join(urlElements[cut:]))
        if candidate in tlds or wildcardCandidate in tlds:
            # e.g. DomainParts(["abcde"], "co.uk")
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
This gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
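For readers on Python 3, here is a rough sketch of the same suffix-matching idea. To keep it self-contained it uses a small hand-written set of rules (SAMPLE_RULES, an assumption for illustration) in place of the real list file, and the function name split_host is made up. It treats an exception rule ("!") as marking a registrable name, so the suffix is that rule minus its leftmost label.

```python
from urllib.parse import urlparse

# Tiny stand-in for the effective TLD list, for illustration only.
# "*.ck" is a wildcard rule and "!www.ck" is an exception to it.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) for the URL's hostname."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    # Try the longest candidate suffix first: start = 0, 1, 2, ...
    for start in range(len(labels)):
        tail = labels[start:]
        candidate = ".".join(tail)
        wildcard = ".".join(["*"] + tail[1:])
        if "!" + candidate in rules:
            cut = start + 1  # exception: drop the rule's first label
        elif candidate in rules or wildcard in rules:
            cut = start
        else:
            continue
        rest, suffix = labels[:cut], ".".join(labels[cut:])
        domain = rest[-1] if rest else None
        return rest[:-1], domain, suffix
    raise ValueError("hostname does not end in a known suffix")


print(split_host("http://sub2.sub1.example.co.uk:80"))
# (['sub2', 'sub1'], 'example', 'co.uk')
```

Keeping the rules in a set makes each membership test constant time, which helps once the full list with its thousands of entries is loaded.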