Get the subdomain from a URL using Python

Mar*_*rko 10 python string url

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable, so that I can do this:

print SubAddr
>> lol1

Dan*_*man 17

urlparse.urlparse splits a URL into scheme, location, port, and so on. You can then split the hostname on '.' to get the subdomain.

import urlparse  # Python 2 module name

url = urlparse.urlparse(address)
# take the first dot-separated label of the hostname
subdomain = url.hostname.split('.')[0]

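  • What if it is an IP address? What if it has a second-level subdomain? (4 upvotes)
  • In Python 3.x, you need to import it via "from urllib.parse import urlparse" (4 upvotes)
  • This is actually a pretty bad answer. It fails when there is no subdomain, returning the domain instead. It fails for IP addresses (OK, no problem), and it fails for multiple subdomains, e.g. `web.host1.google.com`. (3 upvotes)
  • A subdomain can contain multiple dots, so `api.test` is also valid; keep that in mind. If you want a good package that does this, check `https://pypi.python.org/pypi/tldextract`. (2 upvotes)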
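
The cases raised in these comments (no subdomain, IP addresses, multi-level subdomains) can be sketched with the standard library alone. This is a minimal sketch in Python 3, not the answer's own code: `naive_subdomain` and its `suffix_labels` parameter are hypothetical names, and the caller is assumed to know how many labels the suffix has (1 for `com`, 2 for `co.uk`). When that is not known in advance, a suffix list such as the one tldextract uses is still needed.

```python
import ipaddress
from urllib.parse import urlparse


def naive_subdomain(url, suffix_labels=1):
    """Return the subdomain part of url as a string, or None if there is none.

    Assumption: the registered domain is exactly one label followed by
    `suffix_labels` suffix labels (1 for "com", 2 for "co.uk").
    """
    host = urlparse(url).hostname
    if host is None:
        return None  # no network location in the URL
    try:
        ipaddress.ip_address(host)
        return None  # IP literals have no subdomain
    except ValueError:
        pass  # not an IP address, so treat it as a host name
    labels = host.split(".")
    # Drop the domain label and the suffix labels; what is left is the subdomain.
    sub = labels[:-(suffix_labels + 1)]
    return ".".join(sub) or None
```

For example, `naive_subdomain("http://web.host1.google.com")` keeps both subdomain labels, while a bare `http://domain.com` or an IP address yields `None` instead of the domain name.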

小智 12

tldextract makes this task very easy, and if you need any further information you can then use urlparse as suggested:

>>> import tldextract
>>> import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract handles multi-level subdomains correctly.


Aco*_*orn 5

A modified version of this great answer: How to extract top-level domain name (TLD) from URL

You will need the list of effective TLDs from here

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

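        # match tlds:
        if exceptionCandidate in tlds:
            # exception rule: the candidate's first label is the domain,
            # the remaining labels form the TLD
            return DomainParts(urlElements[:i] + lastIElements[:1], '.'.join(lastIElements[1:]))
        if candidate in tlds or wildcardCandidate in tlds:
            # e.g. DomainParts(["abcde"], "co.uk")
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))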

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld

This gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
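
The same matching idea can be tried in Python 3 without downloading the list. The sketch below is an illustration under stated assumptions, not the answer's code: `split_host` is a hypothetical helper, and `SAMPLE_RULES` is a tiny made-up excerpt standing in for the real effective TLD file. It tries the longest candidate suffix first and treats a `!` rule as an exception whose first label belongs to the domain rather than the suffix.

```python
from urllib.parse import urlparse

# Hypothetical excerpt of suffix rules, for illustration only.
# A real run would load the full effective TLD list instead.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) for the host name in url."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError("URL has no host name")
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for start in range(len(labels)):
        tail = labels[start:]
        candidate = ".".join(tail)
        wildcard = ".".join(["*"] + tail[1:])
        if "!" + candidate in rules:
            # Exception rule: the suffix is the candidate minus its first label.
            cut = start + 1
        elif candidate in rules or wildcard in rules:
            cut = start
        else:
            continue
        rest = labels[:cut]
        if not rest:
            raise ValueError("host is a bare suffix with no domain")
        return rest[:-1], rest[-1], ".".join(labels[cut:])
    raise ValueError("host does not match any known suffix")


subdomains, domain, suffix = split_host("http://sub2.sub1.example.co.uk:80")
print("Domain:", domain)
print("Subdomains:", subdomains or "None")
print("TLD:", suffix)
```

Keeping the rules in a set makes each membership check constant time, which matters once the full list with its thousands of entries is loaded.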