Python: Get part of a URL path

Question

How do I get a specific part of the path from a URL? For example, I want a function that operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere").

I know this will probably use urllib or urllib2, but I can't work out from the docs how to get only one part of the path.

Answer 1

Jos*_*Lee 38

Use urlparse to extract the path component of the URL (this is the Python 2 module; in Python 3 it is urllib.parse):

>>> import urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'

Use os.path.split to split the path into components:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two halves of that split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'
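
For Python 3, the same steps could be sketched roughly as follows. This is only an illustration: the function name top_level_segment is made up here, posixpath is swapped in for os.path so the behaviour does not depend on the platform, and the loop stops when dirname no longer changes the path so that an empty path cannot spin forever.

```python
import posixpath
from urllib.parse import urlparse


def top_level_segment(url):
    """Return the first path segment of url, or '' if the path is empty or '/'.

    Hypothetical helper, not part of the original answer.
    """
    path = urlparse(url).path
    # Climb towards the root, one dirname call at a time.
    while True:
        parent = posixpath.dirname(path)
        # Stop at the root, at an empty path, or when nothing changes.
        if parent in ("", "/") or parent == path:
            break
        path = parent
    return posixpath.basename(path)


print(top_level_segment("http://www.mydomain.com/hithere?image=2934"))
print(top_level_segment("http://www.mydomain.com/hithere/something/else"))
```

Both calls should print hithere, since the query string is not part of the parsed path.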

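Don't use os.path.split for URLs, because it is platform-dependent. This code will fail on Windows because it expects \ as the separator! (44 upvotes)
os.path.split may work, but I think using it here is bad practice, since it is clearly meant for OS paths, not URL paths. (6 upvotes)
@Viorel That's incorrect. I just tested it. Using `os.path.join` would be wrong because it would use the wrong separator, but the `split` method can still split on `/`. In fact, you can type all your Windows directory paths in Python using `/` as the directory separator. Using `/` as a directory separator works in many places on Windows, not just in Python. (4 upvotes)
Does urllib have anything that can do this without a bunch of string parsing/splitting/looping? I thought there would be a shortcut... (2 upvotes)
On Windows, using `os.path` will fail for URLs that contain \. Use `posixpath` instead - see my answer. (2 upvotes)
Why not just use `path.split("/")`? (2 upvotes)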
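
As a rough sketch of what the last comment suggests, the path can simply be split on "/" with no path module at all. The name first_path_segment is made up for illustration, and decoding each segment only after splitting is an assumption about what is wanted here (it keeps an encoded slash, %2F, inside its segment).

```python
from urllib.parse import unquote, urlsplit


def first_path_segment(url):
    """Return the first non-empty path segment of url, or None if there is none.

    Hypothetical helper illustrating the plain split("/") approach.
    """
    path = urlsplit(url).path
    for segment in path.split("/"):
        if segment:
            # Decode percent-escapes only after splitting, so an encoded
            # slash (%2F) inside a segment is not treated as a separator.
            return unquote(segment)
    return None


print(first_path_segment("http://www.mydomain.com/hithere?image=2934"))
print(first_path_segment("http://www.mydomain.com/hithere/something/else"))
```

Unlike posixpath.normpath, this does not resolve "." or ".." segments; it only skips empty ones.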

Answer 2

Iwa*_*amp 17

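The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths, whether it is used on POSIX or Windows NT based platforms.

Sample code: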

#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath
import json

def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            # Reached the root or an empty path; stop instead of looping forever.
            break
        result.insert( 0, item )
        tmp = head
    return result

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

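Notes:

On Windows NT based platforms, os.path is ntpath.
On Unix/POSIX based platforms, os.path is posixpath.
ntpath treats backslashes (\) as path separators, which is wrong for URL paths (see the last two cases in the code/output); this is why posixpath is recommended.
Remember to use urllib.parse.unquote.
Consider using posixpath.normpath.
RFC 3986 does not define the semantics of multiple path separators (/). However, posixpath.normpath collapses multiple adjacent path separators (for example, it treats a///b, a//b and a/b the same; exactly two leading slashes are preserved).
Although POSIX and URL paths have similar syntax and semantics, they are not identical.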
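
A few of these notes can be checked directly. The snippet below is only a small sketch; the example URL and the variable names are made up for illustration.

```python
import ntpath
import posixpath
from urllib.parse import unquote, urlparse

# Percent-escapes stay encoded in the parsed path until unquote is applied.
raw_path = urlparse("http://eg.com/hi%20there/a//b/./c/../d").path
decoded_path = unquote(raw_path)

# normpath collapses repeated separators and resolves "." and "..".
normalized_path = posixpath.normpath(decoded_path)

# A decoded backslash is ordinary data for posixpath,
# but ntpath treats it as a separator and strips it from the head.
posix_pieces = posixpath.split("/see\\/if")
nt_pieces = ntpath.split("/see\\/if")

print(raw_path)
print(decoded_path)
print(normalized_path)
print(posix_pieces)
print(nt_pieces)
```

The last two lines show the difference the answer warns about: posixpath keeps the backslash as part of the segment name, while ntpath silently drops it.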

Python 3.4+ solution: `url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path))`. (3 upvotes)

Answer 3

Nav*_*vin 10

Python 3.4+ solution:

from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath

url = 'http://www.example.com/hithere/something/else'

PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]

# returns 'hithere' (the same for the URL with parameters)

# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
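
Indexing parts[1] directly raises IndexError when the URL has no path at all. One possible way to guard against that is sketched below; the helper name path_part and its index/default parameters are assumptions made for illustration, not something from the answer.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def path_part(url, index=0, default=None):
    """Return the path segment at index (0 = first after the root), or default.

    Hypothetical wrapper around the PurePosixPath approach.
    """
    path = PurePosixPath(unquote(urlparse(url).path))
    # For absolute paths, parts starts with the root entry; skip it.
    segments = path.parts[1:] if path.root else path.parts
    try:
        return segments[index]
    except IndexError:
        return default


print(path_part("http://www.example.com/hithere/something/else"))
print(path_part("http://www.example.com/hithere?image=2934"))
print(path_part("http://www.example.com"))
```

The first two calls should print hithere and the last one None, because a URL without a path yields no segments.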

Answer 4

Azi*_*lto 8

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
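
Applied to the URL from the question, the same call keeps the query string out of the path, so only the path needs splitting. A short sketch follows; the variable names are illustrative only.

```python
from urllib.parse import urlparse

parsed = urlparse("http://www.mydomain.com/hithere?image=2934")

# The query string is reported separately and never appears in .path.
url_path = parsed.path
url_query = parsed.query

# Take the first segment by stripping the leading slash and splitting.
first_segment = url_path.strip("/").split("/")[0]

print(url_path)
print(url_query)
print(first_segment)
```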
