pek*_*pek 127 regex language-agnostic url
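Given a URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The regex should work correctly even if I enter the following URL: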
http://example.example.com/example/example/example.html
hom*_*ast 139
A single regex to parse and break up a full URL, including query parameters and anchors, e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RegEx positions:
url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8
You can then further parse the host ('.' delimited) quite easily.
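As a hedged illustration of the positions listed above, the sketch below runs the same pattern in JavaScript and reads the groups from the match array instead of the legacy `RegExp.$n` statics. The names `URL_RE`, `parts` and `hostLabels` are assumptions made for this example. On this particular input the greedy `(.*)?` group also swallows the `#hash` part, so group 8 comes back undefined.

```javascript
// Pattern copied from the answer above; URL_RE is a name chosen for this sketch.
const URL_RE = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/;

const m = URL_RE.exec('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash');

// Map the numbered groups to the names used in the list above.
const parts = {
  protocol: m[2],
  host: m[3],
  path: m[4],
  file: m[6],
  query: m[7], // greedy: also contains the trailing "#hash" for this input
  hash: m[8],  // therefore undefined for this input
};

// Further parse the host on '.' as suggested above.
const hostLabels = parts.host.split('.');

console.log(parts, hostLabels);
```

If the fragment is needed separately, a stricter query group such as the one in the RFC 3986 pattern avoids this overlap.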
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
Then parse "the rest" further, as specifically as you need. Doing it all in one regex is, well, a bit crazy.
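The two-step idea above could look like the following in JavaScript. This is only a sketch: the helper names `splitRest` and `parseUrl`, and the choice of plain string operations for the second step, are assumptions of this example rather than part of the answer.

```javascript
// Step 1: coarse split with the pattern from the answer above.
const COARSE_RE = /^(.*:)\/\/([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$/;

// Step 2: split "the rest" into path, query and hash with string operations
// (an assumption of this sketch; the answer does not prescribe how).
function splitRest(rest) {
  const hashAt = rest.indexOf('#');
  const hash = hashAt >= 0 ? rest.slice(hashAt) : '';
  const beforeHash = hashAt >= 0 ? rest.slice(0, hashAt) : rest;
  const queryAt = beforeHash.indexOf('?');
  const query = queryAt >= 0 ? beforeHash.slice(queryAt) : '';
  const path = queryAt >= 0 ? beforeHash.slice(0, queryAt) : beforeHash;
  return { path, query, hash };
}

function parseUrl(input) {
  const m = COARSE_RE.exec(input);
  if (m === null) return null; // input has no "scheme://" prefix
  return { proto: m[1], host: m[2], port: m[3], ...splitRest(m[4]) };
}

const parsed = parseUrl('http://test.example.com:8080/dir/subdir/file.html?x=1#top');
console.log(parsed);
```

Keeping the second step separate makes it easy to tighten one component without touching the coarse pattern.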
Rob*_*Rob 79
I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:
var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
console.log(k+':', a[k]);
});
/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
gwg*_*gwg 58
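I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee et al., is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
For what it's worth, I found that I had to escape the forward slashes in JavaScript: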
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
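A minimal sketch of using the escaped pattern in JavaScript follows. The component names come from the RFC's own mapping ($2 scheme, $4 authority, $5 path, $7 query, $9 fragment); `URI_PARTS_RE` and `splitUri` are hypothetical names for this example.

```javascript
// RFC 3986 Appendix B pattern, with forward slashes escaped for a JS literal.
const URI_PARTS_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function splitUri(uri) {
  // Every group is optional, so the pattern matches any input string.
  const m = URI_PARTS_RE.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const rfcExample = splitUri('http://www.ics.uci.edu/pub/ietf/uri/#Related');
console.log(rfcExample);
```

Because the pattern never fails, it separates components but does not validate them; validation has to happen afterwards.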
min*_*fai 31
I found that the highest-voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:
Here is a modified version:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$
The positions of the parts are as follows:
int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
Edit posted by an anonymous user:
function getFileName(path) {
return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
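To show how the listed positions map onto a match, here is a hedged JavaScript sketch built on the first modified pattern above. `MODIFIED_RE`, `GROUP` and `extract` are names invented for this example, and the sample URL is an assumption.

```javascript
// First modified pattern from the answer above.
const MODIFIED_RE = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/;

// Group positions as listed in the answer.
const GROUP = { SCHEMA: 2, DOMAIN: 3, PORT: 5, PATH: 6, FILE: 8, QUERYSTRING: 9, HASH: 12 };

function extract(url) {
  const m = MODIFIED_RE.exec(url);
  if (m === null) return null; // pattern requires a trailing file component
  const out = {};
  for (const [name, index] of Object.entries(GROUP)) {
    out[name] = m[index];
  }
  return out;
}

const sample = extract('http://www.example.com:8080/dir/sub/file.html?a=1#top');
console.log(sample);
```

Note that QUERYSTRING (group 9) keeps its leading `?`, while HASH (group 12) is the fragment without `#`.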
小智 11
I needed a regex to match all URLs and came up with this one:
/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/
It matches all URLs, with any protocol, even URLs like
ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag
The result (in JavaScript) looks like this:
["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]
A URL like
mailto://admin@www.cs.server.com
looks like this:
["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]
I was trying to solve this in JavaScript, where it should be handled by:
var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');
since (in Chrome, at least) it parses to:
{
"hash": "#foobar/bing/bo@ng?bang",
"search": "?foo=bar&bingobang=&king=kong@kong.com",
"pathname": "/path/wah@t/foo.js",
"port": "890",
"hostname": "example.com",
"host": "example.com:890",
"password": "b",
"username": "a",
"protocol": "http:",
"origin": "http://example.com:890",
"href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}
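Where the `URL` constructor is available, a small wrapper can also expose the query parameters as a plain object. The sketch below assumes a runtime with the standard `URL` and `URLSearchParams` globals (modern browsers or Node.js); `describeUrl` is a hypothetical helper name.

```javascript
function describeUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch (err) {
    return null; // new URL throws a TypeError on input it cannot parse
  }
  return {
    protocol: url.protocol,
    username: url.username,
    password: url.password,
    hostname: url.hostname,
    port: url.port,
    pathname: url.pathname,
    hash: url.hash,
    // Collapse the query string into key/value pairs (last value wins).
    params: Object.fromEntries(url.searchParams),
  };
}

const described = describeUrl(
  'http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang'
);
console.log(described);
```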
However, this isn't cross-browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull out the same parts as above:
^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?
Credit for this regex goes to https://gist.github.com/rpflorence, who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) and who came up with the regex this was originally based on.
The parts are in this order:
var keys = [
"href", // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
"origin", // http://user:pass@host.com:81
"protocol", // http:
"username", // user
"password", // pass
"host", // host.com:81
"hostname", // host.com
"port", // 81
"pathname", // /directory/file.ext
"search", // ?query=1
"hash" // #anchor
];
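A hedged sketch of how the pattern and the key order above fit together: index `i` of the match array is stored under `KEYS[i]`. The names `LITE_RE`, `KEYS` and `toParts` are assumptions of this example and not the API of the library mentioned in this answer.

```javascript
// Pattern copied from the answer above.
const LITE_RE = /^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?/;

// Same order as the key list above: match[0] is href, match[1] is origin, ...
const KEYS = ['href', 'origin', 'protocol', 'username', 'password', 'host',
  'hostname', 'port', 'pathname', 'search', 'hash'];

function toParts(url) {
  const m = LITE_RE.exec(url);
  if (m === null) return null;
  const parts = {};
  KEYS.forEach((key, i) => {
    parts[key] = m[i];
  });
  return parts;
}

const liteParts = toParts('http://user:pass@host.com:81/directory/file.ext?query=1#anchor');
console.log(liteParts);
```

Unlike the browser's `URL` object, the `origin` produced here includes the credentials, as the comment in the key list shows.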
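There is also a small library that wraps it and provides query parameters:
https://github.com/sadams/lite-url (also available on bower)
If you have an improvement, please create a pull request with more tests and I will accept and merge it with thanks.
Proposing a much more readable solution (in Python, but it applies to any regex):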
import re

def url_path_to_dict(path):
    # Named groups make each URL component self-documenting.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    # Return None when the input does not match at all.
    d = m.groupdict() if m is not None else None
    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()
Prints:
{
'host': 'example.example.com',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}
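The same named-group idea carries over to JavaScript (ES2018 and later), which spells the groups `(?<name>...)` instead of `(?P<name>...)`. The following is a hedged port, not the author's code; `NAMED_RE` and `urlPathToObject` are names made up for this sketch.

```javascript
// One line per component, mirroring the Python pattern above.
const NAMED_RE = new RegExp(
  '^' +
  '((?<schema>.+?)://)?' +
  '((?<user>.+?)(:(?<password>.*?))?@)?' +
  '(?<host>.*?)' +
  '(:(?<port>\\d+?))?' +
  '(?<path>/.*?)?' +
  '(?<query>[?].*?)?' +
  '$'
);

function urlPathToObject(path) {
  const m = NAMED_RE.exec(path);
  // Unmatched groups are undefined in JavaScript (None in Python).
  return m === null ? null : { ...m.groups };
}

const named = urlPathToObject('http://example.example.com/example/example/example.html');
console.log(named);
```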
Try the following:
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
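It supports HTTP/FTP, subdomains, folders, files, etc.
I found it with a quick Google search:
http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
Subdomains and domains are difficult because a subdomain can have several parts, and so can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/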
the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)
the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
the path with the file : http://[^/]+/(.*)
the URL without the path : (http://[^/]+/)
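As a rough illustration, three of the patterns above can be applied in JavaScript like this. The slashes are escaped for regex literals, and the variable names are assumptions of this sketch.

```javascript
const sampleUrl = 'http://test.example.com/dir/subdir/file.html';

// "the file": last path segment that contains at least one dot.
const fileName = sampleUrl.match(/http:\/\/[^\/]+\/(?:[^\/]+\/)*((?:[^\/.]+\.)+[^\/.]+)$/)[1];

// "the path with the file": everything after the host.
const pathWithFile = sampleUrl.match(/http:\/\/[^\/]+\/(.*)/)[1];

// "the URL without the path": scheme and host up to the first slash.
const baseUrl = sampleUrl.match(/(http:\/\/[^\/]+\/)/)[1];

console.log(fileName, pathWithFile, baseUrl);
```

These patterns hard-code `http://`, so other schemes need the prefix adjusted, and `match` returns null when a pattern does not apply.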
(Markdown isn't very friendly to regexes.)
This improved version should work as reliably as a parser.
// Applies to URI, not just URL or URN:
// http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
//
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
//
// http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
//
// $@ matches the entire uri
// $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
// $2 matches authority (host, user:pwd@host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
// Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
//
// (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
//
// Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
{
if( !schemes )
schemes = '[^\\s:\/?#]+'
else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
throw TypeError( 'expected URI schemes' )
return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
}
// http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
function uriSchemesRegExp()
{
return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
}
const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s@]*@)?([^\/@:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;
/**
* GROUP 1 ([scheme][authority][host][port])
* GROUP 2 (scheme)
* GROUP 3 (authority)
* GROUP 4 (host)
* GROUP 5 (port)
* GROUP 6 (path)
* GROUP 7 (?query)
* GROUP 8 (query)
* GROUP 9 (fragment)
*/
URI_RE.exec("https://john:doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("ldap://[2001:db8::7]/c=GB?objectClass?one");
URI_RE.exec("mailto:John.Doe@example.com");
Above you can find a JavaScript implementation that uses the modified regex.
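To make the group list concrete, here is a hedged sketch that names the groups of `URI_RE`. The helper `parseUri` is hypothetical. The key `authority` follows the label used in the group list, although group 3 actually holds the user info together with its trailing `@`.

```javascript
// Pattern copied from the snippet above.
const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s@]*@)?([^\/@:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;

function parseUri(uri) {
  const m = URI_RE.exec(uri);
  if (m === null) return null;
  return {
    scheme: m[2],
    authority: m[3], // user info including the trailing "@"
    host: m[4],
    port: m[5],
    path: m[6],
    query: m[8],     // without the leading "?"
    fragment: m[9],  // with the leading "#"
  };
}

const full = parseUri('https://john:doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top');
const mail = parseUri('mailto:John.Doe@example.com');
console.log(full, mail);
```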