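I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, HTTP://www.foo.com#bar, http://bar.foo.com should all return foo.com.
I wrote this regex, but it matches the whole URL: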
Pattern.compile("[.]?.*[.x][a-z]{2,3}");
I'm not sure whether I'm matching the "." character correctly. I tried "\." but I get an error from NetBeans.
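On matching a literal dot: in a Java string literal, a single backslash before a dot is an illegal escape sequence, so the compiler rejects it, which is presumably the error NetBeans reported (an assumption, since the message is not quoted). A minimal sketch of three spellings that do compile and match only a dot; the class and method names here are hypothetical:

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Regex \. written as the Java literal "\\." (the backslash itself is escaped)
    static final Pattern ESCAPED = Pattern.compile("\\.");
    // Inside a character class the dot has no special meaning
    static final Pattern IN_CLASS = Pattern.compile("[.]");
    // Pattern.quote wraps the text so it is matched literally
    static final Pattern QUOTED = Pattern.compile(Pattern.quote("."));

    // True when the pattern matches "." but not an arbitrary other character
    static boolean isLiteralDot(Pattern p) {
        return p.matcher(".").matches() && !p.matcher("x").matches();
    }

    public static void main(String[] args) {
        System.out.println(isLiteralDot(ESCAPED));  // true
        System.out.println(isLiteralDot(IN_CLASS)); // true
        System.out.println(isLiteralDot(QUOTED));   // true
        // An unescaped dot matches any character, so this is not a literal dot
        System.out.println(isLiteralDot(Pattern.compile("."))); // false
    }
}
```

The "[.]" form already used in the question is therefore fine as far as the dot goes; the whole-URL match comes from the ".*" in the middle of the pattern.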
Update:
The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.
jsa*_*msa 10
This is harder than you'd imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a good post about some of the troubles:
https://blog.codinghorror.com/the-problem-with-urls/
https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])
is a good starting point.
Some listings from "Mastering Regular Expressions" on this topic:
http://regex.info/listing.cgi?ed=3&p=207
@sjobe
>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)
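Sorry the example is in Python rather than Java; it's more brief. Java requires some extraneous escaping of the regex.
I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of the host.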
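For comparison, a rough Java counterpart of the Python session above. This is a sketch rather than a definitive port: the class name `UrlPrefixDemo` and the helper `hostAndPath` are made up for illustration. This particular pattern contains no backslashes, so it can be pasted into a Java string literal unchanged; the extra escaping only becomes necessary for patterns that use sequences such as \w or \.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPrefixDemo {
    // Same pattern as in the Python session
    static final Pattern URL = Pattern.compile(
        "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Returns capture group 1, or null when the input does not start with a URL
    static String hostAndPath(String input) {
        Matcher m = URL.matcher(input);
        // lookingAt() anchors the match at the start of the input only,
        // which is the closest Java analogue of Python's re.match
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hostAndPath("http://news.google.com/")); // news.google.com/
        System.out.println(hostAndPath("not a url"));               // null
        System.out.println(hostAndPath("http://google.com/"));      // google.com/
        System.out.println(hostAndPath("http://google.com"));       // google.com
    }
}
```

Returning null instead of throwing mirrors the "not a url" case, where the Python match object is None.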
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {
    public static void main(String[] args) throws URISyntaxException {
        // Capture the last two dot-separated labels of the host
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");
        String[] urls = new String[] {
            "https://foo.com/bar",
            "http://www.foo.com#bar",
            "http://bar.foo.com"
        };
        for (String url : urls) {
            URI uri = new URI(url);
            // e.g. uri.getHost() will return "www.foo.com"; it is null if the URI has no host
            String host = uri.getHost();
            if (host == null) {
                continue;
            }
            Matcher m = p.matcher(host);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}
Prints:
foo.com
foo.com
foo.com
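Note that the two-label regex above would return co.uk for a host such as www.foo.co.uk, which the question's update rules out. A minimal sketch of one way to handle that while keeping the java.net.URI step: look the last two labels up in a set of known two-label suffixes and keep a third label when they match. The names `RegistrableDomainSketch`, `TWO_LABEL_SUFFIXES` and `domainOf` are hypothetical, and the suffix set is a deliberately incomplete assumption; real code would need a maintained list such as the Public Suffix List.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RegistrableDomainSketch {
    // Assumption: a tiny illustrative sample, not a complete suffix list
    static final Set<String> TWO_LABEL_SUFFIXES = new HashSet<>(
        Arrays.asList("co.uk", "org.uk", "ac.uk", "org.au", "com.au"));

    // Returns domain.tld (or domain.co.uk etc.), or null if the URL has no host
    static String domainOf(String url) {
        // URI.create throws IllegalArgumentException on malformed input
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        if (n < 2) {
            return labels[0]; // single-label host such as "localhost"
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        // Keep one more label when the last two form a known suffix
        if (n >= 3 && TWO_LABEL_SUFFIXES.contains(lastTwo)) {
            return labels[n - 3] + "." + lastTwo;
        }
        return lastTwo;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("https://foo.com/bar"));      // foo.com
        System.out.println(domainOf("HTTP://www.foo.com#bar"));   // foo.com
        System.out.println(domainOf("http://www.foo.co.uk/bar")); // foo.co.uk
    }
}
```

Splitting on dots instead of using a regex keeps the suffix lookup a plain set membership test, so extending the list does not change the matching logic.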
If the string contains a valid URL, then you could use a regex like this (Perl quoting):
/^
(?:\w+:\/\/)?
[^:?#\/\s]*?
(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)
(?:[:?#\/]|$)
/xi;
Results:
url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk
For Java, it would be quoted like this:
"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"
Of course, you'll need to replace the etc part.
Sample Perl script:
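A short sketch of how the quoted pattern might be used from Java. Two things change relative to the Perl version: the /i modifier becomes the Pattern.CASE_INSENSITIVE flag, and the /x layout is dropped because the Java string is already written without whitespace. The ___etc___ placeholder is simply left out here, so only the suffixes listed above are recognised; the class name `QuotedPatternDemo` and the helper `match` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPatternDemo {
    static final Pattern DOMAIN = Pattern.compile(
        "^(?:\\w+://)?[^:?#/\\s]*?"                       // optional scheme, then leading subdomains
        + "([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au))" // domain + suffix
        + "(?:[:?#/]|$)",                                 // must be followed by port, path, query, fragment or end
        Pattern.CASE_INSENSITIVE);

    // Returns capture group 1, or null when nothing matches
    static String match(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://foo.com/bar",
            "http://www.foo.com#bar",
            "ftp://www.foo.co.uk?bar",
            "ftp://www.foo.co.uk:8080/bar"
        };
        for (String url : urls) {
            System.out.println("url: " + url);
            System.out.println("matched: " + match(url));
        }
    }
}
```

Checking the result of find() before reading the group avoids reporting a stale or missing capture when the input does not match.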
use strict;
my @test = qw(
https://foo.com/bar
http://www.foo.com#bar
http://bar.foo.com
ftp://foo.com
ftp://www.foo.co.uk?bar
ftp://www.foo.co.uk:8080/bar
);
for(@test){
    print "url: $_\n";
    /^
        (?:\w+:\/\/)?
        [^:?#\/\s]*?
        (
            [^.\s]+
            \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
        )
        (?:[:?#\/]|$)
    /xi;
    print "matched: $1\n";
}