Get domain name from given URL

Question

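Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code I wrote. Although it seems to work fine, is there a better approach, or are there edge cases where it could fail?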

public static String getDomainName(String url) throws MalformedURLException{
    if(!url.startsWith("http") && !url.startsWith("https")){
         url = "http://" + url;
    }        
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if(host.startsWith("www")){
        host = host.substring("www".length()+1);
    }
    return host;
}

Input: http://google.com/blah

Output: google.com

Answer 1

Mik*_*uel 264

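If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems: its equals method does a DNS lookup, which means code using it can be vulnerable to denial-of-service attacks when used with untrusted inputs.

"Mr. Gosling -- why did you make URL equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.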

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}

should do what you want.
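
To illustrate the earlier point about equals: java.net.URI compares the parsed components only, so no DNS lookup is involved. A minimal sketch (the class and helper names are made up for illustration):

```java
import java.net.URI;

final class UriEqualityDemo {

    // Compares two URI strings with java.net.URI.equals, which looks only at
    // the parsed components and never touches the network.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Scheme and host are compared without regard to case.
        System.out.println(sameUri("http://example.com/a", "HTTP://EXAMPLE.COM/a")); // true
        // The path is compared case-sensitively.
        System.out.println(sameUri("http://example.com/a", "http://example.com/A")); // false
    }
}
```

Note that URI.create throws IllegalArgumentException on malformed input, so untrusted strings are better parsed with new URI(...) and an explicit catch.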

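"Although it seems to work fine, is there a better approach, or are there edge cases where it could fail?"

Your code as written fails for these valid URLs: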

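httpfoo/bar - a relative URL with a path component that starts with http.
HTTP://example.com/ - the protocol is case-insensitive.
//example.com/ - a protocol-relative URL with a host.
www/foo - a relative URL with a path component that starts with www.
wwwexample.com - a domain name that does not start with "www." but does start with "www".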
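
For reference, here is a small sketch of what java.net.URI reports as the host for each of the inputs listed above. The class name and the hostOf helper are hypothetical, not part of the answer; wrapping the result in Optional is just one way to avoid a NullPointerException when there is no host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class UriHostDemo {

    // Hypothetical helper: returns the host that java.net.URI reports, or
    // empty when the input has no host component or fails to parse.
    static Optional<String> hostOf(String input) {
        try {
            return Optional.ofNullable(new URI(input).getHost());
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar", "HTTP://example.com/", "//example.com/", "www/foo", "wwwexample.com"
        };
        for (String input : inputs) {
            // Relative references without "//" have no host at all.
            System.out.println(input + " -> " + hostOf(input).orElse("(no host)"));
        }
    }
}
```

Only HTTP://example.com/ and //example.com/ yield a host (example.com); the other three are relative references whose text lives in the path component.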

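Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.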
  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9
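The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).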
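
A minimal sketch of applying that expression in Java, assuming you only care about the authority (group 4). The class and method names are hypothetical. Keep in mind that the authority can still contain user info and a port (for example user@host:8080), which this sketch does not strip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Rfc3986Regex {

    // The regular expression from RFC 3986, Appendix B, escaped for Java.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns the authority (group 4), or null when the reference has none
    // or the input does not match (e.g. it contains a line break).
    static String authorityOf(String uriReference) {
        Matcher m = URI_REFERENCE.matcher(uriReference);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        // Example URI taken from RFC 3986, Appendix B.
        System.out.println(authorityOf("http://www.ics.uci.edu/pub/ietf/uri/#Related")); // www.ics.uci.edu
        // A relative reference has no authority.
        System.out.println(authorityOf("www/foo")); // null
    }
}
```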

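Also, for URI netUrl = new URI("www.google.com"); netUrl.getHost() returns NULL. I think I still need to check for http:// or https:// (8 upvotes)
@Jitendra, I recommend you not work on fixing them. The Java library folks have already done the work for you. (2 upvotes)
@Jitendra, `www.google.com` is a relative URL whose path component is `www.google.com`. For example, if resolved against `http://example.com/`, you would get `http://example.com/www.google.com`. (2 upvotes)
If the URI host contains special characters, it will be null, for example: "öob.se" (2 upvotes)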

Answer 2

小智 70

import java.net.*;
import java.io.*;

public class ParseURL {
  public static void main(String[] args) throws Exception {

    URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                       + "/index.html?name=networking#DOWNLOADING");

    System.out.println("protocol = " + aURL.getProtocol()); //http
    System.out.println("authority = " + aURL.getAuthority()); //example.com:80
    System.out.println("host = " + aURL.getHost()); //example.com
    System.out.println("port = " + aURL.getPort()); //80
    System.out.println("path = " + aURL.getPath()); //  /docs/books/tutorial/index.html
    System.out.println("query = " + aURL.getQuery()); //name=networking
    System.out.println("filename = " + aURL.getFile()); // /docs/books/tutorial/index.html?name=networking
    System.out.println("ref = " + aURL.getRef()); //DOWNLOADING
  }
}

It is from https://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html (2 upvotes)
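
Since another answer here recommends java.net.URI over java.net.URL, the same breakdown can be sketched with URI accessors. This is not from the Oracle tutorial; the ParseURI class and sample() helper are hypothetical names. URI has no getFile(), and getRef() corresponds to getFragment().

```java
import java.net.URI;

final class ParseURI {

    // Same sample address as in the URL-based example.
    static URI sample() {
        return URI.create("http://example.com:80/docs/books/tutorial"
                          + "/index.html?name=networking#DOWNLOADING");
    }

    public static void main(String[] args) {
        URI aURI = sample();
        System.out.println("scheme = " + aURI.getScheme());       // http
        System.out.println("authority = " + aURI.getAuthority()); // example.com:80
        System.out.println("host = " + aURI.getHost());           // example.com
        System.out.println("port = " + aURI.getPort());           // 80
        System.out.println("path = " + aURI.getPath());           // /docs/books/tutorial/index.html
        System.out.println("query = " + aURI.getQuery());         // name=networking
        System.out.println("fragment = " + aURI.getFragment());   // DOWNLOADING
    }
}
```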

Answer 3

Kir*_*rby 12

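Here is a short one-liner using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()

Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.

As Sa Qada commented on another answer to this post, this question has been asked before: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()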

public boolean isTopPrivateDomain()

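Indicates whether this domain name is composed of exactly one subdomain component followed by a public suffix. For example, returns true for google.com and foo.co.uk, but not for www.google.com or co.uk.

Warning: a true result from this method does not imply that the domain is at the highest level which is addressable as a host, as many public suffixes are also addressable hosts. For example, the domain bar.uk.com has a public suffix of uk.com, so it would return true from this method. But uk.com is itself an addressable host.

This method can be used to determine whether a domain is probably the highest level for which cookies may be set, though even that depends on individual browsers' implementations of cookie controls. See RFC 2109 for details.

Putting that together with URL.getHost(), which the original post already contains, gives you: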

import com.google.common.net.InternetDomainName;

import java.net.URL;

public class DomainNameMain {

  public static void main(final String... args) throws Exception {
    final String urlString = "http://www.google.com/blah";
    final URL url = new URL(urlString);
    final String host = url.getHost();
    final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
    System.out.println(urlString);
    System.out.println(host);
    System.out.println(name);
  }
}

Answer 4

Adi*_*ain 5

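I wrote a method (see below) which extracts a URL's domain name using simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there is no "://") and the first subsequent "/" (or index String.length() if there is no subsequent "/"). Any leading "www(_)*." bit is then chopped off. I'm sure there will be cases where this isn't good enough, but it should be good enough in most cases!

Mike Samuel's post above says that the java.net.URI class could do this (and is preferred to the java.net.URL class), but I ran into problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.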

/**
 * Extracts the domain name from {@code url}
 * by means of String manipulation
 * rather than using the {@link URI} or {@link URL} class.
 *
 * @param url is non-null.
 * @return the domain name within {@code url}.
 */
public String getUrlDomainName(String url) {
  String domainName = url;

  int index = domainName.indexOf("://");

  if (index != -1) {
    // keep everything after the "://"
    domainName = domainName.substring(index + 3);
  }

  index = domainName.indexOf('/');

  if (index != -1) {
    // keep everything before the '/'
    domainName = domainName.substring(0, index);
  }

  // check for and remove a preceding 'www'
  // followed by any sequence of characters (non-greedy)
  // followed by a '.'
  // from the beginning of the string
  domainName = domainName.replaceFirst("^www.*?\\.", "");

  return domainName;
}
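
For comparison, the scheme-less case can also be handled while staying with java.net.URI. The sketch below is only one possible approach and rests on a loud assumption: that scheme-less input such as www.google.com/blah was meant to start with a host name, which is not what the URI grammar says (it is a relative path). The LenientHost class and its methods are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LenientHost {

    // Hypothetical helper: returns the host, guessing that input without a
    // scheme was meant to begin with a host name. Returns null when no host
    // can be determined.
    static String hostOf(String input) {
        try {
            URI uri = new URI(input);
            if (uri.getHost() == null && uri.getScheme() == null && !input.startsWith("//")) {
                // Assumption: reparse as a network-path reference ("//host/path").
                uri = new URI("//" + input);
            }
            return uri.getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Strips a leading "www." label only, as asked for in the question.
    static String domainOf(String input) {
        String host = hostOf(input);
        if (host == null) {
            return null;
        }
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("www.google.com/blah"));    // google.com
        System.out.println(domainOf("http://google.com/blah")); // google.com
        System.out.println(domainOf("wwwexample.com"));         // wwwexample.com
    }
}
```

One caveat with this guess: input of the form localhost:8080/x parses with localhost as the scheme, so the helper returns null for it rather than a host.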

