PHP URL validation

Roh*_*pra 0 php regex

I know there are countless threads asking this question, but I couldn't find one that helps me solve this.

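I'm basically trying to parse a list of about 10,000,000 URLs, make sure each one meets the criteria below, and then extract its root domain. The list contains everything you can imagine, including (alongside properly formatted URLs) entries like: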

bit.ly/test [VALID] [return - bit.ly]
example.com/apples?test=1&id=4 [VALID] [return - example.com]
host101.wow404.apples.test.com/cert/blah [VALID] [return - test.com]
101.121.44.xxx [INVALID] [return false]
localhost/noway [INVALID] [return false]
www.awesome.com [VALID] [return - awesome.com]
i am so awesome [INVALID] [return false]
http://404.mynewsite.com/visits/page/view/1/ [VALID] [return - mynewsite.com]
www1.151.com/searchresults [VALID] [return - 151.com]

Does anyone have any suggestions for this?

Tom*_*lak 14

^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)

Explanation

^                # start-of-line
(?:              # begin non-capturing group
  https?         #   "http" or "https"
  ://            #   "://"
)?               # end non-capturing group, make optional
(?:              # start non-capturing group
  [a-z0-9-]+\.   #   a name part (numbers, ASCII letters, dashes) & a dot
)*               # end non-capturing group, match as often as possible
(                # begin group 1 (this will be the domain name)
  (?:            #   start non-capturing group
    [a-z0-9-]+\. #     a name part, same as above
  )              #   end non-capturing group
  [a-z]+         #   the TLD
)                # end group 1 

http://rubular.com/r/g6s9bQpNnC
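
For anyone who wants to try the pattern from PHP, here is a minimal sketch using `preg_match`. The function name `rootDomain`, the `~` delimiter, the `i` flag, and the `trim()`/`strtolower()` calls are my own assumptions for illustration and are not part of the answer above; only the pattern itself is taken from it.

```php
<?php
// Hypothetical helper (name and signature are assumptions, not from the answer).
// Returns the captured "name.tld" part in lower case, or false when the input
// does not match the pattern.
function rootDomain($url)
{
    // Pattern from the answer. "~" is used as the delimiter so the slashes in
    // "://" need no escaping; the "i" flag is an added assumption so that
    // upper-case input matches as well.
    $pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)~i';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, trim($url), $m) !== 1) {
        return false;
    }

    // Group 1 holds the last "label.tld" pair before the path.
    return strtolower($m[1]);
}

// A few of the sample inputs from the question.
$samples = array(
    'bit.ly/test',
    'host101.wow404.apples.test.com/cert/blah',
    'localhost/noway',
    'i am so awesome',
    'http://404.mynewsite.com/visits/page/view/1/',
);

foreach ($samples as $s) {
    // var_export prints false as "false" and strings in quotes.
    echo $s, ' => ', var_export(rootDomain($s), true), "\n";
}
```

With these inputs the sketch should yield `bit.ly`, `test.com`, `false`, `false` and `mynewsite.com`. Two limits worth keeping in mind: an address made only of digits is rejected because the TLD part must be letters, and since the pattern always keeps the last two labels, a host under a two-part suffix such as `example.co.uk` would come back as `co.uk`.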

  • Thanks a million for this. Love the explanation. (2 upvotes)
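  • For readers: keep in mind that URLs can contain non-ASCII characters. This regex does not match `http://myurl.com/?utf8=✓`, see (http://rubular.com/r/I4fvV3VHVT). Adding the utf8 parameter is a trick to force UTF-8 encoding in older browsers, see (http://programmers.stackexchange.com/questions/168751/is-the-use-of-utf8-preferable-to-utf8-true) (2 upvotes)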