I have this sample vector of URLs. My goal is to extract the path of each URL.
sample1 <- c("http://tercihblog.com/indirisu/docugard/", "http://funerariagomez.com/js/ggogle/a201209e3f79b740337b7bdb521630fe/",
"http://www.t-online.de/contacts/2015/08/atlas.html/", "http://mgracetimber.ie/wp-content/themes/Banner/db/box/",
"http://zamartrade.com/cs/DHL/DHL%20_%20Tracking.htm/", "http://dunhamengineering.com/menu/Auto-loadgoogleDrive/Document.Index/",
"http://www.indiegogo.com/guide/forum/2014/09/forgot-password/",
"http://raetc.com/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/",
"http://www.lidanhang.com/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/",
"http://www.sudaener.com/wp-includes/js/crop/dropbox/", "https://zeustracker.abuse.ch/blocklist.php/",
"https://zeustracker.abuse.ch/blocklist.php?download=hostsdeny/",
"https://zeustracker.abuse.ch/blocklist.php?download=iptablesblocklist/",
"https://zeustracker.abuse.ch/blocklist.php?download=snort/",
"https://zeustracker.abuse.ch/blocklist.php?download=squiddomain/"
)
My initial attempt was this:
gsub('http://[^/]+/','/',sample1)
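However, this does not work for URLs that start with https://. A suitable solution would be to discard everything before the third occurrence of "/". I would like to know how to do this with a regex, and whether there is a way to do it with substring.
Thanks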
It is better to go with gsub, since the code is cleaner and more straightforward.
If you want to remove everything before the 3rd /, use
> gsub('^(?:[^/]*/){3}','/',sample1)
[1] "/indirisu/docugard/"
[2] "/js/ggogle/a201209e3f79b740337b7bdb521630fe/"
[3] "/contacts/2015/08/atlas.html/"
[4] "/wp-content/themes/Banner/db/box/"
[5] "/cs/DHL/DHL%20_%20Tracking.htm/"
[6] "/menu/Auto-loadgoogleDrive/Document.Index/"
[7] "/guide/forum/2014/09/forgot-password/"
[8] "/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/"
[9] "/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/"
[10] "/wp-includes/js/crop/dropbox/"
[11] "/blocklist.php/"
[12] "/blocklist.php?download=hostsdeny/"
[13] "/blocklist.php?download=iptablesblocklist/"
[14] "/blocklist.php?download=snort/"
[15] "/blocklist.php?download=squiddomain/"
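The question also asks about substring. A minimal sketch of that route, assuming every URL contains at least three slashes: locate the slashes with gregexpr and cut from the third one with substring. The helper name path_from_third_slash is hypothetical and only used for illustration.

```r
# Hypothetical helper: keep everything from the third "/" onward.
path_from_third_slash <- function(urls) {
  # Positions of every literal "/" in each string (fixed = TRUE: no regex)
  slash_pos <- gregexpr("/", urls, fixed = TRUE)
  # Position of the third slash, or NA when a string has fewer than three
  third <- vapply(
    slash_pos,
    function(p) if (length(p) >= 3) p[3] else NA_integer_,
    integer(1)
  )
  # substring() is vectorised over the start positions
  substring(urls, third)
}

urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php?download=snort/")
path_from_third_slash(urls)
```

This needs more code than the gsub one-liner, which is one reason to prefer the regex here.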
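The ^(?:[^/]*/){3} pattern matches:
^ - the start of the string
(?:[^/]*/){3} - exactly 3 occurrences of:
  [^/]* - zero or more characters other than /
  / - a literal / character.
The fix to your regex that Cath suggested is more precise, but you may also want to add ^ at the start so that it only matches at the beginning of the string: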
gsub('^https?://[^/]+/','/',sample1)
      ^     ^
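The ? (greedy) quantifier matches one or zero occurrences, which makes the s after http optional. It is equivalent to, though more efficient than, gsub('^(https|http)://[^/]+/','/',sample1).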
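As a quick check of that equivalence, the sketch below runs both patterns on two URLs taken from the sample vector and compares the results. The variable names are only for illustration.

```r
urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/")

# Optional "s" via the ? quantifier
with_optional_s <- gsub('^https?://[^/]+/', '/', urls)
# Same idea spelled out as an alternation
with_alternation <- gsub('^(https|http)://[^/]+/', '/', urls)

# Both patterns strip the scheme and host and keep the path
identical(with_optional_s, with_alternation)
```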
You may also want to make your regex case-insensitive by adding ignore.case = TRUE.
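A small sketch of what ignore.case changes. The URLs here are made up for illustration (they are not in the sample vector) and assume the scheme may arrive in upper case.

```r
# Hypothetical input: the first URL has an upper-case scheme
urls <- c("HTTP://Example.COM/Some/Path/", "https://example.com/other/")

# Case-sensitive: the "HTTP://" URL does not match and is left as it is
case_sensitive <- gsub('^https?://[^/]+/', '/', urls)

# Case-insensitive: both URLs are reduced to their path
case_insensitive <- gsub('^https?://[^/]+/', '/', urls, ignore.case = TRUE)

case_sensitive
case_insensitive
```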