ras*_*t22 15 javascript screen-scraping amazon-ec2
Suppose I have an Amazon product URL like this:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846
How can I scrape the ASIN out of it with JavaScript? Thanks!
Gum*_*mbo 21
Since the ASIN is always a sequence of 10 letters and/or digits that follows a slash, try this:
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")
The (?:[/?]|$) appended after the ASIN group ensures that only a complete path segment is matched.
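As a minimal sketch, the pattern above could be wrapped in a small helper. The function name extractAsin is hypothetical and not part of the answer; the regex is the one quoted above, written as a literal instead of a string. The sample URL is the one from the question with its query string shortened.

```javascript
// Hypothetical helper (the name is an assumption, not from the answer).
function extractAsin(url) {
  // A slash, exactly 10 letters/digits, then "/", "?" or end of string.
  var m = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);
  // Capture group 1 holds the 10-character segment; null means no match.
  return m ? m[1] : null;
}

var sample = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER";
console.log(extractAsin(sample));
```

Returning null rather than throwing lets the caller decide how to handle URLs that contain no 10-character path segment.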
jps*_*ons 20
Amazon detail pages can take several forms, so check all of them thoroughly. These are all equivalent:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
They always look like one of these two forms:
http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN
This should do it:
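var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Optional SEO string, then dp or gp/product, an optional view segment, then the 10-character ASIN.
// The dots in the host name are escaped so they only match a literal ".".
var regex = RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
  // Group 4 is the ASIN.
  alert("ASIN=" + m[4]);
}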
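Actually, the top answer does not work for a URL like amazon.com/BlackBerry (because "BlackBerry" is also 10 characters long).
One workaround (assuming the ASIN is always uppercase, as it is whenever it comes from Amazon) is, in Ruby: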
url.match("/([A-Z0-9]{10})")
I have found that it works on thousands of URLs.
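Since the question asks for JavaScript, here is a rough port of the comparison, offered as a sketch only. The helper firstGroup and the variable names are made up for illustration; the two patterns are the ones quoted in the answers above.

```javascript
// Pattern from the top answer: any 10 letters/digits as a full path segment.
var loose = /\/([a-zA-Z0-9]{10})(?:[\/?]|$)/;
// Uppercase-only variant from this answer (relies on ASINs being uppercase).
var upper = /\/([A-Z0-9]{10})/;

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function firstGroup(re, s) {
  var m = s.match(re);
  return m ? m[1] : null;
}

var brandUrl = "http://www.amazon.com/BlackBerry";
var productUrl = "http://www.amazon.com/dp/B0015T963C";

console.log(firstGroup(loose, brandUrl));   // the false positive described above
console.log(firstGroup(upper, brandUrl));   // no match for the mixed-case word
console.log(firstGroup(upper, productUrl)); // the ASIN
```

Note that the uppercase pattern has no trailing check, so it would also accept the first 10 characters of a longer all-uppercase path segment.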
None of the approaches above works in every case. I tried the following URLs, which extend the examples above:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop
https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN
https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4
https://www.amazon.de/dp/B01N32MQOA?psc=1
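This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10})
In every case the overall match also includes the leading /, which can then be stripped off (capture group 1 holds the ASIN on its own).
You can test it at: http://regexr.com/3gk2s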
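For a quick local check without the online tester, the following sketch runs that pattern over the URLs listed above (long query strings trimmed for brevity). The names asinPattern and samples are made up for this example, and the expected ASINs are read off the URLs by hand.

```javascript
// Pattern quoted from the answer, written as a JavaScript regex literal.
var asinPattern = /(?:[\/dp\/]|$)([A-Z0-9]{10})/;

// [url, expected ASIN] pairs taken from the list above.
var samples = [
  ["http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C", "B0015T963C"],
  ["http://www.amazon.com/dp/B0015T963C", "B0015T963C"],
  ["http://www.amazon.com/gp/product/B0015T963C", "B0015T963C"],
  ["http://www.amazon.com/gp/product/glance/B0015T963C", "B0015T963C"],
  ["https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1", "B00LGAQ7NW"],
  ["https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3", "B00FA2RLX2"],
  ["https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5", "B01DFJTYSQ"],
  ["https://www.amazon.de/dp/B01N32MQOA?psc=1", "B01N32MQOA"]
];

samples.forEach(function (pair) {
  var m = pair[0].match(asinPattern);
  // m[0] is the whole match (leading character included); m[1] is the ASIN alone.
  var found = m ? m[1] : null;
  console.log(found, found === pair[1] ? "ok" : "MISMATCH");
});
```

Reading m[1] instead of m[0] avoids having to strip the leading character afterwards.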