Scrape the ASIN from an Amazon URL using JavaScript

ras*_*t22 15 javascript screen-scraping amazon-ec2

Suppose I have an Amazon product URL like this:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846

How can I scrape the ASIN from it using JavaScript? Thanks!

Gum*_*mbo 21

Since the ASIN is always a sequence of 10 letters and/or digits right after a slash, try the following:

url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")

The (?:[/?]|$) appended after the ASIN ensures that only a complete path segment is matched.
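To read the ASIN itself rather than the whole match, take the first capture group. A minimal sketch, assuming a hypothetical helper name `extractAsin` (not part of the answer) and using a regex literal instead of a string:

```javascript
// Hypothetical helper wrapping the pattern from this answer.
function extractAsin(url) {
  // A slash, ten alphanumerics, then "/", "?" or the end of the string.
  const match = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);
  // Group 1 is the ASIN without the leading slash.
  return match ? match[1] : null;
}

const url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER";
console.log(extractAsin(url)); // "B0015T963C"
```

Returning null when nothing matches lets the caller distinguish "no ASIN found" from an empty string.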

  • There are several cases where this doesn't work: http://www.amazon.com/BEAUTBRIDE-Womens-Beaded-Wedding-Fingerless/dp/B010Q0Y92I ... http://www.amazon.com/LOSLANDIFEN-Elegant-Stiletto-Wedding-6041-04Silk42/dp/B019PMTJH8. I can confirm this, because I used a similar approach :) (2 upvotes)

jps*_*ons 20

Amazon detail pages can take several forms, so to be thorough you should check for all of them. These are all equivalent:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

They always look like one of these two forms:

http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN

This should do it:

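var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Optional SEO string, then dp or gp/product, an optional view segment, then the 10-character ASIN.
var regex = RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
    alert("ASIN=" + m[4]); // group 4 holds the ASIN
}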

  • Building on this, and adding support for international characters, odd ports, https, non-US domains and query/tracking parameters (I'm using Java), it would be: Pattern asinPattern = Pattern.compile("^(http[s]?://)?([\\w.-]+)(:[0-9]+)?/([\\w-%]+/)?(dp|gp/product|exec/obidos/asin)/(\\w+/)?(\\w{10})(.*)?$"); (6 upvotes)
  • Another possible form: amazon.com/exec/obidos/asin/B0015T963C. To be fully comprehensive, the regex can be extended with `dp|gp/product|exec/obidos/asin`. (2 upvotes)
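The suggestions in the comments above can be combined into a single JavaScript pattern. This is only a sketch under a few assumptions: the names `ASIN_URL_PATTERN` and `asinFromDetailUrl` are made up for illustration, the host is not restricted to Amazon domains, and the trailing `(?:[/?#]|$)` check is an addition so that only a complete path segment is accepted.

```javascript
// Sketch only: host is unrestricted and the prefix list comes from the comments above.
const ASIN_URL_PATTERN = new RegExp(
  "^(?:https?://)?[\\w.-]+(?::\\d+)?/" +    // optional scheme, any host, optional port
  "(?:[\\w%-]+/)?" +                         // optional SEO string
  "(?:dp|gp/product|exec/obidos/asin)/" +    // known detail-page prefixes
  "(?:\\w+/)??" +                            // optional view segment (lazy), e.g. "glance/"
  "(\\w{10})" +                              // the ASIN itself
  "(?:[/?#]|$)"                              // ASIN must end the path segment
);

// Hypothetical helper: returns the ASIN or null when the URL has no known form.
function asinFromDetailUrl(url) {
  const match = ASIN_URL_PATTERN.exec(url);
  return match ? match[1] : null;
}

console.log(asinFromDetailUrl("https://www.amazon.de/dp/B01N32MQOA?psc=1"));
console.log(asinFromDetailUrl("amazon.com/exec/obidos/asin/B0015T963C"));
```

The view segment is matched lazily so that the segment directly after the prefix is tried as the ASIN first, and the view form is used only when that fails.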

osm*_*man 9

Actually, the top answer doesn't work if the URL is something like amazon.com/BlackBerry... (because "BlackBerry" is also 10 characters long).

One workaround (assuming the ASIN is always uppercase, which it is when it comes from Amazon) is, in Ruby:

        url.match("/([A-Z0-9]{10})")

I have found that it works on thousands of URLs.
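Since the question asks for JavaScript, the same uppercase-only idea might look like the sketch below. The function name `findUppercaseAsin` is hypothetical, and the uppercase assumption is the one stated in this answer.

```javascript
// Hypothetical JavaScript port of the Ruby one-liner above.
function findUppercaseAsin(url) {
  // A slash followed by ten uppercase letters or digits; group 1 is the ASIN.
  const match = url.match(/\/([A-Z0-9]{10})/);
  return match ? match[1] : null;
}

console.log(findUppercaseAsin("http://www.amazon.com/BlackBerry"));    // null
console.log(findUppercaseAsin("http://www.amazon.com/dp/B0015T963C")); // "B0015T963C"
```

Note that this pattern has no check after the ten characters, so a longer all-uppercase path segment would match on its first ten characters.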


Cha*_*kin 5

None of the approaches above works in all cases. I tried the following URLs, in line with the examples above:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop

https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN

https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4

https://www.amazon.de/dp/B01N32MQOA?psc=1

This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10}). In all cases the match also includes the leading /, which can then be removed.

You can test it at: http://regexr.com/3gk2s
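For a quick local check, the pattern can be run against some of the URLs listed in this answer. The sketch below uses hypothetical names (`candidatePattern`, `sampleUrls`, `asinOf`) and reads capture group 1, so the leading slash never needs to be stripped. Note that [/dp/] is a character class: it matches a single "/", "d" or "p", not the literal text "/dp/".

```javascript
// Pattern copied from this answer; the character class matches one "/", "d" or "p".
const candidatePattern = new RegExp("(?:[/dp/]|$)([A-Z0-9]{10})");

// A few of the URLs from the list above (query strings shortened), with the ASIN expected for each.
const sampleUrls = {
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C": "B0015T963C",
  "http://www.amazon.com/gp/product/glance/B0015T963C": "B0015T963C",
  "https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8": "B00LGAQ7NW",
  "https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5": "B01DFJTYSQ",
  "https://www.amazon.de/dp/B01N32MQOA?psc=1": "B01N32MQOA"
};

// Group 1 holds only the ten-character ASIN.
function asinOf(url) {
  const match = candidatePattern.exec(url);
  return match ? match[1] : null;
}

for (const [url, expected] of Object.entries(sampleUrls)) {
  console.log(asinOf(url) === expected ? "ok  " : "FAIL", url);
}
```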