Bar*_*ssk 4 html xpath r image web-scraping
我出于不同目的使用“rvest”包进行网络抓取。现在我需要使用它从谷歌图像获取图像对象(png)的来源。我已经尝试过此链接上的解决方案:Web scraping of image。它正是我想做的。所以我想出了下面的代码,但我的 html_nodes 函数得到空对象。
library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath='//*[@id="rg_s"]/div[1]/a/img')
src <- html_attr(node,"src")
Run Code Online (Sandbox Code Playgroud)
我还尝试了 css 选择器和图像的名称,因为它是在我上面给出的链接上完成的。我的节点对象在任何方面都是空的。我还应该指出,我想抓取链接上第一个图像的源,该图像具有我上面写的 xpath。先感谢您。
我认为它工作正常,只是您还没有足够了解该文件的构成,即可能没有与您编写的 xpath 选择器相对应的节点。
\n\n例如这里我选择所有<img>节点并将其打印出来:
library("rvest")\npage <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")\nnode <- html_nodes(page,xpath = \'//img\')\nnode\nRun Code Online (Sandbox Code Playgroud)\n\n产量:
\n\n{xml_nodeset (21)}\n [1] <img style="padding-top:2px" src="/textinputassistant/tia.png" onclick="(function(){var text_input_assistant_js=\'/textinputassistant/11/tr_tia.js\';var s = document.createElement(\'s ...\n [2] <img height="113" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRg92_01ZbpYpV_agaHP4M3GoRoaCsZW5Sym8eqcXG8M1iJ8Nag1SXufq8" width="150" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n [3] <img height="98" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSbJOecoEPbrJjZ-TjJMgMwlulXRMPLBWZX45vwUJNVXZk5MeY1chaZ07Y" width="143" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n [4] <img height="79" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcStpgymO--9B7R3O3OZJFrDsuOUuP94HwwNw-av9tUyjziG3sCl6M9s7G4" width="141" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n [5] <img height="95" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTkibMqBWEifcyw_d-vrNob6UqYP-hDFPoQG2pkzVsP5bgmbReFWqyHjWA" width="143" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n [6] <img height="91" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRhqrV1f--7QrQwovNBUHIpDFHe8Zwwad3UIvnwppv74GRIrsI1XYNPkFOg" width="150" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n [7] <img height="112" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1gpUEBucliP4WK2_22K4wElI2lIrDs2PZT7sRCLXK1Yxjg7DoQ2BtyLat" width="142" alt="manitou ile ilgili g\xc3\xb6rsel ...\n [8] <img height="69" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSssiUhuZe_1YmQ9dwmYHdKoFXyQBj9IQPGX_LU8msjekOvRRHDG9FmoaD_" width="140" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n [9] <img height="113" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCM9Mu6K63QpzNk20HFrHkybi--dw3JPu5JDd4LSEqz3UT5TBU5I0owLU" width="150" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n[10] <img height="95" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcS8sI3fBSJmjftqC9Rx2bhXh_xgP3-nS2WuD2as9U_87SLxggQvmo2awDk" width="143" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n[11] <img height="83" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT-gf45JbC4Q4lD3hioj_CP6imrO5RUWBeW6IuygNaN8LM1qydX56l5gFx4" width="148" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n[12] <img height="84" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6tnxPJYeS48IoNAlN0D52U5TNjmq7Ta-GcPNifM4_k40Y2D8LDj5-e-Wz" width="150" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n[13] <img height="140" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTwmI9PxfLBT2dCPnR04I9pXmK8V9whAI2yEv4dX5qQq8G_JxHUAOwQB1mSTg" width="140" alt="manitou ile ilgili g\xc3\xb6rse ...\n[14] <img height="71" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQNx2Pe1AZtT-0XQ44HSurWO6O2syXrXG6YPfggtZsTHaf6YXuQlcmMOu0" width="150" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n[15] <img height="130" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGACLfeRm6U0xwSeYncSUDQtcd4noTewVF4aGnQcgz6TWYwwr917mjEtB6" width="113" alt="manitou ile ilgili g\xc3\xb6rsel ...\n[16] <img height="107" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ1RwAscQpzVXfquuAoPaLE9hFMuZSOpo6ckOzdpkTmg3KiswOIZIDTqrU" width="143" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n[17] <img height="98" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcTE5sLf71TxAYla6nlfLRgXwL1IC-gXzXQRq1ZcnB21c5NXmQklJyNeqEs" width="148" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n[18] <img height="91" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRHQjJ-Hc0Muy6Vjw5OlQZocflSCqR3oz0GBRu3Bs7_JCoNyjr5vjNP7KZ4" width="137" alt="manitou ile ilgili g\xc3\xb6rsel s ...\n[19] <img height="68" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR8R_39V3bxWJUDdNhrsAS6YOYEg6U-QpaLEV0MQ5GBnVkeZa9lSB5MaGU" width="149" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n[20] <img height="99" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIrnwcUbo9WYT-gyvrLb5g4JFEc27odkzzU6SwzxrxvrsajRMD1OroUaY" width="116" alt="manitou ile ilgili g\xc3\xb6rsel so ...\n...\n> \nRun Code Online (Sandbox Code Playgroud)\n\n这是第一个节点:
\n\n>node[[1]]\n{xml_node} <img style="padding-top:2px"\nsrc="/textinputassistant/tia.png" onclick="(function(){var\n text_input_assistant_js=\'/textinputassistant/11/tr_tia.js\';var s =\n document.createElement(\'script\');s.src =\n text_input_assistant_js;(document.getElementById(\'xjsc\')||\n document.body).appendChild(s);})();" \n alt="" height="23" width="27">\nRun Code Online (Sandbox Code Playgroud)\n