Extracting the URL for each city in a list using rvest

Rau*_*aul 3 r rvest tidyverse

I have been exploring the rvest package and have a question about extracting URLs from a list. My goal is to build a data frame with three columns: Country, City, and the URL for the city. I already have a data frame with the countries and a list with the cities for each country.

How can I reference each city so that I can obtain its respective URL? I am trying to get the href inside the td elements of the table with class "wikitable sortable jquery-tablesorter", but when I run links = webpage %>% html_node("href") %>% html_text() I only get the main URL.

Thanks for the suggestions!

# Get URL
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"

# Read the HTML code from the website
page = read_html(url)

# Get name of the countries
countries = page %>% html_nodes(".mw-headline") %>% html_text()

#Remove the last two items which are not countries
countries = as_tibble(countries) %>%
  slice(1:(n()-2))

#Add row number to each Country to left_join later
countries = rowid_to_column(countries, "column_label")

# Get cities for that country
# Still working on this since it includes the first table and I get blanks when I filter the html_nodes(".jquery-tablesorter td")
tables = html_nodes(page, "table")
tables = lapply(tables, html_table)

#Remove first element which is not a city, only on the first page
tables = tables[-1]

#---WIP
# Get links for the cities, currently picks the main domain instead of the city
# Can I add a clause before the html node to indicate I want the href from "wikitable sortable jquery-tablesorter"?
links = page %>% html_attr("href") %>% html_text()
#---

#Remove the Province and Population columns, keeping only City
tables = lapply(tables, "[", -c(2, 3))

#Standardize City as the column
tables = map(tables, set_names, "City")

# Flatten List
all <- bind_rows(tables, .id = "column_label") %>%
  mutate(column_label = as.integer(column_label)) %>%
  left_join(countries, by = "column_label")

All*_*ron 6

Here's a fully reproducible example that gets you a table of the cities with their full url:

library(tidyverse)
library(rvest)

"https://en.wikipedia.org/wiki/" %>%
  paste0('List_of_towns_and_cities_with_100,000_or_more_inhabitants/') %>%
  paste0('country:_A-B') %>%
  read_html() %>%
  html_nodes(xpath = "//table/tbody/tr") %>%
  lapply(function(x) {
    node <- xml2::xml_find_first(x, 'td/a') 
    data.frame(city = html_attr(node, 'title'), 
               # href already starts with "/wiki/", so prepend only the host
               url = paste0("https://en.wikipedia.org",
                            html_attr(node, 'href')))}) %>%
  bind_rows() %>%
  remove_missing(na.rm = TRUE) %>%
  as_tibble()
#> # A tibble: 534 x 2
#>    city           url
#>    <chr>          <chr>
#>  1 Ghazni         https://en.wikipedia.org/wiki/Ghazni
#>  2 Herat          https://en.wikipedia.org/wiki/Herat
#>  3 Jalalabad      https://en.wikipedia.org/wiki/Jalalabad
#>  4 Kabul          https://en.wikipedia.org/wiki/Kabul
#>  5 Kandahar       https://en.wikipedia.org/wiki/Kandahar
#>  6 Khost          https://en.wikipedia.org/wiki/Khost
#>  7 Kunduz         https://en.wikipedia.org/wiki/Kunduz
#>  8 Lashkargah     https://en.wikipedia.org/wiki/Lashkargah
#>  9 Mazar-i-Sharif https://en.wikipedia.org/wiki/Mazar-i-Sharif
#> 10 Mihtarlam      https://en.wikipedia.org/wiki/Mihtarlam
#> # ... with 524 more rows

Created on 2023-01-06 with reprex v2.0.2
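
If you also need the Country column, here is a minimal sketch of one way to do it. It runs offline against a small hand-written HTML string that only imitates the Wikipedia layout (an h2 per country followed by a "wikitable sortable" table), so the markup, the `html` variable and the `per_table` helper name are assumptions for illustration, not the real page. Two points it is meant to show: the selector is scoped to `table.wikitable` rather than `.jquery-tablesorter` (that class appears to be added by JavaScript in the browser, which would explain the blanks you got), and the link target is read with html_attr("href"), since html_text() returns only the visible link text.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page so the sketch runs offline.
html <- '
<html><body>
  <a href="/wiki/Main_Page">Main page</a>
  <h2>Afghanistan</h2>
  <table class="wikitable sortable"><tbody>
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Ghazni">Ghazni</a></td><td>Ghazni</td></tr>
    <tr><td><a href="/wiki/Herat">Herat</a></td><td>Herat</td></tr>
  </tbody></table>
  <h2>Albania</h2>
  <table class="wikitable sortable"><tbody>
    <tr><th>City</th><th>County</th></tr>
    <tr><td><a href="/wiki/Tirana">Tirana</a></td><td>Tirana</td></tr>
  </tbody></table>
</body></html>'

page <- read_html(html)

# One data frame per country table.
per_table <- lapply(html_nodes(page, "table.wikitable"), function(tbl) {
  # Assumption: the nearest h2 before the table names the country.
  country <- html_text(html_node(tbl, xpath = "preceding::h2[1]"))
  # Links in the first cell of each row are taken to be the city links.
  links <- html_nodes(tbl, "td:first-child a")
  data.frame(
    country = rep(country, length(links)),
    city    = html_text(links),
    # href is site-relative ("/wiki/..."), so prepend only the host.
    url     = paste0("https://en.wikipedia.org", html_attr(links, "href")),
    stringsAsFactors = FALSE
  )
})

cities <- do.call(rbind, per_table)
print(cities)
```

On the real page the headings and the first table may be structured differently, so check the XPath and the selector against the actual HTML before relying on this.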