Ari*_*ant 1 json r dataframe data.table
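I have a text/HTML file in JSON format with 158 rows and 25 columns, and I have been trying to convert it to a data frame so that I can write it out as a .csv. I tried the "rjson" and "jsonlite" packages to read the data, and then tried two ways of converting it into a data table.
1. Using jsonlite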
library(jsonlite)
json_file = "projectslocations.html"
json_datan <- fromJSON(json_file)
The resulting data structure has only one row, containing 158 variables.
2. Using jsonlite and data.table
library(jsonlite)
library(data.table)
json_dat <- fromJSON(json_file)
class(json_dat)
lst= rbindlist(json_dat, fill=TRUE)
This shows a data.frame with 158 rows and 25 variables. However, I cannot write this data frame to a csv, or even view it.
Error:
Error in FUN(X[[i]], ...) :
Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.
The raw data is available here.
Here is how I would wrangle the data, using the purrr package for some functional programming together with the dplyr package for data munging:
library(jsonlite)
library(purrr)
library(dplyr)
# load JSON data and parse to list in R
json_file = file("projects.txt")
json_data <- fromJSON(json_file, simplifyDataFrame = FALSE)[[1]]
# extract location data separately and create a data.frame with a project id column
# (at_depth() was renamed to modify_depth() in later purrr releases)
locations <-
  json_data %>%
  modify_depth(1, "locations") %>%
  modify_depth(2, ~data.frame(.x, stringsAsFactors = FALSE)) %>%
  map(~bind_rows(.x)) %>%
  bind_rows(.id = "id")
# prefix 'location_' to all location fields
colnames(locations) <- paste0("location_", colnames(locations))
# extract all project data excluding location data and create a data.frame
projects <-
  json_data %>%
  map(function(x) {x$locations <- NULL; x}) %>%
  map(~data.frame(as.list(unlist(.x)), stringsAsFactors = FALSE)) %>%
  bind_rows()
# join project and location data to yield a final denormalised data structure
projects_and_locations <-
  projects %>%
  inner_join(locations, by = c('id' = 'location_id'))
# details of single row of final denormalised data.frame
str(projects_and_locations[1,])
# 'data.frame': 1 obs. of 32 variables:
# $ id : chr "P130343"
# $ project_name : chr "MENA- Desert Ecosystems and Livelihoods Knowledge Sharing an"
# $ pl : chr "Global Environment Project"
# $ fy : chr "2013"
# $ ca : chr "$1.00M"
# $ gpname : chr "Environment & Natural Resources"
# $ s : chr "Environment"
# $ ttl : chr "Taoufiq Bennouna"
# $ ttlupi : chr "000314228"
# $ sbc : chr "ENV"
# $ sbn : chr "Environment"
# $ boardapprovaldate : chr "23-May-2013"
# $ crd : chr "16-Feb-2012"
# $ dmd : chr ""
# $ ed : chr "10-Jun-2013"
# $ fdd : chr "04-Dec-2013"
# $ rcd : chr "31-Dec-2017"
# $ fc : chr "false"
# $ totalamt : chr "$1.00M"
# $ url : chr "http://www.worldbank.org/projects/P130343?lang=en"
# $ project_abstract.cdata: chr ""
# $ sector.Name : chr "Agriculture, fishing, and forestry"
# $ sector.code : chr "AX"
# $ countrycode : chr "5M"
# $ countryname : chr "Middle East and North Africa"
# $ location_geoLocId : chr "0002464470"
# $ location_url : chr "javascript:projectPopupInfo('P130343', '0002464470')"
# $ location_geoLocName : chr "Tunis"
# $ location_latitude : chr "36.8190"
# $ location_longitude : chr "10.1660"
# $ location_country : chr "TN"
# $ location_countryName : chr "Tunisia"
The first problem is that the data cannot be simplified because the JSON is not tidy: it has data in its keys (the project names). The workaround is to remove the key names before simplifying:
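The split-and-join approach above can be tried on a small inline document. This is only a sketch under assumptions: the two projects (P1, P2), their fields and their locations are made up for illustration and are not taken from the real file, and the nesting is done with plain map() calls rather than the depth helpers used above.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Hypothetical miniature of the real file: project ids are the JSON keys
txt <- '{"projects": {
  "P1": {"id": "P1", "fy": "2013",
         "locations": [{"geoLocName": "Tunis", "country": "TN"},
                       {"geoLocName": "Sfax",  "country": "TN"}]},
  "P2": {"id": "P2", "fy": "2014",
         "locations": [{"geoLocName": "Cairo", "country": "EG"}]}
}}'

# Parse to plain nested lists (no automatic simplification)
json_data <- fromJSON(txt, simplifyVector = FALSE)[[1]]

# One data.frame of locations, tagged with the owning project's key as 'id'
locations <- json_data %>%
  map("locations") %>%
  map(~ bind_rows(map(.x, ~ as.data.frame(.x, stringsAsFactors = FALSE)))) %>%
  bind_rows(.id = "id")
names(locations) <- paste0("location_", names(locations))

# One data.frame of project-level fields, with the nested part dropped
projects <- json_data %>%
  map(function(x) { x$locations <- NULL; x }) %>%
  map(~ as.data.frame(.x, stringsAsFactors = FALSE)) %>%
  bind_rows()

# One row per project/location pair
joined <- inner_join(projects, locations, by = c("id" = "location_id"))
print(joined)
```

Each project row is repeated once per location, which is the flat shape that write.csv can handle.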
library(jsonlite)
mydata <- fromJSON('http://pastebin.com/raw/HS3YEQxZ', simplifyVector = FALSE)
project_names <- names(mydata$projects)
names(mydata$projects) = NULL
out <- jsonlite:::simplify(mydata, flatten = TRUE)
projects <- out$projects
projects$name <- project_names
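This gets the data into the proper data-frame shape in projects. However, if you look at the structure, you will see that you have a one-to-many dataset: the sector and locations columns each hold a nested data frame with multiple rows.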
str(projects[1,])
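So you need to perform a left-join operation to merge this into a simple 2D data frame. That is a problem in its own right and has nothing to do with JSON.
Because you have multiple nested columns, it is unclear what you expect the output to look like. Use tidyr::unnest to left-join one of the nested columns: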
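To see the one-to-many shape on something small, here is a sketch under assumptions: the JSON is a made-up miniature (projects P1 and P2 are hypothetical), and instead of the internal jsonlite:::simplify call it uses a toJSON/fromJSON round trip, which relies only on exported jsonlite functions. The last step lists the columns that are nested and therefore cannot go straight into a csv.

```r
library(jsonlite)

# Hypothetical miniature: the keys ("P1", "P2") carry data, as in the real file
txt <- '{"projects": {
  "P1": {"id": "P1", "fy": "2013",
         "locations": [{"geoLocName": "Tunis"}, {"geoLocName": "Sfax"}]},
  "P2": {"id": "P2", "fy": "2014",
         "locations": [{"geoLocName": "Cairo"}]}
}}'

raw <- fromJSON(txt, simplifyVector = FALSE)

# Keep the keys, then drop them so the projects become an unnamed array
project_names <- names(raw$projects)
records <- unname(raw$projects)

# Round trip: an array of objects simplifies to a data frame
projects <- fromJSON(toJSON(records, auto_unbox = TRUE))
projects$name <- project_names

# Columns that are lists (here: nested data frames) need flattening first
nested <- names(projects)[vapply(projects, is.list, logical(1))]
print(nested)
```

Any column reported in nested holds one data frame per project, so it has to be joined back to the project rows before the result is a plain 2D table.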
# Unnest 'locations' column
out <- tidyr::unnest(projects, locations)
names(out)
Note that in this case tidyr (at least in the version used at the time) automatically drops the sector column, because it is incompatible with the left join of projects to locations.
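A minimal sketch of what unnest does to a one-to-many column, followed by the csv write the question was after. The data is made up (hypothetical ids P1 and P2 and three location names), and a tibble stands in for the real projects data frame.

```r
library(tidyr)

# Hypothetical projects table with one nested data frame per row
projects <- tibble::tibble(
  id = c("P1", "P2"),
  locations = list(
    data.frame(geoLocName = c("Tunis", "Sfax"), stringsAsFactors = FALSE),
    data.frame(geoLocName = "Cairo", stringsAsFactors = FALSE)
  )
)

# Repeat each project row once per nested location row
flat <- unnest(projects, locations)

# The flat result has only atomic columns, so it can be written to csv
csv_path <- tempfile(fileext = ".csv")
write.csv(flat, csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(round_trip)
```

If more than one nested column has to end up in the csv, one option is to unnest each into its own table and write them separately, since unnesting them together multiplies the rows.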