将Json文件读入没有嵌套列表的data.frame中

Mat*_*ewR 17 json r jsonlite

我正在尝试将json文件加载到r中的data.frame中.我对jsonlite包中的fromJSON函数运气不错 - 但我得到了嵌套列表,并且不确定如何将输入展平为二维data.frame.Jsonlite以data.frame的形式读取文件,但在一些变量中留下嵌套列表.

有人在使用嵌套列表读入时将任何JSON文件加载到data.frame有任何提示.

#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")

# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 

# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )

# but many of the columns are still lists:
sapply( providers , class)

# Some of these lists have a single level
head( providers$facility_type )

# Some have lot more than two - for example nine
providers[ , 6][[1]]
Run Code Online (Sandbox Code Playgroud)

我希望每个npi有一行,而不是单个列表的每个切片的单独列 - 这样数据框就有"plan_id_type","plan_id","network_tier"的cols九次,也许是colnames,从0到8我已经能够使用这个网站:http://www.convertcsv.com/json-to-csv.htm来获取这个文件的两个维度,但由于我做了数百个这样的工作,我希望能够动态地做.这是文件:http: //s000.tinyupload.com/download.php?file_id = 10808537503095762868&t = 1080853750309576286812811 - 我想使用fromJson函数获取具有此结构的文件作为data.frame加载

这是我尝试过的一些事情; 所以我想到了两种方法; 第一:使用不同的函数来读取Json文件,我已经看过了

rjson but that reads in a list
library( rjson )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )
Run Code Online (Sandbox Code Playgroud)

我尝试过RJSONIO - 我试过这个将导入的json数据导入R中的数据帧

json-data-into-a-data-frame-in-r
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )

json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

# but When converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)
Run Code Online (Sandbox Code Playgroud)

所以,我尝试的第二种方法是将所有列表转换为我的data.frame中的变量

detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )

library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 
providers <- flatten( providers )
Run Code Online (Sandbox Code Playgroud)

我可以拉出其中一个列表 - 但由于缺失,我无法合并回我的数据帧

a <- data.frame(Reduce(rbind,  providers$facility_type))
length( a ) == nrow( providers )
Run Code Online (Sandbox Code Playgroud)

我还尝试了以下建议:将嵌套列表转换为数据帧.和其他一些东西一样好,但没有运气

a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Run Code Online (Sandbox Code Playgroud)

任何帮助非常感谢

A5C*_*2T1 13

更新:2016年2月21日

col_fixer已更新为包含一个vec2col参数,该参数允许您将列表列展平为单个字符串或一组列.


data.frame您下载的内容中,我看到了几种不同的列类型.存在包含相同类型的载体的正常列.列表列可以是项目,也NULL可以是平面向量.有列表列,其中有data.frames作为列表元素.列表列包含data.frame与main相同的行数data.frame.

这是一个重新创建这些条件的示例数据集:

mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                   facility = I(list(c("x", "y"), NULL, "x")),
  address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                   data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                   data.frame(v1 = 1, v2 = NA, v3 = 3))))

mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))
Run Code Online (Sandbox Code Playgroud)

str示例的示例data.frame如下:

str(mydf)
## 'data.frame':    3 obs. of  5 variables:
##  $ id      : int  1 2 3
##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
##  $ facility:List of 3
##   ..$ : chr  "x" "y"
##   ..$ : NULL
##   ..$ : chr "x"
##   ..- attr(*, "class")= chr "AsIs"
##  $ address :List of 3
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: num 2
##   .. ..$ v4: num 3
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ v1: int  1 2
##   .. ..$ v2: int  3 4
##   .. ..$ v3: num  5 5
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: logi NA
##   .. ..$ v3: num 3
##   ..- attr(*, "class")= chr "AsIs"
##  $ person  :'data.frame':    3 obs. of  3 variables:
##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##   ..$ age       : num  20 32 23
##   ..$ preference: logi  TRUE FALSE TRUE
## NULL
Run Code Online (Sandbox Code Playgroud)

你可以"扁平化"这种方法的一种方法是"修复"列表列.有三个修复.

  1. flatten (来自"jsonlite")将处理像"人物"列这样的列.
  2. 像"设施"列这样的列可以使用固定toString,它可以将每个元素转换为逗号分隔的项目,也可以将其转换为多个列.
  3. data.frames的列,有些有多行的列,首先需要展平成一行(通过转换为"宽"格式)然后需要作为单个绑定在一起data.table.(我正在使用"data.table"进行重新整形和将行绑定在一起).

我们可以使用如下函数来处理第二和第三点:

col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      as.data.table(data.table::transpose(x))
    } else {
      vapply(x, toString, character(1L))
    }
  } else {
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}
Run Code Online (Sandbox Code Playgroud)

我们将把它和flatten函数集成到另一个可以完成大部分处理的函数中.

Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  indf <- flatten(indf)
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}
Run Code Online (Sandbox Code Playgroud)

运行该功能给我们:

Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA
Run Code Online (Sandbox Code Playgroud)

或者,将向量分成不同的列:

Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA
Run Code Online (Sandbox Code Playgroud)

这是str:

str(Flattener(mydf))
## 'data.frame':    3 obs. of  14 variables:
##  $ id               : int  1 2 3
##  $ type             : Factor w/ 2 levels "A","B": 1 1 2
##  $ person.name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##  $ person.age       : num  20 32 23
##  $ person.preference: logi  TRUE FALSE TRUE
##  $ facility         : chr  "x, y" "" "x"
##  $ address.v1_1     : num  1 1 1
##  $ address.v1_2     : num  NA 2 NA
##  $ address.v2_1     : num  2 3 NA
##  $ address.v2_2     : num  NA 4 NA
##  $ address.v4_1     : num  3 NA NA
##  $ address.v4_2     : num  NA NA NA
##  $ address.v3_1     : num  NA 5 3
##  $ address.v3_2     : num  NA 5 NA
## NULL
Run Code Online (Sandbox Code Playgroud)

在您的"提供者"对象上,这可以非常快速一致地运行:

library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
#                        expr        min         lq      mean    median        uq       max neval
#        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
#  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
#      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100

library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述


bgo*_*dst 11

我的第一步是通过RCurl::getURL()rjson::fromJSON()根据您的第二个代码示例加载数据:

##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);

##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
Run Code Online (Sandbox Code Playgroud)

接下来,为了深入理解数据的结构和清晰度,我编写了一组辅助函数:

##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
    nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
    keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
    func=identity, ## a function to run separately on each node once keyList has been exhausted
    ..., ## further arguments passed to func()
    joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
    if (length(keyList) == 0L) {
        ret <- if (is.null(nodes)) NULL else func(nodes,...)
    } else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
        ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
        if (!is.null(joinFunc))
            ret <- do.call(joinFunc,ret);
    } else {
        ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
    }; ## end if
    ret;
}; ## end if
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
Run Code Online (Sandbox Code Playgroud)

理解上述内容的关键是keyList参数.假设你有一个这样的列表:

list(NULL,'addresses',2:3,'city')
Run Code Online (Sandbox Code Playgroud)

这将选择主列表所有元素下面的地址列表下面的第二个和第三个地址元素下面的所有城市字符串.

There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:

unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St"  "Fairbanks"      "1650 Cowles St" "Fairbanks"
Run Code Online (Sandbox Code Playgroud)

Here are the remaining helper functions I wrote in the process of working on this problem:

## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };

## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };

## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));

## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);

## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
Run Code Online (Sandbox Code Playgroud)

I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:

##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"

## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"

## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
##  [1] "npi"             "type"            "facility_name"   "facility_type"   "addresses"       "plans"           "last_updated_on" "name"            "speciality"      "accepting"       "languages"       "gender"

## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
##    npi type facility_name facility_type addresses plans last_updated_on  name speciality accepting languages gender .count
## 1 TRUE TRUE          TRUE          TRUE      TRUE  TRUE            TRUE FALSE      FALSE     FALSE     FALSE  FALSE    279
## 2 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE   TRUE   2973
## 3 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE  FALSE      4

## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
##                key                                                                                        tl
## 1              npi                                                                              character[1]
## 2             type                                                                              character[1]
## 3    facility_name                                                                              character[1]
## 4    facility_type                                                    character[1],character[2],character[3]
## 5        addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6            plans                                                                                   list[9]
## 7  last_updated_on                                                                              character[1]
## 8             name                                                                                 namedlist
## 9       speciality                                       character[1],character[2],character[3],character[4]
## 10       accepting                                                                              character[1]
## 11       languages                          character[2],character[3],character[4],character[6],character[5]
## 12          gender                                                                              character[1]

## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them

## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"

## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address"   "city"      "state"     "zip"       "phone"     "address_2"

## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##   address city state  zip phone address_2 .count
## 1    TRUE TRUE  TRUE TRUE  TRUE     FALSE   1898
## 2    TRUE TRUE  TRUE TRUE  TRUE      TRUE   2575

## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##         key           tl
## 1   address character[1]
## 2      city character[1]
## 3     state character[1]
## 4       zip character[1]
## 5     phone character[1]
## 6 address_2 character[1]

## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"

## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id"      "network_tier"

## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##   plan_id_type plan_id network_tier .count
## 1         TRUE    TRUE         TRUE  29304

## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##            key           tl
## 1 plan_id_type character[1]
## 2      plan_id character[1]
## 3 network_tier character[1]

## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first"  "middle" "last"

## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
##   first middle last .count
## 1  TRUE   TRUE TRUE   2679
## 2  TRUE  FALSE TRUE    298

## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
##      key           tl
## 1  first character[1]
## 2 middle character[1]
## 3   last character[1]
Run Code Online (Sandbox Code Playgroud)

这是我对数据的总结:

  • 一个顶级主列表,长度为3256.
  • 每个元素都是具有不一致键集的哈希.所有主要哈希共有12个键,其中有3种键组.
  • 6个散列值是标量字符串,3个是可变长度字符串向量,addresses是可变长度plans列表,是总长度为9的列表,并且name是散列.
  • 每个addresses列表元素都是一个散列,其中有5或6个键用于标量字符串,address_2是不一致的字符串.
  • 每个plans列表元素都是一个散列,其中3个键是标量字符串,没有任何不一致.
  • 每个name哈希有firstlast,但不总是middle标串.

这里最重要的观察是并行节点之间没有类型不一致(除了遗漏和长度差异).这意味着我们可以将所有并行节点组合成向量而不考虑类型强制.我们可以将所有数据展平为二维结构,前提是我们将列与足够深的节点相关联,这样所有列都对应于输入列表中的单个标量字符串节点.

以下是我的解决方案.请注意,这取决于辅助功能tl(),keyListToStr()mkcsv()我前面定义.

##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
    nodes, ## current level node selection
    ..., ## additional arguments to data.frame()
    keyList=list(), ## current key path under main list
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
    tlList <- lapply(nodes,tl);
    typeList <- do.call(c,lapply(tlList,`[[`,'type'));
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
    type <- typeList[1L];
    if (type == 'namedlist') { ## hash; recurse
        allKeys <- unique(do.call(c,lapply(nodes,names)));
        ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
    } else if (type == 'list') { ## array; recurse
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        allIndexes <- seq_len(maxLen);
        ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        if (is.null(sep)) {
            ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
        } else {
            ## keep original type if maxLen is 1, IOW don't stringify
            ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
        }; ## end if
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
    ret;
}; ## end extractLevelColumns()

## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
Run Code Online (Sandbox Code Playgroud)

extractLevelColumns()函数遍历输入列表并提取每个叶节点位置处的所有节点值,将它们组合到具有缺失值的NA的向量中,然后转换为单列data.frame.立即设置列名,利用参数化mkname()函数定义keyList字符串列名的字符串化.从每个递归调用返回多个列作为data.frames列表,同样从顶层调用返回.

它还验证并行节点之间没有类型不一致.虽然我之前手动验证了数据的一致性,但我尝试尽可能地编写通用和可重用的解决方案,因为这样做总是一个好主意,因此这个验证步骤是合适的.

flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.

An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.

我还添加了一个sep默认的参数NULL.如果NULL,多元素叶节点将被分成多个列,每个元素一个,在列名称上有一个索引后缀用于区分.否则,将其作为字符串分隔符,将所有元素连接到单个字符串,并且仅为该节点生成一个列.

在性能方面,它非常快.这是一个演示:

## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
##    user  system elapsed
##   2.265   0.000   2.268
Run Code Online (Sandbox Code Playgroud)

结果:

class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256  126
##   [1] "npi"                    "type"                   "facility_name"          "facility_type.1"        "facility_type.2"        "facility_type.3"        "addresses.1.address"    "addresses.1.city"       "addresses.1.state"
##  [10] "addresses.1.zip"        "addresses.1.phone"      "addresses.1.address_2"  "addresses.2.address"    "addresses.2.city"       "addresses.2.state"      "addresses.2.zip"        "addresses.2.phone"      "addresses.2.address_2"
##  [19] "addresses.3.address"    "addresses.3.city"       "addresses.3.state"      "addresses.3.zip"        "addresses.3.phone"      "addresses.3.address_2"  "addresses.4.address"    "addresses.4.city"       "addresses.4.state"
##  [28] "addresses.4.zip"        "addresses.4.phone"      "addresses.4.address_2"  "addresses.5.address"    "addresses.5.address_2"  "addresses.5.city"       "addresses.5.state"      "addresses.5.zip"        "addresses.5.phone"
##  [37] "addresses.6.address"    "addresses.6.address_2"  "addresses.6.city"       "addresses.6.state"      "addresses.6.zip"        "addresses.6.phone"      "addresses.7.address"    "addresses.7.address_2"  "addresses.7.city"
##  [46] "addresses.7.state"      "addresses.7.zip"        "addresses.7.phone"      "addresses.8.address"    "addresses.8.address_2"  "addresses.8.city"       "addresses.8.state"      "addresses.8.zip"        "addresses.8.phone"
##  [55] "addresses.9.address"    "addresses.9.address_2"  "addresses.9.city"       "addresses.9.state"      "addresses.9.zip"        "addresses.9.phone"      "addresses.10.address"   "addresses.10.address_2" "addresses.10.city"
##  [64] "addresses.10.state"     "addresses.10.zip"       "addresses.10.phone"     "addresses.11.address"   "addresses.11.address_2" "addresses.11.city"      "addresses.11.state"     "addresses.11.zip"       "addresses.11.phone"
##  [73] "addresses.12.address"   "addresses.12.address_2" "addresses.12.city"      "addresses.12.state"     "addresses.12.zip"       "addresses.12.phone"     "addresses.13.address"   "addresses.13.city"      "addresses.13.state"
##  [82] "addresses.13.zip"       "addresses.13.phone"     "plans.1.plan_id_type"   "plans.1.plan_id"        "plans.1.network_tier"   "plans.2.plan_id_type"   "plans.2.plan_id"        "plans.2.network_tier"   "plans.3.plan_id_type"
##  [91] "plans.3.plan_id"        "plans.3.network_tier"   "plans.4.plan_id_type"   "plans.4.plan_id"        "plans.4.network_tier"   "plans.5.plan_id_type"   "plans.5.plan_id"        "plans.5.network_tier"   "plans.6.plan_id_type"
## [100] "plans.6.plan_id"        "plans.6.network_tier"   "plans.7.plan_id_type"   "plans.7.plan_id"        "plans.7.network_tier"   "plans.8.plan_id_type"   "plans.8.plan_id"        "plans.8.network_tier"   "plans.9.plan_id_type"
## [109] "plans.9.plan_id"        "plans.9.network_tier"   "last_updated_on"        "name.first"             "name.middle"            "name.last"              "speciality.1"           "speciality.2"           "speciality.3"
## [118] "speciality.4"           "accepting"              "languages.1"            "languages.2"            "languages.3"            "languages.4"            "languages.5"            "languages.6"            "gender"
Run Code Online (Sandbox Code Playgroud)

生成的data.frame非常广泛,但我们可以使用rowToFrame()和一次npiToFrame()获得一行的良好垂直布局.例如,这是第一行:

rowToFrame(df[1L,]);
##                     column           value
## 1                      npi      1063645026
## 2                     type        FACILITY
## 3            facility_name EXPRESS SCRIPTS
## 4          facility_type.1      Pharmacies
## 5          facility_type.2            <NA>
## 6          facility_type.3            <NA>
## 7      addresses.1.address    4750 E 450 S
## 8         addresses.1.city      WHITESTOWN
## 9        addresses.1.state              IN
## 10         addresses.1.zip           46075
## 11       addresses.1.phone      2012695236
## 12   addresses.1.address_2            <NA>
## 13     addresses.2.address            <NA>
## 14        addresses.2.city            <NA>
## 15       addresses.2.state            <NA>
## 16         addresses.2.zip            <NA>
## 17       addresses.2.phone            <NA>
## 18   addresses.2.address_2            <NA>
## 19     addresses.3.address            <NA>
## 20        addresses.3.city            <NA>
## 21       addresses.3.state            <NA>
##
## ... snip ...
##
## 77        addresses.12.zip            <NA>
## 78      addresses.12.phone            <NA>
## 79    addresses.13.address            <NA>
## 80       addresses.13.city            <NA>
## 81      addresses.13.state            <NA>
## 82        addresses.13.zip            <NA>
## 83      addresses.13.phone            <NA>
## 84    plans.1.plan_id_type    HIOS-PLAN-ID
## 85         plans.1.plan_id  38344AK0620003
## 86    plans.1.network_tier   HERITAGE-PLUS
## 87    plans.2.plan_id_type    HIOS-PLAN-ID
## 88         plans.2.plan_id  38344AK0620004
## 89    plans.2.network_tier   HERITAGE-PLUS
## 90    plans.3.plan_id_type    HIOS-PLAN-ID
## 91         plans.3.plan_id  38344AK0620006
## 92    plans.3.network_tier   HERITAGE-PLUS
## 93    plans.4.plan_id_type    HIOS-PLAN-ID
## 94         plans.4.plan_id  38344AK0620008
## 95    plans.4.network_tier   HERITAGE-PLUS
## 96    plans.5.plan_id_type    HIOS-PLAN-ID
## 97         plans.5.plan_id  38344AK0570001
## 98    plans.5.network_tier   HERITAGE-PLUS
## 99    plans.6.plan_id_type    HIOS-PLAN-ID
## 100        plans.6.plan_id  38344AK0570002
## 101   plans.6.network_tier   HERITAGE-PLUS
## 102   plans.7.plan_id_type    HIOS-PLAN-ID
## 103        plans.7.plan_id  38344AK0980003
## 104   plans.7.network_tier   HERITAGE-PLUS
## 105   plans.8.plan_id_type    HIOS-PLAN-ID
## 106        plans.8.plan_id  38344AK0980006
## 107   plans.8.network_tier   HERITAGE-PLUS
## 108   plans.9.plan_id_type    HIOS-PLAN-ID
## 109        plans.9.plan_id  38344AK0980012
## 110   plans.9.network_tier   HERITAGE-PLUS
## 111        last_updated_on      2015-10-14
## 112             name.first            <NA>
## 113            name.middle            <NA>
## 114              name.last            <NA>
## 115           speciality.1            <NA>
## 116           speciality.2            <NA>
## 117           speciality.3            <NA>
## 118           speciality.4            <NA>
## 119              accepting            <NA>
## 120            languages.1            <NA>
## 121            languages.2            <NA>
## 122            languages.3            <NA>
## 123            languages.4            <NA>
## 124            languages.5            <NA>
## 125            languages.6            <NA>
## 126                 gender            <NA>
Run Code Online (Sandbox Code Playgroud)

通过对单个记录进行多次抽查,我对结果进行了彻底的测试,结果看起来都很正确.如果您有任何疑问,请告诉我.