Editing and filtering a JSON list of lists in R

Dav*_*ino 1 json r dplyr jsonlite tidyr

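I'm trying to display this dataset -> https://mtgjson.com/json/AllSets.json.zip

However, I'd like to flatten the data so that it isn't nested as a pile of JSON data in lists, within lists, within lists.

More specifically, I'm trying to display the data as a data frame, ordered by $releaseDate (one of the variables).

Here is my attempt so far: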

library(jsonlite)
library(tidyjson)
mtgdata <- fromJSON("~/path/to/file.json")

The resulting mtgdata is this list of lists:

summary(mtgdata)
        Length Class  Mode
UST       9     -none- list
UNH      10     -none- list
UGL      11     -none- list
pWOS      8     -none- list
pWOR      8     -none- list
pWCQ      8     -none- list
pSUS      8     -none- list
pSUM     10     -none- list
pREL      8     -none- list
pPRO      8     -none- list
pPRE      8     -none- list
pPOD      7     -none- list
pMPR      8     -none- list
pMGD      8     -none- list
pMEI      8     -none- list
pLPA      8     -none- list
pLGM      8     -none- list
pJGP     10     -none- list
pHHO     11     -none- list
pWPN      8     -none- list
pGTW      8     -none- list
pGRU     10     -none- list
pGPX      8     -none- list
pFNM     10     -none- list
pELP      8     -none- list
pDRC      7     -none- list
pCMP      8     -none- list
pCEL      8     -none- list
pARL      8     -none- list
pALP     10     -none- list
p2HG      8     -none- list
p15A      8     -none- list
PD3       9     -none- list
PD2       9     -none- list
H09       9     -none- list
PTK      12     -none- list
POR      12     -none- list
PO2      13     -none- list
PCA       7     -none- list
PC2      10     -none- list
HOP      10     -none- list
VMA       9     -none- list
MMA      10     -none- list
MM3       8     -none- list
MM2      11     -none- list
MED       9     -none- list
ME4       9     -none- list
ME3       9     -none- list
ME2       9     -none- list
IMA       8     -none- list
EMA       9     -none- list
A25       8     -none- list
MPS_AKH   8     -none- list
MPS       9     -none- list
EXP       9     -none- list
E02       7     -none- list
V17       8     -none- list
V16       7     -none- list
V15       9     -none- list
V14       9     -none- list
V13       9     -none- list
V12      10     -none- list
V11      10     -none- list
V10       9     -none- list
V09      10     -none- list
DRB       9     -none- list
EVG       9     -none- list
DDT       7     -none- list
DDS       7     -none- list
DDR       7     -none- list
DDQ       8     -none- list
DDP      10     -none- list
DDO      10     -none- list
DDN      10     -none- list
DDM      10     -none- list
DDL      10     -none- list
DDK      10     -none- list
DDJ      10     -none- list
DDI      10     -none- list
DDH      10     -none- list
DDG      10     -none- list
DDF      10     -none- list
DDE      10     -none- list
DDD       9     -none- list
DDC       9     -none- list
DD3_JVC   9     -none- list
DD3_GVL   9     -none- list
DD3_EVG   9     -none- list
DD3_DVD   9     -none- list
DD2      11     -none- list
CNS      11     -none- list
CN2       9     -none- list
CMD      11     -none- list
CMA       7     -none- list
CM1      10     -none- list
C17       6     -none- list
C16       8     -none- list
C15      10     -none- list
C14      10     -none- list
C13      10     -none- list
CEI       9     -none- list
CED       9     -none- list
E01       7     -none- list
ARC       9     -none- list
ZEN      12     -none- list
XLN      12     -none- list
WWK      12     -none- list
WTH      13     -none- list
W17       8     -none- list
W16       8     -none- list
VIS      13     -none- list
VAN       8     -none- list
USG      13     -none- list
ULG      13     -none- list
UDS      13     -none- list
TSP      12     -none- list
TSB      12     -none- list
TPR      11     -none- list
TOR      12     -none- list
TMP      13     -none- list
THS      12     -none- list
STH      13     -none- list
SOM      12     -none- list
SOK      12     -none- list
SOI      10     -none- list
SHM      12     -none- list
SCG      12     -none- list
S99      11     -none- list
S00      11     -none- list
RTR      12     -none- list
RQS       6     -none- list
ROE      12     -none- list
RIX      12     -none- list
RAV      12     -none- list
PLS      13     -none- list
PLC      12     -none- list
PCY      13     -none- list
ORI      11     -none- list
ONS      12     -none- list
OGW      10     -none- list
ODY      13     -none- list
NPH      12     -none- list
NMS      14     -none- list
MRD      12     -none- list
MOR      12     -none- list
MMQ      13     -none- list
MIR      13     -none- list
MGB      10     -none- list
MD1       9     -none- list
MBS      12     -none- list
M15      11     -none- list
M14      11     -none- list
M13      11     -none- list
M12      11     -none- list
M11      11     -none- list
M10      11     -none- list
LRW      12     -none- list
LGN      12     -none- list
LEG      12     -none- list
LEB      11     -none- list
LEA      11     -none- list
KTK      12     -none- list
KLD       9     -none- list
JUD      12     -none- list
JOU      12     -none- list
ITP      11     -none- list
ISD      12     -none- list
INV      13     -none- list
ICE      13     -none- list
HOU       9     -none- list
HML      12     -none- list
GTC      12     -none- list
GPT      12     -none- list
FUT      12     -none- list
FRF_UGIN 10     -none- list
FRF      12     -none- list
FEM      11     -none- list
EXO      13     -none- list
EVE      12     -none- list
EMN       9     -none- list
DTK      12     -none- list
DST      12     -none- list
DRK      12     -none- list
DPA       9     -none- list
DKM       9     -none- list
DKA      12     -none- list
DIS      12     -none- list
DGM      12     -none- list
CST      11     -none- list
CSP      12     -none- list
CP3       7     -none- list
CP2       7     -none- list
CP1       7     -none- list
CON      13     -none- list
CHR      11     -none- list
CHK      12     -none- list
BTD      10     -none- list
BRB      10     -none- list
BOK      12     -none- list
BNG      12     -none- list
BFZ      12     -none- list
AVR      12     -none- list
ATQ      11     -none- list
ATH       9     -none- list
ARN      11     -none- list
ARB      12     -none- list
APC      13     -none- list
ALL      13     -none- list
ALA      12     -none- list
AKH       9     -none- list
AER       9     -none- list
9ED      12     -none- list
8ED      12     -none- list
7ED      12     -none- list
6ED      12     -none- list
5ED      12     -none- list
5DN      12     -none- list
4ED      12     -none- list
3ED      12     -none- list
2ED      11     -none- list
10E      11     -none- list

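Inside each of these lists are the variables I'm interested in analyzing, so that I can filter and sort the data as if it were a flat data frame.

When we inspect the variables in one of the lists (e.g. "mtgdata$UST"), we get this set of variables: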

names(mtgdata$UST)
[1] "name"        "code"        "releaseDate" "border"      "type"        "booster"     "mkm_name"
[8] "mkm_id"      "cards"

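Running the same query on another list in mtgdata ("mtgdata$SOI") gives a different set of variables, although they are mostly the same.

As I mentioned above, I'm mainly interested in flattening this dataset and ordering it by mtgdata$releaseDate, but as it stands "$releaseDate" is nested inside the first level of lists ("$UST", etc.).

Thanks very much for any help with this, or for suggestions on how I could phrase this question better.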

hrb*_*str 5

You could convert the array of JSON objects into a file of ndjson records on the command line and then read it with something like ndjson::stream_in("filename_of the_thing_you_just_converted"), but you end up with a 14,000+ column, very useless "flat" data frame.

Instead, do some exploring:

library(tidyverse)

as1 <- jsonlite::read_json("~/Downloads/AllSets.json")

str(as1, 1)
## List of 221
##  $ UST     :List of 9
##  $ UNH     :List of 10
##  $ UGL     :List of 11
##  $ pWOS    :List of 8
##  $ pWOR    :List of 8
##  $ pWCQ    :List of 8
##  $ pSUS    :List of 8
##  $ pSUM    :List of 10
##  $ pREL    :List of 8
##  $ pPRO    :List of 8
##  $ pPRE    :List of 8
##  $ pPOD    :List of 7
##  $ pMPR    :List of 8
##  $ pMGD    :List of 8
##  $ pMEI    :List of 8
##  $ pLPA    :List of 8
##  $ pLGM    :List of 8
##  $ pJGP    :List of 10
##  $ pHHO    :List of 11
## ...

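Ugh... it's one of "those" JSON files that doesn't see fit to populate all the elements for each record, even though the entire file is, in theory, supposed to be consistent.

Let's see which of the JSON array elements have the largest number of fields populated, since those are the ones most likely to be fully populated: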

# Count the fields in each set, then sort by that count.
# (tidy() on a plain vector is deprecated in newer broom releases;
# tibble::enframe(name = "names", value = "x") gives the same two columns.)
map_dbl(as1, length) %>%
  broom::tidy() %>%
  arrange(desc(x))
## # A tibble: 221 x 2
##    names     x
##    <chr> <dbl>
##  1 NMS    14.0
##  2 PO2    13.0
##  3 WTH    13.0
##  4 VIS    13.0
##  5 USG    13.0
##  6 ULG    13.0
##  7 UDS    13.0
##  8 TMP    13.0
##  9 STH    13.0
## 10 PLS    13.0
## # ... with 211 more rows

Let's take a look at NMS:

str(as1[["NMS"]], 1)
## List of 14
##  $ name              : chr "Nemesis"
##  $ code              : chr "NMS"
##  $ gathererCode      : chr "NE"
##  $ magicCardsInfoCode: chr "ne"
##  $ oldCode           : chr "NEM"
##  $ releaseDate       : chr "2000-02-14"
##  $ border            : chr "black"
##  $ type              : chr "expansion"
##  $ block             : chr "Masques"
##  $ booster           :List of 15
##  $ translations      :List of 5
##  $ mkm_name          : chr "Nemesis"
##  $ mkm_id            : int 32
##  $ cards             :List of 143

We really don't want to flatten booster, translations or cards. Those should stay as list columns and be unnested as needed.
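
As a rough illustration of what "unnest as needed" can look like, here's a base-R sketch on a made-up two-row table with a cards list column. The set codes and card names are hypothetical stand-ins, not values from AllSets.json, and this is just one possible way to do it.

```r
# Hypothetical two-set table; `cards` is a list column (one list of cards per set).
sets <- data.frame(code = c("AAA", "BBB"), stringsAsFactors = FALSE)
sets$cards <- list(
  list(list(name = "Card one"), list(name = "Card two")),
  list(list(name = "Card three"))
)

# The list column can be summarised without flattening it...
sets$n_cards <- lengths(sets$cards)

# ...and expanded to one row per card only when that shape is needed.
card_names <- lapply(sets$cards, function(cs) {
  vapply(cs, function(card) card$name, character(1))
})
long <- data.frame(
  code = rep(sets$code, sets$n_cards),
  card = unlist(card_names),
  stringsAsFactors = FALSE
)
```

The per-set table stays small (one row per set), and the long per-card table is only built when a question actually needs it.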

But, since each record has different fields, we can't simply use data.table::rbindlist() or dplyr::bind_rows(), since they'll complain about some of those columns.
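
To see how uneven the records are, one option is to compare each record's field names against the union of all field names. A minimal sketch with made-up records (the recs list is hypothetical; with the real data you would pass in the parsed JSON list instead):

```r
# Two hypothetical records; the second one lacks the `block` field.
recs <- list(
  A = list(name = "Set A", code = "A", block = "Some block"),
  B = list(name = "Set B", code = "B")
)

# Every field name that appears in at least one record.
all_fields <- unique(unlist(lapply(recs, names)))

# For each record, the fields it does not populate.
missing_by_record <- lapply(recs, function(r) setdiff(all_fields, names(r)))
```

Any record with a non-empty entry in missing_by_record is one that needs a fallback value when it's turned into a row.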

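We have to go record by record and turn each one into a data frame, handling the missing fields and wrapping the list elements in list(). We'll use a helper function to simplify the idiom for testing for missing values: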

`%l0%` <- function(x, y) if (length(x) > 0) x else y

That helper is a bit more robust than the %||% that rides along with purrr.
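
To make the difference concrete, here's a small sketch of how the helper behaves on a missing field, an empty field and a populated one. The rec list is made up for illustration. The point is that %l0% also catches zero-length values, which a check that only tests for NULL (as, to my knowledge, purrr's %||% does) would let through.

```r
# Helper from the answer: fall back to y when x has length zero.
`%l0%` <- function(x, y) if (length(x) > 0) x else y

# Hypothetical record: `oldCode` is absent, `block` is present but empty.
rec <- list(code = "XYZ", block = character(0))

missing_field <- rec$oldCode %l0% NA_character_  # rec$oldCode is NULL, so the fallback is used
empty_field   <- rec$block   %l0% NA_character_  # zero-length vector, so the fallback is used
present_field <- rec$code    %l0% NA_character_  # populated, so the value is kept
```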

Finally:

# Build one single-row tibble per set and row-bind them.
# tibble() replaces the deprecated data_frame().
map_df(as1, ~{
  tibble(
    name = .x$name %l0% NA_character_,
    code = .x$code,
    gathererCode = .x$gathererCode %l0% NA_character_,
    magicCardsInfoCode = .x$magicCardsInfoCode %l0% NA_character_,
    oldCode = .x$oldCode %l0% NA_character_,
    releaseDate = .x$releaseDate %l0% NA_character_,
    border = .x$border,
    type = .x$type,
    block = .x$block %l0% NA_character_,
    booster = list(.x$booster),            # keep nested data as list columns
    translations = list(.x$translations),
    mkm_name = .x$mkm_name %l0% NA_character_,
    mkm_id = .x$mkm_id %l0% NA_integer_,   # integer NA so the column type stays <int>
    cards = list(.x$cards)
  )
}) -> all_sets

And you can see the result:

all_sets
## # A tibble: 221 x 14
##    name           code  gathererCode magicCardsInfoC… oldCode releaseDate border type  block booster
##    <chr>          <chr> <chr>        <chr>            <chr>   <chr>       <chr>  <chr> <chr> <list>
##  1 Unstable       UST   NA           NA               NA      2017-12-08  silver un    NA    <list […
##  2 Unhinged       UNH   NA           uh               NA      2004-11-20  silver un    NA    <list […
##  3 Unglued        UGL   UG           ug               NA      1998-08-11  silver un    NA    <list […
##  4 Wizards of th… pWOS  NA           wotc             NA      1999-09-04  black  promo NA    <NULL>
##  5 Worlds         pWOR  NA           wrl              NA      1999-08-04  black  promo NA    <NULL>
##  6 World Magic C… pWCQ  NA           wmcq             NA      2013-04-06  black  promo NA    <NULL>
##  7 Super Series   pSUS  NA           sus              NA      1999-12-01  black  promo NA    <NULL>
##  8 Summer of Mag… pSUM  NA           sum              NA      2007-07-21  black  promo NA    <NULL>
##  9 Release Events pREL  NA           rep              NA      2003-07-26  black  promo NA    <NULL>
## 10 Pro Tour       pPRO  NA           pro              NA      2007-02-09  black  promo NA    <NULL>
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>

glimpse(all_sets)
## Observations: 221
## Variables: 14
## $ name               <chr> "Unstable", "Unhinged", "Unglued", "Wizards of the Coast Online Store"...
## $ code               <chr> "UST", "UNH", "UGL", "pWOS", "pWOR", "pWCQ", "pSUS", "pSUM", "pREL", "...
## $ gathererCode       <chr> NA, NA, "UG", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ magicCardsInfoCode <chr> NA, "uh", "ug", "wotc", "wrl", "wmcq", "sus", "sum", "rep", "pro", "pt...
## $ oldCode            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ releaseDate        <chr> "2017-12-08", "2004-11-20", "1998-08-11", "1999-09-04", "1999-08-04", ...
## $ border             <chr> "silver", "silver", "silver", "black", "black", "black", "black", "bla...
## $ type               <chr> "un", "un", "un", "promo", "promo", "promo", "promo", "promo", "promo"...
## $ block              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ booster            <list> [["rare", "uncommon", "uncommon", "uncommon", "common", "common", "co...
## $ translations       <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
## $ mkm_name           <chr> "Unstable", "Unhinged", "Unglued", NA, NA, NA, NA, "Summer Magic", NA,...
## $ mkm_id             <int> 1821, 59, 22, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cards              <list> [[["Andrea Radeck", 1, ["W"], ["White"], "95ebdf85f4ea74d584dfdfb72e3...

And we can arrange them by releaseDate after converting that column to a proper Date object:

mutate(all_sets, releaseDate = lubridate::ymd(releaseDate)) %>%
  arrange(desc(releaseDate))
## # A tibble: 221 x 14
##    name        code  gathererCode magicCardsInfoCo… oldCode releaseDate border type     block booster
##    <chr>       <chr> <chr>        <chr>             <chr>   <date>      <chr>  <chr>    <chr> <list>
##  1 Masters 25  A25   NA           a25               NA      2018-03-16  black  reprint  NA    <NULL>
##  2 Rivals of … RIX   NA           rix               NA      2018-01-19  black  expansi… Ixal… <list …
##  3 Unstable    UST   NA           NA                NA      2017-12-08  silver un       NA    <list …
##  4 Explorers … E02   NA           e02               NA      2017-11-24  black  board g… NA    <NULL>
##  5 From the V… V17   NA           v17               NA      2017-11-24  black  from th… NA    <NULL>
##  6 Iconic Mas… IMA   NA           ima               NA      2017-11-17  black  reprint  NA    <list …
##  7 Duel Decks… DDT   NA           ddt               NA      2017-11-10  black  duel de… NA    <NULL>
##  8 Ixalan      XLN   NA           xln               NA      2017-09-29  black  expansi… Ixal… <list …
##  9 Commander … C17   NA           NA                NA      2017-08-25  black  command… NA    <NULL>
## 10 Hour of De… HOU   NA           hou               NA      2017-07-14  black  expansi… Amon… <list …
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>
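
If you'd rather not pull in lubridate, the same ordering should be doable in base R with as.Date() and order(). A minimal sketch on a three-row stand-in data frame (the codes and dates are the ones shown for UST, UNH and UGL in the output above; the sets object itself is just for illustration):

```r
# Stand-in for the flattened table: release dates start out as character strings.
sets <- data.frame(
  code = c("UGL", "UST", "UNH"),
  releaseDate = c("1998-08-11", "2017-12-08", "2004-11-20"),
  stringsAsFactors = FALSE
)

# Convert to Date so the ordering is chronological by type, not just by string.
sets$releaseDate <- as.Date(sets$releaseDate)

newest_first <- sets[order(sets$releaseDate, decreasing = TRUE), ]
oldest_first <- sets[order(sets$releaseDate), ]
```

ISO-formatted strings like these happen to sort correctly even as text, but converting to Date makes date arithmetic and filtering by range straightforward.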