Dav*_*ino 1 json r dplyr jsonlite tidyr
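I am trying to display this dataset -> https://mtgjson.com/json/AllSets.json.zip
However, I want to flatten the data so that it is not nested as JSON data in lists within lists within lists.
More specifically, I am trying to display the data as a data frame, ordered by $releaseDate (one of the variables).
Here is my attempt so far: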
library(jsonlite)
library(tidyjson)
mtgdata <- fromJSON("~/path/to/file.json")
The resulting mtgdata is this list of lists:
summary(mtgdata)
Length Class Mode
UST 9 -none- list
UNH 10 -none- list
UGL 11 -none- list
pWOS 8 -none- list
pWOR 8 -none- list
pWCQ 8 -none- list
pSUS 8 -none- list
pSUM 10 -none- list
pREL 8 -none- list
pPRO 8 -none- list
pPRE 8 -none- list
pPOD 7 -none- list
pMPR 8 -none- list
pMGD 8 -none- list
pMEI 8 -none- list
pLPA 8 -none- list
pLGM 8 -none- list
pJGP 10 -none- list
pHHO 11 -none- list
pWPN 8 -none- list
pGTW 8 -none- list
pGRU 10 -none- list
pGPX 8 -none- list
pFNM 10 -none- list
pELP 8 -none- list
pDRC 7 -none- list
pCMP 8 -none- list
pCEL 8 -none- list
pARL 8 -none- list
pALP 10 -none- list
p2HG 8 -none- list
p15A 8 -none- list
PD3 9 -none- list
PD2 9 -none- list
H09 9 -none- list
PTK 12 -none- list
POR 12 -none- list
PO2 13 -none- list
PCA 7 -none- list
PC2 10 -none- list
HOP 10 -none- list
VMA 9 -none- list
MMA 10 -none- list
MM3 8 -none- list
MM2 11 -none- list
MED 9 -none- list
ME4 9 -none- list
ME3 9 -none- list
ME2 9 -none- list
IMA 8 -none- list
EMA 9 -none- list
A25 8 -none- list
MPS_AKH 8 -none- list
MPS 9 -none- list
EXP 9 -none- list
E02 7 -none- list
V17 8 -none- list
V16 7 -none- list
V15 9 -none- list
V14 9 -none- list
V13 9 -none- list
V12 10 -none- list
V11 10 -none- list
V10 9 -none- list
V09 10 -none- list
DRB 9 -none- list
EVG 9 -none- list
DDT 7 -none- list
DDS 7 -none- list
DDR 7 -none- list
DDQ 8 -none- list
DDP 10 -none- list
DDO 10 -none- list
DDN 10 -none- list
DDM 10 -none- list
DDL 10 -none- list
DDK 10 -none- list
DDJ 10 -none- list
DDI 10 -none- list
DDH 10 -none- list
DDG 10 -none- list
DDF 10 -none- list
DDE 10 -none- list
DDD 9 -none- list
DDC 9 -none- list
DD3_JVC 9 -none- list
DD3_GVL 9 -none- list
DD3_EVG 9 -none- list
DD3_DVD 9 -none- list
DD2 11 -none- list
CNS 11 -none- list
CN2 9 -none- list
CMD 11 -none- list
CMA 7 -none- list
CM1 10 -none- list
C17 6 -none- list
C16 8 -none- list
C15 10 -none- list
C14 10 -none- list
C13 10 -none- list
CEI 9 -none- list
CED 9 -none- list
E01 7 -none- list
ARC 9 -none- list
ZEN 12 -none- list
XLN 12 -none- list
WWK 12 -none- list
WTH 13 -none- list
W17 8 -none- list
W16 8 -none- list
VIS 13 -none- list
VAN 8 -none- list
USG 13 -none- list
ULG 13 -none- list
UDS 13 -none- list
TSP 12 -none- list
TSB 12 -none- list
TPR 11 -none- list
TOR 12 -none- list
TMP 13 -none- list
THS 12 -none- list
STH 13 -none- list
SOM 12 -none- list
SOK 12 -none- list
SOI 10 -none- list
SHM 12 -none- list
SCG 12 -none- list
S99 11 -none- list
S00 11 -none- list
RTR 12 -none- list
RQS 6 -none- list
ROE 12 -none- list
RIX 12 -none- list
RAV 12 -none- list
PLS 13 -none- list
PLC 12 -none- list
PCY 13 -none- list
ORI 11 -none- list
ONS 12 -none- list
OGW 10 -none- list
ODY 13 -none- list
NPH 12 -none- list
NMS 14 -none- list
MRD 12 -none- list
MOR 12 -none- list
MMQ 13 -none- list
MIR 13 -none- list
MGB 10 -none- list
MD1 9 -none- list
MBS 12 -none- list
M15 11 -none- list
M14 11 -none- list
M13 11 -none- list
M12 11 -none- list
M11 11 -none- list
M10 11 -none- list
LRW 12 -none- list
LGN 12 -none- list
LEG 12 -none- list
LEB 11 -none- list
LEA 11 -none- list
KTK 12 -none- list
KLD 9 -none- list
JUD 12 -none- list
JOU 12 -none- list
ITP 11 -none- list
ISD 12 -none- list
INV 13 -none- list
ICE 13 -none- list
HOU 9 -none- list
HML 12 -none- list
GTC 12 -none- list
GPT 12 -none- list
FUT 12 -none- list
FRF_UGIN 10 -none- list
FRF 12 -none- list
FEM 11 -none- list
EXO 13 -none- list
EVE 12 -none- list
EMN 9 -none- list
DTK 12 -none- list
DST 12 -none- list
DRK 12 -none- list
DPA 9 -none- list
DKM 9 -none- list
DKA 12 -none- list
DIS 12 -none- list
DGM 12 -none- list
CST 11 -none- list
CSP 12 -none- list
CP3 7 -none- list
CP2 7 -none- list
CP1 7 -none- list
CON 13 -none- list
CHR 11 -none- list
CHK 12 -none- list
BTD 10 -none- list
BRB 10 -none- list
BOK 12 -none- list
BNG 12 -none- list
BFZ 12 -none- list
AVR 12 -none- list
ATQ 11 -none- list
ATH 9 -none- list
ARN 11 -none- list
ARB 12 -none- list
APC 13 -none- list
ALL 13 -none- list
ALA 12 -none- list
AKH 9 -none- list
AER 9 -none- list
9ED 12 -none- list
8ED 12 -none- list
7ED 12 -none- list
6ED 12 -none- list
5ED 12 -none- list
5DN 12 -none- list
4ED 12 -none- list
3ED 12 -none- list
2ED 11 -none- list
10E 11 -none- list
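Inside each of these lists are the variables I am interested in analyzing, so that I can filter and sort the data as if it were a flat data frame.
When we inspect the variables in one of the lists (for example "mtgdata$UST"), we get this set of variables: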
names(mtgdata$UST)
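[1] "name"        "code"        "releaseDate" "border"      "type"        "booster"     "mkm_name"
[8] "mkm_id"      "cards"
Running the same query on another list in mtgdata ("mtgdata$SOI") returns a different set of variables, although most of them are the same.
As mentioned above, I am mainly interested in flattening this dataset and ordering it by releaseDate, but as things stand, "$releaseDate" is nested inside the first level of lists ("$UST" and so on).
Any help with this, or advice on how to phrase the question better, would be much appreciated.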
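You could convert the array of JSON objects into a file of ndjson records on the command line and then read it with something like ndjson::stream_in("filename_of the_thing_you_just_converted"), but you would end up with a very unhelpful "flat" data frame of 14,000+ columns.
Instead, do some exploring: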
library(tidyverse)

as1 <- jsonlite::read_json("~/Downloads/AllSets.json")

str(as1, 1)
## List of 221
##  $ UST :List of 9
##  $ UNH :List of 10
##  $ UGL :List of 11
##  $ pWOS :List of 8
##  $ pWOR :List of 8
##  $ pWCQ :List of 8
##  $ pSUS :List of 8
##  $ pSUM :List of 10
##  $ pREL :List of 8
##  $ pPRO :List of 8
##  $ pPRE :List of 8
##  $ pPOD :List of 7
##  $ pMPR :List of 8
##  $ pMGD :List of 8
##  $ pMEI :List of 8
##  $ pLPA :List of 8
##  $ pLGM :List of 8
##  $ pJGP :List of 10
##  $ pHHO :List of 11
##  ...

Ugh… this is one of "those" JSON files that does not populate every element of each record, even though the whole file is, in theory, supposed to be consistent.

Let's see which top-level elements have the most fields populated, since those are the ones most likely to be complete:

# Note: broom::tidy() on a plain named vector may be unavailable in newer broom
# releases; tibble::enframe() is a possible substitute (its columns are named
# name/value rather than names/x).
map_dbl(as1, length) %>%
  broom::tidy() %>%
  arrange(desc(x))
## # A tibble: 221 x 2
##    names     x
##    <chr> <dbl>
##  1 NMS    14.0
##  2 PO2    13.0
##  3 WTH    13.0
##  4 VIS    13.0
##  5 USG    13.0
##  6 ULG    13.0
##  7 UDS    13.0
##  8 TMP    13.0
##  9 STH    13.0
## 10 PLS    13.0
## # ... with 211 more rows

Let's look at NMS:
str(as1[["NMS"]], 1)
## List of 14
##  $ name              : chr "Nemesis"
##  $ code              : chr "NMS"
##  $ gathererCode      : chr "NE"
##  $ magicCardsInfoCode: chr "ne"
##  $ oldCode           : chr "NEM"
##  $ releaseDate       : chr "2000-02-14"
##  $ border            : chr "black"
##  $ type              : chr "expansion"
##  $ block             : chr "Masques"
##  $ booster           :List of 15
##  $ translations      :List of 5
##  $ mkm_name          : chr "Nemesis"
##  $ mkm_id            : int 32
##  $ cards             :List of 143

You really don't want to flatten booster, translations or cards. Keep them as list columns and unnest them as needed.
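However, since each record has different fields, we can't simply use data.table::rbindlist() or dplyr::bind_rows(), because they will complain about some of those columns.
We have to go record by record and turn each one into a data frame, handling the missing fields and wrapping the list elements in list(). We'll use a helper function to simplify the idiom for testing for missing values: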
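`%l0%` <- function(x, y) if (length(x) > 0) x else y

The helper above is a bit more robust than the %||% that comes along with purrr, because it also falls back to the default for zero-length values and not only for NULL.
Finally: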
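To make the difference between the two helpers concrete, here is a minimal sketch on a made-up record. The operator `%or_null%` is a hypothetical stand-in that is assumed to mimic a NULL-only check like `%||%`; it is defined locally so the example does not depend on purrr, and the `rec` list is illustrative data, not taken from AllSets.json.

```r
# Helper from the answer: fall back when the value has length zero
# (this covers both NULL and empty vectors such as character(0)).
`%l0%` <- function(x, y) if (length(x) > 0) x else y

# Hypothetical stand-in for a NULL-only fallback (assumption, defined here
# for illustration only).
`%or_null%` <- function(x, y) if (is.null(x)) y else x

# Made-up record: 'oldCode' is absent, 'block' is present but empty.
rec <- list(name = "Nemesis", block = character(0))

missing_field         <- rec$oldCode %l0% NA_character_
empty_field           <- rec$block %l0% NA_character_
empty_field_null_only <- rec$block %or_null% NA_character_

length(missing_field)          # one NA value
length(empty_field)            # one NA value
length(empty_field_null_only)  # still zero-length
```

A zero-length value cannot fill the single cell of a one-row data frame, which is why the length-based check is the safer choice when building one row per record.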

# Build one row per set. tibble() replaces the deprecated data_frame(), and
# mkm_id falls back to NA_integer_ so the column stays an integer.
map_df(as1, ~{
  tibble(
    name = .x$name %l0% NA_character_,
    code = .x$code,
    gathererCode = .x$gathererCode %l0% NA_character_,
    magicCardsInfoCode = .x$magicCardsInfoCode %l0% NA_character_,
    oldCode = .x$oldCode %l0% NA_character_,
    releaseDate = .x$releaseDate %l0% NA_character_,
    border = .x$border,
    type = .x$type,
    block = .x$block %l0% NA_character_,
    booster = list(.x$booster),           # keep nested data as list columns
    translations = list(.x$translations),
    mkm_name = .x$mkm_name %l0% NA_character_,
    mkm_id = .x$mkm_id %l0% NA_integer_,
    cards = list(.x$cards)
  )
}) -> all_sets

And you can see the result:

all_sets
## # A tibble: 221 x 14
##    name            code  gathererCode magicCardsInfoC… oldCode releaseDate border type  block booster
##    <chr>           <chr> <chr>        <chr>            <chr>   <chr>       <chr>  <chr> <chr> <list>
##  1 Unstable        UST   NA           NA               NA      2017-12-08  silver un    NA    <list […
##  2 Unhinged        UNH   NA           uh               NA      2004-11-20  silver un    NA    <list […
##  3 Unglued         UGL   UG           ug               NA      1998-08-11  silver un    NA    <list […
##  4 Wizards of th…  pWOS  NA           wotc             NA      1999-09-04  black  promo NA    <NULL>
##  5 Worlds          pWOR  NA           wrl              NA      1999-08-04  black  promo NA    <NULL>
##  6 World Magic C…  pWCQ  NA           wmcq             NA      2013-04-06  black  promo NA    <NULL>
##  7 Super Series    pSUS  NA           sus              NA      1999-12-01  black  promo NA    <NULL>
##  8 Summer of Mag…  pSUM  NA           sum              NA      2007-07-21  black  promo NA    <NULL>
##  9 Release Events  pREL  NA           rep              NA      2003-07-26  black  promo NA    <NULL>
## 10 Pro Tour        pPRO  NA           pro              NA      2007-02-09  black  promo NA    <NULL>
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>

glimpse(all_sets)
## Observations: 221
## Variables: 14
## $ name               <chr> "Unstable", "Unhinged", "Unglued", "Wizards of the Coast Online Store"...
## $ code               <chr> "UST", "UNH", "UGL", "pWOS", "pWOR", "pWCQ", "pSUS", "pSUM", "pREL", "...
## $ gathererCode       <chr> NA, NA, "UG", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ magicCardsInfoCode <chr> NA, "uh", "ug", "wotc", "wrl", "wmcq", "sus", "sum", "rep", "pro", "pt...
## $ oldCode            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ releaseDate        <chr> "2017-12-08", "2004-11-20", "1998-08-11", "1999-09-04", "1999-08-04", ...
## $ border             <chr> "silver", "silver", "silver", "black", "black", "black", "black", "bla...
## $ type               <chr> "un", "un", "un", "promo", "promo", "promo", "promo", "promo", "promo"...
## $ block              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ booster            <list> [["rare", "uncommon", "uncommon", "uncommon", "common", "common", "co...
## $ translations       <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
## $ mkm_name           <chr> "Unstable", "Unhinged", "Unglued", NA, NA, NA, NA, "Summer Magic", NA,...
## $ mkm_id             <int> 1821, 59, 22, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cards              <list> [[["Andrea Radeck", 1, ["W"], ["White"], "95ebdf85f4ea74d584dfdfb72e3...

And we can arrange them after converting the releaseDate column to a proper Date object:
mutate(all_sets, releaseDate = lubridate::ymd(releaseDate)) %>%
  arrange(desc(releaseDate))
## # A tibble: 221 x 14
##    name         code  gathererCode magicCardsInfoCo… oldCode releaseDate border type     block booster
##    <chr>        <chr> <chr>        <chr>             <chr>   <date>      <chr>  <chr>    <chr> <list>
##  1 Masters 25   A25   NA           a25               NA      2018-03-16  black  reprint  NA    <NULL>
##  2 Rivals of …  RIX   NA           rix               NA      2018-01-19  black  expansi… Ixal… <list …
##  3 Unstable     UST   NA           NA                NA      2017-12-08  silver un       NA    <list …
##  4 Explorers …  E02   NA           e02               NA      2017-11-24  black  board g… NA    <NULL>
##  5 From the V…  V17   NA           v17               NA      2017-11-24  black  from th… NA    <NULL>
##  6 Iconic Mas…  IMA   NA           ima               NA      2017-11-17  black  reprint  NA    <list …
##  7 Duel Decks…  DDT   NA           ddt               NA      2017-11-10  black  duel de… NA    <NULL>
##  8 Ixalan       XLN   NA           xln               NA      2017-09-29  black  expansi… Ixal… <list …
##  9 Commander …  C17   NA           NA                NA      2017-08-25  black  command… NA    <NULL>
## 10 Hour of De…  HOU   NA           hou               NA      2017-07-14  black  expansi… Amon… <list …
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>
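
As noted earlier, the list columns can be unnested when they are actually needed. The following is only a rough sketch of that step on a tiny made-up table: `toy_sets`, its set codes, and the card fields `name` and `rarity` are illustrative assumptions, and the real cards entries carry many more (and less regular) fields. It assumes tidyr 1.0 or later for unnest_longer() and unnest_wider().

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up stand-in for all_sets: one row per set, with cards as a list column.
toy_sets <- tibble(
  code = c("AAA", "BBB"),
  releaseDate = as.Date(c("2000-02-14", "2017-12-08")),
  cards = list(
    list(list(name = "Card one", rarity = "Common"),
         list(name = "Card two", rarity = "Rare")),
    list(list(name = "Card three", rarity = "Uncommon"))
  )
)

cards_long <- toy_sets %>%
  unnest_longer(cards) %>%                 # one row per card
  unnest_wider(cards, names_sep = "_") %>% # card fields become cards_* columns
  arrange(desc(releaseDate))               # newest set first

cards_long
```

Using names_sep here keeps the card-level name from colliding with a set-level column of the same name, which would matter on the real data where both sets and cards have a name field.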