Bri*_*ian 2 json nested r dataframe
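I am trying to flatten a deeply/irregularly nested list/JSON object into a data frame in R.
The key names are consistent, but the number of nested elements varies from one element to the next.
I tried using jsonlite and the tidyr::unnest function to flatten the list, but tidyr::unnest cannot unnest a list column that contains multiple new columns. I also tried the map functions from the purrr package, but could not get anything to work.
Below is a subset of the JSON data; an R list object is included at the end of this post.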
[
  {
    "name": ["Hillary Clinton"],
    "type": ["PERSON"],
    "metadata": {
      "mid": ["/m/0d06m5"],
      "wikipedia_url": ["http://en.wikipedia.org/wiki/Hillary_Clinton"]
    },
    "salience": [0.2883],
    "mentions": [
      {
        "text": {
          "content": ["Clinton"],
          "beginOffset": [132]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mrs."],
          "beginOffset": [127]
        },
        "type": ["COMMON"]
      },
      {
        "text": {
          "content": ["Clinton"],
          "beginOffset": [403]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mrs."],
          "beginOffset": [398]
        },
        "type": ["COMMON"]
      },
      {
        "text": {
          "content": ["Hillary Clinton"],
          "beginOffset": [430]
        },
        "type": ["PROPER"]
      }
    ]
  },
  {
    "name": ["Trump"],
    "type": ["PERSON"],
    "metadata": {
      "mid": ["/m/0cqt90"],
      "wikipedia_url": ["http://en.wikipedia.org/wiki/Donald_Trump"]
    },
    "salience": [0.245],
    "mentions": [
      {
        "text": {
          "content": ["Trump"],
          "beginOffset": [24]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mr."],
          "beginOffset": [20]
        },
        "type": ["COMMON"]
      }
    ]
  }
]
The desired output is a data frame like the one below, in which the outer elements are repeated and each innermost element gets its own row.
name type metadata.mid metadata.wikipedia_url salience mentions.text.content mentions.text.beginOffset mentions.type
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Clinton 132 PROPER
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Mrs. 127 COMMON
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Clinton 403 PROPER
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Mrs. 398 COMMON
Hillary Clinton PERSON /m/0d06m5 http://en.wikipedia.org/wiki/Hillary_Clinton 0.2883 Hillary Clinton 430 PROPER
Trump PERSON /m/0cqt90 http://en.wikipedia.org/wiki/Donald_Trump 0.245 Trump 24 PROPER
Trump PERSON /m/0cqt90 http://en.wikipedia.org/wiki/Donald_Trump 0.245 Mr. 20 COMMON
Is there a general/scalable way to flatten this type of data?
An R list object:
nested_list <- list(structure(list(name = "Hillary Clinton", type = "PERSON",
metadata = structure(list(mid = "/m/0d06m5", wikipedia_url = "http://en.wikipedia.org/wiki/Hillary_Clinton"), .Names = c("mid",
"wikipedia_url")), salience = 0.28831193, mentions = list(
structure(list(text = structure(list(content = "Clinton",
beginOffset = 132L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mrs.", beginOffset = 127L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")), structure(list(text = structure(list(content = "Clinton",
beginOffset = 403L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mrs.", beginOffset = 398L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")), structure(list(text = structure(list(content = "Hillary Clinton",
beginOffset = 430L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")))), .Names = c("name",
"type", "metadata", "salience", "mentions")), structure(list(
name = "Trump", type = "PERSON", metadata = structure(list(
mid = "/m/0cqt90", wikipedia_url = "http://en.wikipedia.org/wiki/Donald_Trump"), .Names = c("mid",
"wikipedia_url")), salience = 0.24501903, mentions = list(
structure(list(text = structure(list(content = "Trump",
beginOffset = 24L), .Names = c("content", "beginOffset"
)), type = "PROPER"), .Names = c("text", "type")), structure(list(
text = structure(list(content = "Mr.", beginOffset = 20L), .Names = c("content",
"beginOffset")), type = "COMMON"), .Names = c("text",
"type")))), .Names = c("name", "type", "metadata", "salience",
"mentions")))
One approach:
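library(purrr)
library(dplyr)

map_df(nested_list, function(x) {
  # Flatten the entity-level fields (including metadata) into a one-row data frame
  df <- flatten_df(x[c("name", "type", "metadata", "salience")])
  # Build one row per mention, then repeat the entity-level fields on every row
  map_df(x$mentions, ~c(as.list(.$text), mentions_type=.$type)) %>%
    mutate(name=df$name, type=df$type, mid=df$mid,
           wikipedia_url=df$wikipedia_url, salience=df$salience)
}) %>% glimpse()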
## Observations: 7
## Variables: 8
## $ content <chr> "Clinton", "Mrs.", "Clinton", "Mrs.", "Hillary Clinton", "Trump", "Mr."
## $ beginOffset <int> 132, 127, 403, 398, 430, 24, 20
## $ mentions_type <chr> "PROPER", "COMMON", "PROPER", "COMMON", "PROPER", "PROPER", "COMMON"
## $ name <chr> "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Trump", "Trump"
## $ type <chr> "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON"
## $ mid <chr> "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0cqt90", "/m/0cqt90"
## $ wikipedia_url <chr> "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikiped...
## $ salience <dbl> 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2450190, 0.2450190
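Newer versions of tidyr may handle this case more directly than tidyr::unnest did when the question was asked. The following is a minimal sketch, assuming tidyr 1.0.0 or later (which provides unnest_wider() and unnest_longer()). The name sample_entities is hypothetical: it is an abbreviated copy of nested_list so that the block runs on its own.

```r
library(tibble)
library(tidyr)

# Hypothetical, abbreviated copy of nested_list (assumption: same shape as the question's data)
sample_entities <- list(
  list(
    name = "Hillary Clinton", type = "PERSON",
    metadata = list(mid = "/m/0d06m5",
                    wikipedia_url = "http://en.wikipedia.org/wiki/Hillary_Clinton"),
    salience = 0.2883,
    mentions = list(
      list(text = list(content = "Clinton", beginOffset = 132L), type = "PROPER"),
      list(text = list(content = "Mrs.", beginOffset = 127L), type = "COMMON")
    )
  ),
  list(
    name = "Trump", type = "PERSON",
    metadata = list(mid = "/m/0cqt90",
                    wikipedia_url = "http://en.wikipedia.org/wiki/Donald_Trump"),
    salience = 0.245,
    mentions = list(
      list(text = list(content = "Trump", beginOffset = 24L), type = "PROPER")
    )
  )
)

flat_tidy <- tibble(entity = sample_entities) %>%
  unnest_wider(entity) %>%                      # name, type, metadata, salience, mentions
  unnest_wider(metadata) %>%                    # mid, wikipedia_url
  unnest_longer(mentions) %>%                   # one row per mention
  unnest_wider(mentions, names_sep = "_") %>%   # mentions_text, mentions_type
  unnest_wider(mentions_text)                   # content, beginOffset

flat_tidy
```

The names_sep argument is used on the mentions step because each mention has its own type field, which would otherwise collide with the entity-level type column.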
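For the question of a more general method, one possibility is a small recursive helper in base R. This is only a sketch under stated assumptions, not an established API: flatten_node is a hypothetical name, named lists are treated as objects, unnamed lists as arrays of records, and every record in an array is assumed to carry the same keys (otherwise rbind fails). An empty array would also drop its parent row. sample_entities is again a hypothetical, abbreviated copy of nested_list.

```r
# Hypothetical helper: recursively flatten a nested list into a data frame
flatten_node <- function(x, prefix = "") {
  if (!is.list(x)) {
    # Leaf value: a single column named after the path to it
    out <- data.frame(x, stringsAsFactors = FALSE)
    names(out) <- prefix
    return(out)
  }
  if (is.null(names(x))) {
    # Unnamed list = array of records: stack one block of rows per element
    parts <- lapply(x, flatten_node, prefix = prefix)
    return(do.call(rbind, parts))
  }
  # Named list = object: flatten each field, then cross-join the pieces
  parts <- Map(function(child, key) {
    flatten_node(child, if (nzchar(prefix)) paste(prefix, key, sep = ".") else key)
  }, x, names(x))
  Reduce(function(a, b) merge(a, b, by = NULL), parts)
}

# Hypothetical, abbreviated copy of nested_list (assumption: same shape as the question's data)
sample_entities <- list(
  list(
    name = "Hillary Clinton", type = "PERSON",
    metadata = list(mid = "/m/0d06m5",
                    wikipedia_url = "http://en.wikipedia.org/wiki/Hillary_Clinton"),
    salience = 0.2883,
    mentions = list(
      list(text = list(content = "Clinton", beginOffset = 132L), type = "PROPER"),
      list(text = list(content = "Mrs.", beginOffset = 127L), type = "COMMON")
    )
  ),
  list(
    name = "Trump", type = "PERSON",
    metadata = list(mid = "/m/0cqt90",
                    wikipedia_url = "http://en.wikipedia.org/wiki/Donald_Trump"),
    salience = 0.245,
    mentions = list(
      list(text = list(content = "Trump", beginOffset = 24L), type = "PROPER")
    )
  )
)

flat <- flatten_node(sample_entities)
flat
```

Because merge(a, b, by = NULL) returns a Cartesian product, the one-row entity fields are repeated for every mention, and the column names follow the dotted paths used in the desired output (for example mentions.text.content).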