jhy*_*eon 9 tuples r reshape dataframe data.table
Here is the dataset.
library(data.table)
x <- structure(
  list(
    id = c("A", "B"),
    segment_stemming = c(
      "[('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]",
      "[('Dung-caroon-gye', 'Noun'), ('in', 'Josa'), ('innovation', 'Noun')]"
    )
  ),
  row.names = c(NA, -2L),
  class = c("data.table", "data.frame")
)
x
# id segment_stemming
# 1: A [('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]
# 2: B [('Dung-caroon-gye', 'Noun'), ('in', 'Josa'), ('innovation', 'Noun')]
I want to split the tuples into rows, one tuple per row. This is my expected result.
id segment_stemming
A ('Brownie', 'Noun')
A ('From', 'Josa')
A ('Pi', 'Noun')
B ('Dung-caroon-gye', 'Noun')
B ('in', 'Josa')
B ('innovation', 'Noun')
I have searched for how to handle this tuple format in R, but could not find any leads on how to get this result.
Tho*_*ing 11
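data.table approach
Here is an option using data.table + reticulate (reticulate needs a working Python installation, since py_eval evaluates the string as a Python expression):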
library(reticulate)
library(data.table)
setDT(x)[
,
segment_stemming := gsub("(\\(.*?\\))", '\"\\1\"', segment_stemming)
][
,
lapply(.SD, py_eval),
id
]
This gives
id segment_stemming
1: A ('Brownie', 'Noun')
2: A ('From', 'Josa')
3: A ('Pi', 'Noun')
4: B ('Dung-caroon-gye', 'Noun')
5: B ('in', 'Josa')
6: B ('innovation', 'Noun')
Another data.table option, using strsplit + trimws, is shown below (the whitespace argument of trimws requires R 3.6.0 or later):
library(data.table)
setDT(x)[
,
.(segment_stemming = trimws(
unlist(strsplit(segment_stemming, "(?<=\\)),\\s+(?=\\()", perl = TRUE)),
whitespace = "\\[|\\]"
)),
id
]
which gives
id segment_stemming
1: A ('Brownie', 'Noun')
2: A ('From', 'Josa')
3: A ('Pi', 'Noun')
4: B ('Dung-caroon-gye', 'Noun')
5: B ('in', 'Josa')
6: B ('innovation', 'Noun')
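For readers unfamiliar with the split pattern used above: it relies on lookarounds, so only the comma that sits between a closing and an opening parenthesis is consumed, while the comma inside each tuple is left alone. A minimal base R sketch on a single string (the variable names `s` and `parts` are made up for illustration, and the bracket removal uses gsub here instead of trimws):

```r
# One input string in the same format as the question's data
s <- "[('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]"

# (?<=\\)) : the comma must be preceded by ")"
# (?=\\()  : the whitespace after it must be followed by "("
parts <- strsplit(s, "(?<=\\)),\\s+(?=\\()", perl = TRUE)[[1]]

# Drop the leading "[" and trailing "]" left over from the list syntax
parts <- gsub("^\\[|\\]$", "", parts)

print(parts)
```

Lookarounds are only available with perl = TRUE, which is why that argument appears in every strsplit call in this answer.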
Some base R options should work as well:
with(
x,
setNames(
rev(
stack(
tapply(
segment_stemming,
id,
function(v) {
trimws(
unlist(strsplit(v, "(?<=\\)),\\s+(?=\\()", perl = TRUE)),
whitespace = "\\[|\\]"
)
}
)
)
),
names(x)
)
)
or
with(
x,
setNames(
rev(
stack(
setNames(
regmatches(segment_stemming, gregexpr("\\(.*?\\)", segment_stemming)),
id
)
)
),
names(x)
)
)
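Both base R variants depend on how stack() reshapes a named list: it returns a data frame with the columns values and ind, in that order, which is why rev() and setNames() are applied afterwards. A small sketch with made-up data (the names `lst`, `st`, `out` and the column name `value` are illustrative only):

```r
# A named list with a different number of elements per name
lst <- list(A = c("x", "y"), B = "z")

# stack() repeats each name once per element: columns are values, ind
st <- stack(lst)

# rev() swaps the column order so the id comes first, then rename
out <- setNames(rev(st), c("id", "value"))

print(out)
```

Note that the id column produced this way is a factor; wrap it in as.character() if a character column is needed.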
Here is one way using separate_rows:
library(tidyverse)
x %>%
mutate(segment_stemming = gsub("\\[|\\]", "", segment_stemming)) %>%
separate_rows(segment_stemming, sep = ",\\s*(?![^()]*\\))")
# A tibble: 6 x 2
id segment_stemming
<chr> <chr>
1 A ('Brownie', 'Noun')
2 A ('From', 'Josa')
3 A ('Pi', 'Noun')
4 B ('Dung-caroon-gye', 'Noun')
5 B ('in', 'Josa')
6 B ('innovation', 'Noun')
One way to get a nicer result is with a bit more manipulation (the unnest_wider step is not strictly necessary).
x %>%
mutate(segment_stemming = gsub("\\[|\\]", "", segment_stemming)) %>%
separate_rows(segment_stemming, sep = ",\\s*(?![^()]*\\))") %>%
mutate(segment_stemming = segment_stemming %>%
str_remove_all("[()',]") %>%
str_split(" ")) %>%
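unnest_wider(segment_stemming)
# Note: the output below is from an older tidyr. Newer releases (1.3.0 and later,
# to my knowledge) ask for names_sep when the list elements are unnamed,
# e.g. unnest_wider(segment_stemming, names_sep = "_")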
# A tibble: 6 x 3
id ...1 ...2
<chr> <chr> <chr>
1 A Brownie Noun
2 A From Josa
3 A Pi Noun
4 B Dung-caroon-gye Noun
5 B in Josa
6 B innovation Noun
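If named columns are preferred over ...1 and ...2, the two fields can also be pulled out of each tuple string directly. The sketch below uses only base R and assumes that no token contains an apostrophe; the column names `word` and `tag` are made up for illustration and do not come from the original answers.

```r
# Tuple strings as produced by any of the row-splitting approaches
tuples <- c("('Brownie', 'Noun')", "('From', 'Josa')", "('Pi', 'Noun')")

# Extract every single-quoted item from each tuple (two per tuple here)
fields <- regmatches(tuples, gregexpr("'[^']*'", tuples))

# Strip the quotes and bind the pairs into a two-column character matrix
mat <- do.call(rbind, lapply(fields, function(f) gsub("'", "", f)))

res <- data.frame(word = mat[, 1], tag = mat[, 2], stringsAsFactors = FALSE)
print(res)
```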