How to unpack tuple format in R?

jhy*_*eon 9 tuples r reshape dataframe data.table

Here is the dataset.

library(data.table)

x <- structure(list(id = c("A", "B" ),
                    segment_stemming = c("[('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]", 
                                          "[('Dung-caroon-gye', 'Noun'), ('in', 'Josa'), ('innovation', 'Noun')]" )), 
               row.names = c(NA, -2L), 
               class = c("data.table", "data.frame" ))

x
# id                                                     segment_stemming
# 1:  A               [('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]
# 2:  B [('Dung-caroon-gye', 'Noun'), ('in', 'Josa'), ('innovation', 'Noun')]


I want to split the tuples into rows. Here is my expected result.

id             segment_stemming
A              ('Brownie', 'Noun')
A              ('From', 'Josa')
A              ('Pi', 'Noun')
B              ('Dung-caroon-gye', 'Noun')
B              ('in', 'Josa')
B              ('innovation', 'Noun')

I searched for how to handle tuple format in R, but could not find any clue that leads to this result.

Tho*_*ing 11

data.table approach

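Here is an option using data.table + reticulate. The gsub call wraps each tuple in double quotes, so py_eval reads the string as a Python list of strings and returns it as an R character vector.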

library(reticulate)
library(data.table)
setDT(x)[
  ,
  segment_stemming := gsub("(\\(.*?\\))", '\"\\1\"', segment_stemming)
][
  ,
  lapply(.SD, py_eval),
  id
]

This gives

   id            segment_stemming
1:  A         ('Brownie', 'Noun')
2:  A            ('From', 'Josa')
3:  A              ('Pi', 'Noun')
4:  B ('Dung-caroon-gye', 'Noun')
5:  B              ('in', 'Josa')
6:  B      ('innovation', 'Noun')

Another data.table option, using strsplit + trimws, is as follows

library(data.table)
setDT(x)[
  ,
  .(segment_stemming = trimws(
    unlist(strsplit(segment_stemming, "(?<=\\)),\\s+(?=\\()", perl = TRUE)),
    whitespace = "\\[|\\]"
  )),
  id
]

which gives

   id            segment_stemming
1:  A         ('Brownie', 'Noun')
2:  A            ('From', 'Josa')
3:  A              ('Pi', 'Noun')
4:  B ('Dung-caroon-gye', 'Noun')
5:  B              ('in', 'Josa')
6:  B      ('innovation', 'Noun')
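To see what the lookaround pattern in the strsplit option does, it can help to run it on a single string first. The sketch below is only an illustration: the variable names are made up here, and the bracket-stripping pattern is written as a character class, a slightly stricter variant of the one used above that gives the same result on this data. The whitespace argument of trimws assumes R 3.6.0 or later.

```r
# One input string, copied from the question's data
s <- "[('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]"

# Split only on a comma that directly follows ")" and, after whitespace,
# is directly followed by "(". Commas inside a tuple are left alone.
parts <- strsplit(s, "(?<=\\)),\\s+(?=\\()", perl = TRUE)[[1]]
parts
# [1] "[('Brownie', 'Noun')" "('From', 'Josa')"     "('Pi', 'Noun')]"

# Strip the leftover square brackets from both ends of each piece
cleaned <- trimws(parts, whitespace = "[\\[\\]]")
cleaned
# [1] "('Brownie', 'Noun')" "('From', 'Josa')"    "('Pi', 'Noun')"
```

The lookbehind and lookahead only assert their context without consuming it, so the parentheses stay attached to each tuple.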

Base R

Some base R options should work as well

with(
  x,
  setNames(
    rev(
      stack(
        tapply(
          segment_stemming,
          id,
          function(v) {
            trimws(
              unlist(strsplit(v, "(?<=\\)),\\s+(?=\\()", perl = TRUE)),
              whitespace = "\\[|\\]"
            )
          }
        )
      )
    ),
    names(x)
  )
)

or

with(
  x,
  setNames(
    rev(
      stack(
        setNames(
          regmatches(segment_stemming, gregexpr("\\(.*?\\)", segment_stemming)),
          id
        )
      )
    ),
    names(x)
  )
)


Maë*_*aël 5

Here is one way, using separate_rows

library(tidyverse)

x %>% 
  mutate(segment_stemming = gsub("\\[|\\]", "", segment_stemming)) %>% 
  separate_rows(segment_stemming, sep = ",\\s*(?![^()]*\\))")

# A tibble: 6 x 2
  id    segment_stemming           
  <chr> <chr>                      
1 A     ('Brownie', 'Noun')        
2 A     ('From', 'Josa')           
3 A     ('Pi', 'Noun')             
4 B     ('Dung-caroon-gye', 'Noun')
5 B     ('in', 'Josa')             
6 B     ('innovation', 'Noun') 

One way to get a tidier result is to do some further manipulation (unnest_wider is not strictly required).

x %>% 
  mutate(segment_stemming = gsub("\\[|\\]", "", segment_stemming)) %>% 
  separate_rows(segment_stemming, sep = ",\\s*(?![^()]*\\))") %>% 
  mutate(segment_stemming = segment_stemming %>% 
           str_remove_all("[()',]") %>% 
           str_split(" ")) %>% 
  unnest_wider(segment_stemming)

# A tibble: 6 x 3
  id    ...1            ...2 
  <chr> <chr>           <chr>
1 A     Brownie         Noun 
2 A     From            Josa 
3 A     Pi              Noun 
4 B     Dung-caroon-gye Noun 
5 B     in              Josa 
6 B     innovation      Noun 
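If named columns are preferred over ...1 and ...2, a base R variant along the same lines might look like the sketch below. The column names word and tag and the helper variables are assumptions introduced here for illustration, and the pattern assumes each tuple holds exactly two single-quoted fields with no embedded quotes or parentheses.

```r
# Same data as in the question, built as a plain data.frame
x <- data.frame(
  id = c("A", "B"),
  segment_stemming = c(
    "[('Brownie', 'Noun'), ('From', 'Josa'), ('Pi', 'Noun')]",
    "[('Dung-caroon-gye', 'Noun'), ('in', 'Josa'), ('innovation', 'Noun')]"
  ),
  stringsAsFactors = FALSE
)

# Pull out every "(...)" group per row; the result is a list of character vectors
tuples <- regmatches(x$segment_stemming, gregexpr("\\([^)]*\\)", x$segment_stemming))
flat <- unlist(tuples)

# Capture the two quoted fields of each tuple
pattern <- "^\\('([^']*)',\\s*'([^']*)'\\)$"
result <- data.frame(
  id   = rep(x$id, lengths(tuples)),  # repeat each id once per extracted tuple
  word = sub(pattern, "\\1", flat, perl = TRUE),
  tag  = sub(pattern, "\\2", flat, perl = TRUE),
  stringsAsFactors = FALSE
)
result
```

Because the fields are captured between the quotes rather than split on spaces, a token that itself contains a space would stay in one piece.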