Importing and analysing non-rectangular .csv files in R

Mar*_*ley 3 import r

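I'm moving to R from Mathematica, where I don't need to anticipate data structures during import. In particular, I don't need to know in advance whether my data is rectangular.

I have many .csv files formatted as follows: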

tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham

Rows have unequal lengths and contain only strings.

In R, how should I approach this problem?

What have you tried?

I have tried read.table:

dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6   1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...

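From the documentation, I interpret this as a single column with each list of ingredients as a distinct row. I can extract the first three rows as follows. Each row is of class factor but appears to contain more data than I expect: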

dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]

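This is as far as I have pursued the question for now. I would appreciate advice on the suitability of read.table for this data structure.

My goal is to group the data by the first element of each row and analyse the differences between each type of recipe. In case it helps influence data-structure advice, in Mathematica I would do the following: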

dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]

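Discussion of answers

@G.Grothendieck has provided a solution using read.table with subsequent processing in reshape2. This looks immensely useful and I will investigate it later. The general advice here solves my problem, hence the acceptance.

@MrFlick's suggestion of the tm package, using DataframeSource, is useful for later analysis.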

G. *_*eck 5

Try read.table with fill = TRUE:

d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)

giving:

> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon                            
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper                            
4      okay olive_oil     onion potato black_pepper        
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham   
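One caveat worth checking: according to the read.table documentation, the number of columns is inferred from the first five lines of input unless col.names is supplied, so a file whose longest row appears later may not be padded as expected. Below is a minimal sketch of one way to guard against that with count.fields. The temporary file and the names path and n_col are illustrative assumptions, not part of the original answer.

```r
# Write the sample data from the question to a temporary file (illustrative only).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), path)

# Count the fields on every line and take the maximum.
n_col <- max(count.fields(path, sep = ","))

# Supplying col.names fixes the number of columns up front,
# so every row is padded to the widest row in the file.
d1 <- read.table(path, sep = ",", as.is = TRUE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_col)))
dim(d1)
```

For the six-line sample the result is the same as the plain fill = TRUE call; the difference only matters when the widest row lies beyond the lines read.table inspects.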

Reading with NAs

Or, to fill the empty cells with NA values, add na.strings = "":

d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

giving:

> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>
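Having real NA values, rather than empty strings, makes simple per-row summaries straightforward. As a rough sketch (the names n_ingredients and avg_by_rating are assumptions introduced here, and the data is passed inline through the text argument so the block runs on its own), the number of ingredients per recipe and its mean per rating could be computed like this:

```r
# Same sample data as in the question, supplied inline for a self-contained example.
d2 <- read.table(text = c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

# Count the non-missing ingredient cells in each row (column 1 is the rating).
n_ingredients <- rowSums(!is.na(d2[-1]))
n_ingredients  # 2 5 2 4 5 4 for the sample data

# Average number of ingredients for each rating.
avg_by_rating <- tapply(n_ingredients, d2$V1, mean)
avg_by_rating
```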

Long form

If you want it in long form:

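library(reshape2)
# d2 has no id column yet, so add a row id before melting;
# [-3] drops the "variable" column produced by melt
long <- na.omit(melt(transform(d2, id = seq_len(nrow(d2))), id.vars = c("id", "V1"))[-3])
long <- long[order(long$id), ]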

giving:

> long
   id        V1        value
1   1     tasty      chicken
7   1     tasty     cinnamon
2   2 not_tasty       butter
8   2 not_tasty       pepper
14  2 not_tasty        onion
20  2 not_tasty     cardamom
26  2 not_tasty      cayenne
3   3     tasty    olive_oil
9   3     tasty       pepper
4   4      okay    olive_oil
10  4      okay        onion
16  4      okay       potato
22  4      okay black_pepper
5   5 not_tasty       tomato
11  5 not_tasty    fenugreek
17  5 not_tasty       pepper
23  5 not_tasty        onion
29  5 not_tasty       potato
6   6     tasty       butter
12  6     tasty       cheese
18  6     tasty        wheat
24  6     tasty          ham
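The long form lends itself to the grouping the question asks about. The sketch below rebuilds an equivalent long table in base R, so that it runs without reshape2 (a substitution for the melt call above, not the method used in the answer), and then cross-tabulates ingredients against ratings with table. The name counts is an assumption introduced for illustration.

```r
# Sample data from the question, read with NA padding as above.
d2 <- read.table(text = c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

# Base R stand-in for melt: stack the ingredient columns one after another.
n_measure <- ncol(d2) - 1
long <- data.frame(
  id    = rep(seq_len(nrow(d2)), times = n_measure),
  V1    = rep(d2$V1, times = n_measure),
  value = unlist(d2[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$value), ]
long <- long[order(long$id), ]

# How often does each ingredient appear under each rating?
counts <- table(ingredient = long$value, rating = long$V1)
counts["pepper", ]
```

Because every row of long is one recipe and ingredient pair, most grouped summaries reduce to a call to table, tapply, or aggregate on this shape.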

Wide form 0/1 binary variables

To represent the variable portion as 0/1 binary variables, try this:

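# reshape2 provides dcast (cast belongs to the older reshape package)
wide <- dcast(long, id + V1 ~ value, value.var = "value")
# each cell holds the ingredient name or NA; convert to 1/0
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])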

giving this:

(screenshot of the resulting wide data frame, not reproduced here)
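For readers without reshape2, a comparable 0/1 table can be sketched in base R. This is an alternative technique, not the one in the answer, and the names ingredients, wide01, and wide are assumptions introduced here:

```r
# Sample data from the question, read with NA padding.
d2 <- read.table(text = c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

# All distinct ingredients, ignoring the NA padding.
x <- unlist(d2[-1], use.names = FALSE)
ingredients <- sort(unique(x[!is.na(x)]))

# One 0/1 column per ingredient: 1 if the recipe in that row contains it.
wide01 <- sapply(ingredients, function(ing) {
  as.integer(rowSums(d2[-1] == ing, na.rm = TRUE) > 0)
})

# Attach the row id and the rating in front of the indicator columns.
wide <- data.frame(id = seq_len(nrow(d2)), V1 = d2$V1, wide01,
                   check.names = FALSE)
wide[, c("id", "V1", "pepper", "onion")]
```

The comparison is exact string matching, so black_pepper and pepper remain separate columns.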

List in a data frame

A different representation is the following list within a data frame, so that ag$value is a list of character vectors:

ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]

giving:

> ag
  id        V1                                    value
4  1     tasty                        chicken, cinnamon
1  2 not_tasty butter, pepper, onion, cardamom, cayenne
5  3     tasty                        olive_oil, pepper
3  4      okay   olive_oil, onion, potato, black_pepper
2  5 not_tasty tomato, fenugreek, pepper, onion, potato
6  6     tasty               butter, cheese, wheat, ham

> str(ag)
'data.frame':   6 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6
 $ V1   : chr  "tasty" "not_tasty" "tasty" "okay" ...
 $ value:List of 6
  ..$ 15: chr  "chicken" "cinnamon"
  ..$ 1 : chr  "butter" "pepper" "onion" "cardamom" ...
  ..$ 17: chr  "olive_oil" "pepper"
  ..$ 11: chr  "olive_oil" "onion" "potato" "black_pepper"
  ..$ 6 : chr  "tomato" "fenugreek" "pepper" "onion" ...
  ..$ 19: chr  "butter" "cheese" "wheat" "ham"
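If a ragged structure closer to the Mathematica Cases call in the question is preferred, one option is to skip read.table and split each line yourself. The following is only a sketch under the assumption that fields never contain quoted commas. The temporary file and the names rows, rating, ingredients, and by_rating are illustrative.

```r
# Write the sample data to a temporary file (illustrative only).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), path)

# Each line becomes a character vector of its own length.
rows <- strsplit(readLines(path), ",", fixed = TRUE)

# First element is the rating, the rest are the ingredients.
rating      <- vapply(rows, `[`, character(1), 1L)
ingredients <- lapply(rows, `[`, -1L)

# Group the ingredient vectors by rating.
by_rating <- split(ingredients, rating)
by_rating$tasty
```

Here by_rating$tasty plays the role of the tasty variable in the Mathematica snippet: a list with one character vector per matching recipe.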