sda*_*aza 6 r stata read.table
I need to read a .dat file using a .dct file. Has anyone done this in R?
The format is:
dictionary {
# how many lines per record
_lines(1)
# start defining the first line
_line(1)
# starting column / storage type / variable name / read format / variable label
_column(1) str8 aid %8s "respondent identifier"
...
}
The 'read format' is as follows:
%2f 2 column integer variable
%12s 12 column string variable
%8.2f 8 column number with 2 implied decimal places.
The storage types are described here: http://www.stata.com/help.cgi?datatypes
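As a rough illustration of the read formats listed above, the following R sketch splits a format such as %8.2f into its width, its implied decimal places, and whether it is a string or a number. The function name parse_read_format is hypothetical, and it only handles the three shapes shown in this question.

```r
# Hypothetical helper: split read formats such as "%2f", "%12s" or
# "%8.2f" into width, implied decimals, and kind (string vs numeric).
parse_read_format <- function(fmt) {
  m <- regmatches(fmt, regexec("^%([0-9]+)(\\.([0-9]+))?([a-z])$", fmt))
  do.call(rbind, lapply(m, function(p) {
    if (length(p) == 0) stop("unrecognised read format")
    data.frame(
      width    = as.integer(p[2]),
      decimals = if (nzchar(p[4])) as.integer(p[4]) else 0L,
      kind     = if (p[5] == "s") "string" else "numeric",
      stringsAsFactors = FALSE
    )
  }))
}

fmt_info <- parse_read_format(c("%2f", "%12s", "%8.2f"))
fmt_info
```

The width column is what a fixed-width reader needs; the decimals column only matters if the raw values are stored without a decimal point.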
Other sites with information:
http://library.columbia.edu/indiv/dssc/technology/stata_write.html
http://www.stata.com/support/faqs/data-management/reading-fixed-format-data/
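The .dat file is a set of numbers corresponding to the variables specified in the .dct file. (Presumably this is data in fixed-width columns.)
Here is a real example:
.dct file: http://goo.gl/qHZOk
A specific example from the Stata site:
The .dat file ("test.raw" in this example)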
C1245A101George Costanza
B1223B011Cosmo Kramer
The .dct file
dictionary using test2.raw {
_column(1) str5 code %5s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}
The resulting data file:
     +-----------------------------------------------+
     |  code   call   city   neigh              name |
     |-----------------------------------------------|
  1. | C1245   1245      A     101   George Costanza |
  2. | B1223   1223      B      11      Cosmo Kramer |
     +-----------------------------------------------+
A5C*_*2T1 13
@thelatemail is right about how to proceed. Here is a small function I put together to get you started on a more robust solution:
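read.dat.dct <- function(dat, dct) {
  # Read the dictionary and capture, per line: start column, storage type,
  # variable name, and the width taken from the read format.
  temp <- readLines(dct)
  pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+([a-z0-9_]+)\\s+%([0-9]+).*"
  classes <- c("numeric", "character", "character", "numeric")
  metadata <- setNames(lapply(1:4, function(x) {
    # Keep only capture group x, then blank out the "dictionary ... {" and "}" lines
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    out <- gsub("^\\s+|\\s+$|.*\\{|\\}", "", out)
    out <- out[out != ""]
    class(out) <- classes[x] ; out }),
    c("StartPos", "Str", "ColName", "ColWidth"))
  # Only widths and names are used; StartPos is ignored, so the columns
  # must be contiguous and non-overlapping.
  read.fwf(dat, widths = metadata[["ColWidth"]],
           col.names = metadata[["ColName"]])
}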
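There is still a lot of work to do on error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, as in the example @thelatemail added to your question. A check that "StartPos[n] + ColWidth[n]" equals "StartPos[n + 1]" could be used to stop the file from being read, with an error message, when the columns do not line up. In addition, the classes of the resulting data could be derived from the "metadata" list the function generates and passed to read.fwf through the colClasses argument.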
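The check described above can be sketched as follows. This is only an illustration: check_contiguous is a hypothetical helper name, and it takes the start positions and widths as plain numeric vectors rather than the "metadata" list.

```r
# Hypothetical helper: stop unless every field starts exactly where the
# previous one ends, i.e. StartPos[n] + ColWidth[n] == StartPos[n + 1].
check_contiguous <- function(start, width) {
  n <- length(start)
  if (n < 2) return(invisible(TRUE))
  expected <- start[-n] + width[-n]
  bad <- which(expected != start[-1])
  if (length(bad) > 0) {
    stop("columns overlap or leave gaps after field(s): ",
         paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# A contiguous layout passes silently.
ok <- check_contiguous(start = c(1, 2, 6, 7, 10), width = c(1, 4, 1, 3, 16))

# The layout from the question does not: "code" spans columns 1-5,
# while "call" already starts at column 2.
res <- tryCatch(
  check_contiguous(start = c(1, 2, 6, 7, 10), width = c(5, 4, 1, 3, 16)),
  error = function(e) conditionMessage(e)
)
res
```

Calling such a check before read.fwf would turn a silently misaligned result into an explicit error.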
Here are a dat file and a dct file to demonstrate:
Copy and paste the following two lines into a text editor and save them as "test.dat" in your working directory.
C1245A101George Costanza
B1223B011Cosmo Kramer
Copy and paste the following lines into a text editor and save them as "test.dct" in your working directory.
dictionary using test.dat {
_column(1) str1 code %1s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}
Now, run the function:
read.dat.dct(dat = "test.dat", dct = "test.dct")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
Here is an updated version of the function that can also pick up the variable labels:
read.dat.dct <- function(dat, dct, labels.included = "no") {
  temp <- readLines(dct)
  temp <- temp[grepl("_column", temp)]   # keep only the column definitions
  # Choose the pattern and metadata fields depending on whether the
  # dictionary lines end with a quoted variable label.
  switch(labels.included,
         yes = {
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+)[a-z]\\s+(.*)"
           classes <- c("numeric", "character", "character", "numeric", "character")
           N <- 5
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth", "ColLabel")
         },
         no = {
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+).*"
           classes <- c("numeric", "character", "character", "numeric")
           N <- 4
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth")
         },
         stop("labels.included must be \"yes\" or \"no\""))
  metadata <- setNames(lapply(1:N, function(x) {
    # Extract capture group x, trim whitespace, and drop the quotes around labels
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    out <- gsub("^\\s+|\\s+$", "", out)
    out <- gsub('\"', "", out, fixed = TRUE)
    class(out) <- classes[x] ; out }), NAMES)
  # Make the variable names syntactically valid R names
  metadata[["ColName"]] <- make.names(gsub("\\s", "", metadata[["ColName"]]))
  myDF <- read.fwf(dat, widths = metadata[["ColWidth"]],
                   col.names = metadata[["ColName"]])
  if (labels.included == "yes") {
    # Store the variable labels as an attribute of the data frame
    attr(myDF, "col.label") <- metadata[["ColLabel"]]
  }
  myDF
}
How does it do with your data?
temp <- read.dat.dct(dat = "http://dl.getdropbox.com/u/18116710/21600-0009-Data.txt",
                     dct = "http://dl.getdropbox.com/u/18116710/21600-0009-Setup.dct",
                     labels.included = "yes")
dim(temp) # How big is the dataset?
# [1] 180 40
head(temp[, 1:6]) # What do the first few columns & rows look like?
#   CASEID      AID RRELNO RPREGNO H3PC1.H3PC1 H3PC2.H3PC2
# 1      1 57118381      5       1           1           1
# 2      2 57134970      1       2           1           1
# 3      3 57135078      1       1           1           1
# 4      4 57135078      5       1           1           1
# 5      5 57164981      1       1           7           3
# 6      6 57191909      1       3           1           1
head(attr(temp, "col.label")) # What are the variable labels?
# [1] "CASE IDENTIFICATION NUMBER" "RESPONDENT IDENTIFIER"
# [3] "ROMANTIC RELATIONSHIP NUMBER" "RELATIONSHIP PREGNANCY NUMBER"
# [5] "S23Q1 1 TOLD PARTNER PREGNANT-W3" "S23Q2 MONTHS PREG WHEN TOLD PARTNER-W3"
What about the original example?
read.dat.dct("test.dat", "test.dct", labels.included = "no")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
the*_*ail 10
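You can read the .dat file using ?read.fwf, since .dat data is essentially just a fixed-width data file.
See here - Organizing Messy Notepad data - and use the _column(X) values from the .dct dictionary file to work out the widths.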
You could scrape the dictionary file with readLines to extract this information, which can then be passed to the arguments of the read.fwf call.
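The scraping step might look something like the sketch below. It assumes dictionary lines of the simple form shown in the question (no variable labels), uses an in-memory character vector in place of readLines("test.dct"), and parse_dct is a hypothetical name.

```r
# Dictionary text as in the question's example; with a real file this
# vector would come from readLines().
dct_lines <- c(
  "dictionary using test.dat {",
  "    _column(1)   str1   code   %1s",
  "    _column(2)   int    call   %4f",
  "    _column(6)   str1   city   %1s",
  "    _column(7)   int    neigh  %3f",
  "    _column(10)  str16  name   %16s",
  "}"
)

# Hypothetical helper: pull start column, storage type, variable name
# and width out of every _column(...) line.
parse_dct <- function(lines) {
  pattern <- "_column\\(([0-9]+)\\)\\s+([A-Za-z0-9]+)\\s+([A-Za-z0-9_]+)\\s+%([0-9]+)"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 0]          # drop lines without a match
  data.frame(
    start = as.integer(vapply(hits, `[`, character(1), 2)),
    type  = vapply(hits, `[`, character(1), 3),
    name  = vapply(hits, `[`, character(1), 4),
    width = as.integer(vapply(hits, `[`, character(1), 5)),
    stringsAsFactors = FALSE
  )
}

dct <- parse_dct(dct_lines)
dct
```

The width and name columns can then be supplied as widths= and col.names= to read.fwf.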
For example: the 'variable name' lines up with the col.names= argument, and the 'storage type' lines up with the colClasses= argument.
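A minimal sketch of that second mapping, from Stata storage types to values usable in colClasses=, is shown here. The helper name stata_to_colclass is hypothetical, and treating byte/int/long as "integer" and float/double as "numeric" is an assumption about what is wanted on the R side.

```r
# Hypothetical helper: map Stata storage types to R column classes.
# Anything unrecognised becomes NA, which read.fwf/read.table treat as
# "guess the class".
stata_to_colclass <- function(type) {
  ifelse(grepl("^str", type), "character",
         ifelse(type %in% c("byte", "int", "long"), "integer",
                ifelse(type %in% c("float", "double"), "numeric",
                       NA_character_)))
}

col_classes <- stata_to_colclass(c("str5", "int", "str1", "int", "str16"))
col_classes
# "character" "integer" "character" "integer" "character"
```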
Even so, some manual processing will still be needed.
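One case that needs such manual handling is overlapping columns, as in the Stata example from the question, where code covers columns 1-5 and call starts at column 2. read.fwf reads fields one after another, so a possible workaround is to cut each field out with substring() instead. The sketch below assumes that approach; read_fixed is a hypothetical helper, and the start positions and widths are typed in by hand rather than parsed from the dictionary.

```r
# Hypothetical helper: extract each field by absolute position, so
# fields may overlap or leave gaps.
read_fixed <- function(lines, start, width, col_names) {
  cols <- lapply(seq_along(start), function(i) {
    trimws(substring(lines, start[i], start[i] + width[i] - 1))
  })
  out <- as.data.frame(setNames(cols, col_names), stringsAsFactors = FALSE)
  # Let R guess numeric vs character for each column
  out[] <- lapply(out, type.convert, as.is = TRUE)
  out
}

raw <- c("C1245A101George Costanza",
         "B1223B011Cosmo Kramer")

df <- read_fixed(raw,
                 start     = c(1, 2, 6, 7, 10),
                 width     = c(5, 4, 1, 3, 16),
                 col_names = c("code", "call", "city", "neigh", "name"))
df
```

Because every field is located by its own start position, the overlap between code and call is no longer a problem, at the cost of reading the whole file into memory with readLines first.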