Reading Excel in R: how to find the start cell in messy spreadsheets

leb*_*noz 12 excel r

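I'm trying to write R code to read data from a bunch of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column holds dates and the second column has "Monthly return" as its header. In this example, the data starts in cell B5: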

Sample spreadsheet

How can I use R to automatically search the Excel cells for the "Monthly return" string?

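At the moment, the best idea I can come up with is to load everything into R starting from cell A1 and sort out the mess in the resulting (huge) matrix. I'm hoping for a more elegant solution.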

Raf*_*yas 8

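I haven't found a way to do this elegantly, but I'm very familiar with the problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand that different reports come in different formats, and that can be a pain.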

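For a slightly different version of these annoyingly formatted spreadsheets, I do the following. It's not the most elegant approach (it reads the file twice), but it works. I like reading the file twice to make sure the columns get the correct types and the headers come out right. It's easy to mess up column imports, so I'd rather have my code read the file twice than clean up the columns myself, and the read_excel defaults are pretty good if you start at the right row.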

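Also, it's worth noting that as of today (2017-04-20), readxl has had an update. I installed the new version to see whether it would make this easy, but I don't believe it does, although I could be mistaken.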

library(readxl)
library(stringr)
library(dplyr)

f_path <- file.path("whatever.xlsx")

if (!file.exists(f_path)) {
  f_path <- file.choose()
}

# I read this twice, temp_read to figure out where the data actually starts...

# Maybe you need something like this - 
#   excel_sheets <- readxl::excel_sheets(f_path)
#   desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)

skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"  # header text to look for; the sheet in the question uses "Monthly return"
max_cols_to_search <- 10
max_rows_to_search <- 10

# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0

}

# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)

# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))
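As a rough sketch of the same search step, here is a small base-R helper that looks through the top-left corner of a raw import and returns the row and column of the first matching cell. The name `find_header_cell` and its arguments are made up for illustration (they are not part of readxl), and the toy data frame stands in for a sheet read without column names. It assumes the sheet has no completely blank leading rows; readxl skips those automatically, so the offset may need adjusting if it does.

```r
# Hypothetical helper (not part of readxl): locate the first cell whose text
# contains `pattern` within the first `max_rows` x `max_cols` cells.
find_header_cell <- function(raw, pattern, max_rows = 10, max_cols = 10) {
  n_rows <- min(nrow(raw), max_rows)
  n_cols <- min(ncol(raw), max_cols)
  for (j in seq_len(n_cols)) {
    # as.character() so factor, numeric, or date columns can be searched too
    cells <- as.character(raw[[j]])[seq_len(n_rows)]
    hits <- which(grepl(pattern, cells, fixed = TRUE))
    if (length(hits) > 0) {
      return(c(row = hits[1], col = j))
    }
  }
  NULL  # not found: the caller decides how to handle this
}

# Toy stand-in for a sheet imported without column names
raw <- data.frame(
  V1 = c("MY COMPANY PTY LTD", "MC Pension Fund", NA, NA, "Mar-14", "Apr 14"),
  V2 = c(NA, NA, NA, "Monthly return", "0.907%", "0.069%"),
  stringsAsFactors = FALSE
)

cell <- find_header_cell(raw, "Monthly return")
skip_rows <- cell[["row"]] - 1  # number of rows above the header row
```

With a real file, `raw` would come from something like `readxl::read_excel(f_path, col_names = FALSE)`, and `skip_rows` would then be passed as `skip` on the second read. Returning `NULL` when nothing matches makes it harder to accidentally re-read the file with an empty `skip`.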


Big*_*ist 7

Okay, since the format was specified as xls, I've updated this from csv loading to the xls loading that was correctly suggested.

library(readxl)
# col_names = FALSE keeps the first sheet row as data instead of using it as headers
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)

You will get something like:

data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("", 
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund", 
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"), 
    V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("", 
    "0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c(NA, -7L))

Then you can locate the "Monthly return" cell dynamically and use it to pick out your matrix.

# arr.ind = TRUE returns the (row, col) position of the matching cell
targetCell <- which(data == "Monthly return", arr.ind = TRUE)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
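To make the lookup concrete, here is a minimal sketch that runs it on the sample values shown above (rebuilt as plain character columns rather than factors) and then tidies the result. The column names `date` and `monthly_return` are my own choice, and the percent conversion assumes every value looks like "0.907%".

```r
# Sample values from the structure above, as plain character columns
data <- data.frame(
  V1 = c("MY COMPANY PTY LTD", "MC Pension Fund", "GROSS PERFROANCE DETAILS",
         "updated by JS on 6/4/2017", "", "Mar-14", "Apr 14"),
  V2 = c("", "", "", "", "Monthly return", "0.907%", "0.069%"),
  stringsAsFactors = FALSE
)

# Position of the header cell; stop early if it is missing or ambiguous
target <- which(data == "Monthly return", arr.ind = TRUE)
stopifnot(nrow(target) == 1)
header_row <- target[1, "row"]
value_col <- target[1, "col"]

# Everything below the header: the date column sits just left of the values
returns <- data[(header_row + 1):nrow(data), c(value_col - 1, value_col)]
names(returns) <- c("date", "monthly_return")  # assumed names, for illustration
rownames(returns) <- NULL

# "0.907%" -> 0.00907 (assumes a trailing percent sign on every value)
returns$monthly_return <-
  as.numeric(sub("%", "", returns$monthly_return, fixed = TRUE)) / 100
```

The dates are left as text here because the sample mixes formats ("Mar-14" and "Apr 14"), so parsing them would need a per-format rule.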