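I'm trying to write R code to read data from a bunch of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constants are that the first column holds dates and the second column has "Monthly return" as its header. In this example, the data starts in cell B5.
How can I use R to automatically search the Excel cells for the "Monthly return" string?
At the moment, the best idea I can come up with is to load everything into R starting from cell A1 and sort out the mess in the resulting (huge) matrix. I'm hoping for a more elegant solution.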
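I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data out of FactSet PA reports -> Excel -> R, right?). I understand that different reports have different formats, and that can be a pain.
For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires reading the file twice), but it works. I like reading the file twice to make sure the column types are correct and the headers are good. It's easy to mess up column imports, so I'd rather have my code read the file twice than clean up the columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see whether it would make this very easy, but I don't believe that's the case, though I could be mistaken.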
library(readxl)
library(stringr)
library(dplyr)

f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
  f_path <- file.choose()
}

# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets, "2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path, sheet = desired_sheet)

skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"  # adjust to match your header text
# Never search past the edge of what was actually read
max_cols_to_search <- min(10, ncol(temp_read))
max_rows_to_search <- min(10, nrow(temp_read))

# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip > max_cols_to_search) break
  cells <- as.character(temp_read[[col_skip]])[seq_len(max_rows_to_search)]
  skip_rows <- which(stringr::str_detect(cells, search_string)) - 0
}
if (length(skip_rows) == 0) {
  stop("Could not find '", search_string, "' in the search window")
}

# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows[1]  # use the first match if there are several
)

# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))
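The search step can also be pulled out into a small function that works on any data frame, which makes it easier to try out without a spreadsheet at hand. This is only a sketch: `find_header_cell` and the `toy` data are hypothetical names made up for illustration, and it matches the search string literally rather than as a regular expression.

```r
# Hypothetical helper (not part of readxl): find the first cell containing
# `pattern` within the top-left corner of a data frame, scanning column by column.
find_header_cell <- function(df, pattern, max_rows = 10, max_cols = 10) {
  n_rows <- min(max_rows, nrow(df))
  n_cols <- min(max_cols, ncol(df))
  for (j in seq_len(n_cols)) {
    cells <- as.character(df[[j]])[seq_len(n_rows)]
    # fixed = TRUE treats the pattern as a literal string
    hits <- which(!is.na(cells) & grepl(pattern, cells, fixed = TRUE))
    if (length(hits) > 0) {
      return(c(row = hits[1], col = j))
    }
  }
  NULL  # not found within the search window
}

# Toy data standing in for a sheet that was read without column names
toy <- data.frame(
  a = c("MY COMPANY PTY LTD", NA, NA, "Mar-14", "Apr-14"),
  b = c(NA, NA, "Monthly Returns", "0.907%", "0.069%"),
  stringsAsFactors = FALSE
)
cell <- find_header_cell(toy, "Monthly Returns")
```

If the first read is done with col_names = FALSE, then `cell[["row"]] - 1` would be the natural value to pass as skip on the second read. That assumes the sheet has no leading blank rows, since readxl skips those automatically and the row numbers would then be offset.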
OK, with the format specified as xls, I've updated this from csv to xls loading, as was correctly suggested.
library(readxl)
# col_names = FALSE keeps the first sheet row as data instead of using it as a header
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
You'll get something like this:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
Then you can dynamically filter for the "Monthly return" cell and identify your matrix.
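# Locate the header cell; arr.ind = TRUE returns its (row, col) position
targetCell <- which(data == "Monthly return", arr.ind = TRUE)
# Keep the rows below the header, for the date column and the returns column
returns <- data[(targetCell[1, "row"] + 1):nrow(data), (targetCell[1, "col"] - 1):targetCell[1, "col"]]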
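The lookup above assumes the header appears exactly once and has a column to its left. A slightly more defensive sketch checks both and also turns the percentage strings into numbers. The sample data here is retyped by hand as plain strings, and the output column names `date` and `monthly_return` are made up for illustration.

```r
# Sample data mirroring the structure shown above (strings, not factors)
data <- data.frame(
  V1 = c("MY COMPANY PTY LTD", "MC Pension Fund", "GROSS PERFROANCE DETAILS",
         "updated by JS on 6/4/2017", "", "Mar-14", "Apr 14"),
  V2 = c("", "", "", "", "Monthly return", "0.907%", "0.069%"),
  stringsAsFactors = FALSE
)

targetCell <- which(data == "Monthly return", arr.ind = TRUE)
# Fail early if the header is missing or appears more than once
stopifnot(nrow(targetCell) == 1)

hdr_row <- targetCell[1, "row"]
hdr_col <- targetCell[1, "col"]
# The date column sits to the left, and there must be data below the header
stopifnot(hdr_col > 1, hdr_row < nrow(data))
block <- data[(hdr_row + 1):nrow(data), c(hdr_col - 1, hdr_col)]

returns <- data.frame(
  date = as.character(block[[1]]),
  # "0.907%" -> 0.00907
  monthly_return = as.numeric(sub("%", "", as.character(block[[2]]), fixed = TRUE)) / 100,
  stringsAsFactors = FALSE
)
```

The dates are left as strings because the sample mixes formats ("Mar-14" and "Apr 14"); parsing them would need a rule that depends on what the real sheets contain.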