Import the newest csv file in a directory

Question

Import the newest csv file in a directory

hia*_*ics 5 csv directory import automation r

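Goal:
- Import the newest file (.csv) from a local directory into R

Goal details:
- A csv file is uploaded to a folder on my Mac every day. I want to include a function in my R script that automatically imports the newest file into my workspace for further analysis. The file is uploaded at about 4:30 am each day.
- I want this function to run in the morning (no earlier than 6 am, so there is plenty of leeway here).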

Input details:
- File type: .csv
- Naming convention: example file name: "28 Jul 2014 04:37:47 -0400.csv"
- Frequency: imported daily at ~04:30

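What I have tried:
- I know this looks like a feeble attempt, but I really don't know how to modify the function below.
- My thinking on paper is to "grab" the ID of the newest file, then paste() the directory name in front of it, and voilà! (Alas, my programming skills fall short of writing the code for this.)
- The code below is what I tried to run, but it just "hangs" and never finishes. I got it from an R forum, found here.

Code: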

lastChange = file.info(directory)$mtime 
while(TRUE){ 
  currentM = file.info(directory)$mtime 
  if(currentM != lastChange){ 
    lastChange = currentM 
    read.csv(directory) 
  } 
  # try again in 10 minutes 
  Sys.sleep(600) 
}

My environment:
- R 3.1
- Mac OS X 10.9.4 (Mavericks)

Thank you very much for any help! :-)

Answer 1

mzu*_*uba 8

A more efficient solution, using dplyr / magrittr:

pacman::p_load(magrittr)

path <- list.files(path = directory,
                   pattern = "csv$",
                   full.names = TRUE) %>%
  extract(which.max(file.mtime(.)))

... I like it because it is easy to see at a glance what the code does: list.files, then extract the one with the max file.mtime.

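Note that depending on whether magrittr masks dplyr or vice versa, you may need to use extract2 instead of extract. Qualifying the call as magrittr::extract also avoids the ambiguity.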
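
If the extract masking is a concern, the same idea can be written in base R with no extra packages. This is only a sketch: latest_csv is a hypothetical helper name, not something from the answer above, and it assumes the file's modification time reflects when it was uploaded.

```r
# Hypothetical helper: return the full path of the most recently modified
# .csv file in `directory`, or NA_character_ if there is none.
latest_csv <- function(directory) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  if (length(paths) == 0) return(NA_character_)
  # file.info()$mtime is the modification time of each file
  paths[which.max(file.info(paths)$mtime)]
}
```

It could then be used as read.csv(latest_csv(directory)).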

Answer 2

and*_*rew 5

-- readfile.R --

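# full.names = TRUE so the paths work from any working directory
files <- file.info(list.files(directory, pattern = "\\.csv$", full.names = TRUE))
# read the most recently modified file
read.csv(rownames(files)[which.max(files$mtime)])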

I would put the script above in a cron job that runs each morning, once that day's file has been written. The crontab below runs it at 8 am every day.

-- in crontab --

0 8 * * *  Rscript readfile.R

Read more about cron here.
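
Since cron may fire on a day when the upload failed or is late, the script could also check that the newest file is actually recent before reading it. A minimal sketch, assuming modification times are trustworthy; read_fresh_csv and max_age_hours are hypothetical names, and the 12-hour default is just an illustrative choice:

```r
# Hypothetical helper: read the newest .csv in `directory`, but only if it
# was modified within the last `max_age_hours` hours. Otherwise stop with an
# error, so the cron job fails visibly instead of re-reading an old file.
read_fresh_csv <- function(directory, max_age_hours = 12) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  if (length(paths) == 0) stop("no .csv files found in ", directory)
  mtimes <- file.info(paths)$mtime
  newest <- which.max(mtimes)
  # age of the newest file in hours, as a plain number
  age <- as.numeric(difftime(Sys.time(), mtimes[newest], units = "hours"))
  if (age > max_age_hours) {
    stop("newest file is ", round(age, 1), " hours old: ", paths[newest])
  }
  read.csv(paths[newest])
}
```

Run through Rscript, an uncaught stop() ends the script with a non-zero exit status, which makes a missed upload easier to notice.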

Answer 3

r2e*_*ans 1

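The following function uses a timestamp file to keep track of which files have already been processed. It can run continuously in an R instance (as you first suggested) or as a single-run instance, borrowing @andrew's suggestion of a cron job. (The cat() command is mainly there for testing; feel free to remove it.)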

processDir <- function(directory = '.', pattern = '\\.csv$', loop = FALSE, delay = 600,
                       stampFile = file.path(directory, '.csvProcessor')) {
    if (! file.exists(stampFile))
        file.create(stampFile)
    firstRun <- TRUE
    while (firstRun || loop) {
        firstRun <- FALSE
        stampTime <- file.info(stampFile)$mtime
        allFilesDF <- file.info(list.files(path = directory, pattern = pattern,
                                           full.names = TRUE, no.. = TRUE))
        unprocessedFiles <- allFilesDF[(! allFilesDF$isdir) &
                                       (allFilesDF$mtime > stampTime), ]
        if (nrow(unprocessedFiles)) {
            ## We need to update the timestamp on stampFile quickly so
            ## that files added while this is running will be found in the
            ## next loop.
            ## WARNING: this blindly truncates the stampFile.
            file.create(stampFile, showWarnings = FALSE)
            for (fn in rownames(unprocessedFiles)) {
                cat('Processing ', fn, '\n')
                ## read.csv(fn)
                ## ...
            }
        }
        if (loop) Sys.sleep(delay)
    }
}

To run it in a continuously running R instance, as you originally suggested, simply call:

processDir(loop = TRUE)

To use @andrew's suggestion of a cron job, append the following line after the function definition:

processDir()

...and use a crontab file similar to the following:

# crontab
0 8 * * * path/to/Rscript path/to/processDir.R

Hope this helps.
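
One more option, in case modification times are not reliable (for example, if the files get copied or synced after upload): the file names in the question already carry a timestamp, so the newest file can be chosen by parsing the name. A rough sketch, where newest_by_name is a hypothetical helper, the names are assumed to follow the "28 Jul 2014 04:37:47 -0400.csv" pattern exactly, and %b is assumed to run in a locale with English month abbreviations:

```r
# Hypothetical helper: pick the path whose file name encodes the latest
# timestamp. Names that do not parse are ignored; returns NA_character_
# if no name parses.
newest_by_name <- function(paths) {
  stamps <- sub("\\.csv$", "", basename(paths))
  # %z reads the numeric UTC offset, so all times are compared in UTC
  times <- as.POSIXct(strptime(stamps, "%d %b %Y %H:%M:%S %z", tz = "UTC"))
  if (all(is.na(times))) return(NA_character_)
  paths[which.max(times)]  # which.max() skips NA entries
}
```

It would be called on the output of list.files(directory, full.names = TRUE), and the result passed to read.csv().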

Archived:	11 years, 5 months ago
Views:	2714
Last activity:	5 years, 7 months ago