How does `lseek` help determine whether a file is empty?

Question

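I'm looking at the source code of `cat` from GNU coreutils, in particular at the circle detection. They compare device and inode, and that works fine, but there is an extra case: they allow the output to be the input if the input is empty. Looking at the code, this must be the `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size` part. I read the man pages and a discussion I found via `git blame`, but I still don't quite understand why the call to `lseek` is needed.

This is the gist of how `cat` detects whether it would exhaust the disk indefinitely (note that some error checks were removed for brevity; the full source code is linked above):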

struct stat stat_buf;
fstat(STDOUT_FILENO, &stat_buf);
out_dev = stat_buf.st_dev;
out_ino = stat_buf.st_ino;
out_isreg = S_ISREG (stat_buf.st_mode) != 0;

// ...
// for <infile> in inputs {
    input_desc = open (infile, file_open_mode); // or STDIN_FILENO
    fstat(input_desc, &stat_buf);
    /* Don't copy a nonempty regular file to itself, as that would
       merely exhaust the output device.  It's better to catch this
       error earlier rather than later.  */
    if (out_isreg 
        && stat_buf.st_dev == out_dev && stat_buf.st_ino == out_ino
        && lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size)         // <--- This is the important line
    {
      // ...
    }
// } (end of for)

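I have two possible explanations, but both seem somewhat odd.

1. According to some standard (POSIX), a file can be "empty" even though it still contains some information (counted by `st_size`), and `lseek` and/or `open` respect that information through some default offset. I don't know why that would be the case, because empty means empty, right?
2. The comparison really is a "clever" combination of two conditions. This made sense to me at first, because if `input_desc` were `STDIN_FILENO` and no file were piped into stdin, `lseek` would fail with `ESPIPE` (according to the man page) and return `-1`. The whole statement would then amount to `lseek(...) == -1 || stat_buf.st_size > 0`. But that can't be true, because this check only happens when device and inode are the same, and that only happens if a) stdin and stdout point to the same pty, but then `out_isreg` would be `false`, or b) stdin and stdout point to the same file, but then `lseek` can't return `-1`, right?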

I also wrote a small program that prints the return values and `errno` for the important parts, but nothing stood out:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  struct stat out_stat;
  struct stat in_stat;

  if (fstat(STDOUT_FILENO, &out_stat) < 0)
    exit(1);

  printf("this is written to stdout / into the file\n");

  /* Read from the file named on the command line, or from stdin. */
  int fd;
  if (argc > 1)
    fd = open(argv[1], O_RDONLY);
  else
    fd = STDIN_FILENO;

  if (fd < 0 || fstat(fd, &in_stat) < 0)
    exit(1);

  /* Clear errno so the value printed below reflects only this lseek call. */
  errno = 0;
  off_t res = lseek(fd, 0, SEEK_CUR);
  fprintf(stderr,
          "errno after lseek = %d, EBADF = %d, EINVAL = %d, EOVERFLOW = %d, "
          "ESPIPE = %d\n",
          errno, EBADF, EINVAL, EOVERFLOW, ESPIPE);

  /* off_t has no portable printf format, so cast to long long. */
  fprintf(stderr, "input:\n\tlseek(...) = %lld\n\tst_size = %lld\n",
          (long long)res, (long long)in_stat.st_size);

  printf("outsize is %lld", (long long)out_stat.st_size);
  return 0;
}

$ touch empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
        lseek(...) = 0
        st_size = 0
$ echo x > empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
        lseek(...) = 0
        st_size = 0

So my research didn't get at my ultimate question: how does `lseek` help determine whether the file is empty in this `cat` example?

Answer 1

roo*_*oot 3

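Here's my attempt at reverse engineering this. I couldn't find any public discussion explaining why `lseek()` was put there (nothing else in GNU coreutils does this).

The guiding question is: when is the condition `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size` false?

Test case: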

#!/bin/bash
# (edited based on comments)

set -x

# arrange for cat to start off past the end of a non-empty file

echo abcdefghi > /tmp/so/catseek/input
# get the shell to open the input file for reading & writing as file descriptor 7
exec 7<>/tmp/so/catseek/input
# read the whole file via that descriptor (but leave it open)
dd <&7
# ask linux what the current file position of file descriptor 7 is
# should be everything dd read, namely 10 bytes, the size of the file
grep ^pos: /proc/self/fdinfo/7
# run cat, with pre and post content so that we know how to locate the interesting part
# "-" will cause cat to reuse its file descriptor 0 rather than creating a new file descriptor
# the redirections tell the shell to redirect file descriptors 1 and 0 to/from our open file descriptor 7
# which, as you'll remember, already has a file position of 10 bytes
strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post <&7 >&7
# now let's see what's in the file
cat /tmp/so/catseek/input

with:

$ cat /tmp/so/catseek/pre
pre
$ cat /tmp/so/catseek/post
post

`cat` with `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size`:

+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 2.0641e-05 s, 484 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos:    10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
lseek(0, 0, SEEK_CUR)                   = 14
+++ exited with 0 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post

`cat` with `0 < stat_buf.st_size`:

+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 3.6415e-05 s, 275 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos:    10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
./src/cat: -: input file is output file
+++ exited with 1 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post

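As you can see, when `cat` starts, the file position may already be at or past the end of the file. Checking only the file size would make `cat` skip that file, but it would also trigger a failure, because the code inside the `if` statement is: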

error (0, 0, _("%s: input file is output file"), infile);
ok = false;
goto contin;

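Using `lseek()` allows `cat` to say "oh, the files are the same and not empty, BUT our reads will still come up empty, because that's how reading past EOF works, so we can allow this case".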
