How do I run sed on over 10 million files in a directory?

San*_*dro 17 bash find xargs

I have a directory containing 10,144,911 files. So far I have tried the following:

  • for f in `ls`; do sed -i -e 's/blah/blee/g' $f; done

Crashed my shell. (The ls is meant to be inside backticks, i.e. command substitution; I couldn't figure out how to make one show up in the post.)

  • ls | xargs -0 sed -i -e 's/blah/blee/g'

Too many args for sed

  • find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork: no more memory

Any other ideas on how to construct this kind of command? The files don't need to communicate with each other. ls | wc -l seems to work (very slowly), so it must be possible.
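For context on why those attempts fail: expanding the glob (or the `ls` output) onto a single command line runs into the kernel's per-exec argument limit, and `-exec ... \;` forks one sed per file. A rough sketch of checking that limit, and of a batch-friendly variant (assuming a find that supports `-exec ... {} +`, which POSIX requires; this is not from the original post):

# Not from the original post: a sketch of the limit the glob/ls attempts exceed.
getconf ARG_MAX    # per-exec argument-space limit, in bytes

# '-exec ... {} +' packs as many file names as fit into each sed invocation,
# instead of forking one sed per file as '\;' does.
find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} +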

Den*_*son 20

Try this:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

It will pass only one file name to each invocation of sed, which takes care of the "too many args for sed" problem. The -P option should allow multiple processes to be forked at once. If 0 doesn't work (it is supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the count.

  • Possibly, it needs to be `find . -name \*.txt -print0` to avoid having the shell expand the glob and try to allocate space for 10 million arguments to *find*. (3 upvotes)
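If forking one sed per file turns out to be the bottleneck, a middle-ground variation on the command above (my own sketch, not part of the original answer) is to let xargs hand each sed a batch of names while still running several in parallel; -n and -P are standard GNU xargs options:

# Sketch only: roughly 1000 file names per sed invocation, 4 seds in parallel.
# Tune -n and -P for your core count and disk.
find . -name '*.txt' -print0 | xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g'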

Pet*_*r.O 8

I have tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I have now included a quad-core run of the 'find |xargs' method (still with no 'sed'; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done  
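If the array approach were used for the real job rather than this timing test, the same loop could hand sed slices of the array so that each invocation gets a manageable batch. A rough sketch (mine, not part of the benchmark; it assumes the `names` array built in Step 1 and GNU sed for -i):

# Sketch: process the pre-built 'names' array in chunks of 1000 per sed call.
batch=1000
for (( ix=0, cnt=${#names[@]}; ix<cnt; ix+=batch )); do
    sed -i -e 's/blah/blee/g' "${names[@]:ix:batch}"
done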

Here is a summary of how the supplied answers fared when run against the test data described above. These results cover only the basic overheads; that is, 'sed' was not actually called. The sed step will almost certainly be the most time-consuming part, but I thought it would be interesting to see how the bare methods compare.

Dennis's 'find |xargs' method, using a single core, took *4 hours 21 minutes* longer than the bash array method on a no-sed run... However, the multi-core advantage offered by 'find' should outweigh the time difference shown here once sed is actually being called to process the files...

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet
                               |       It started processing almost immediately (because of xargs I suppose),  
                               |       but it runs **significantly slower** than the only other working answer  
                               |       (again, probably because of xargs), but if the multi-core feature works
                               |       (and I would think that it does), then it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
              cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 Gib
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+----------------------------------------------------- 