How do I run sed on over 10 million files in a directory?

San*_*dro 17 bash find xargs

I have a directory containing 10,144,911 files. So far I have tried the following:

  • for f in `ls`; do sed -i -e 's/blah/blee/g' $f; done

Crashed my shell. (The ls is meant to be inside backticks, i.e. command substitution; I couldn't figure out how to make one show up in the post.)

  • ls | xargs -0 sed -i -e 's/blah/blee/g'

Too many args for sed

  • find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork: no more memory

Any other ideas on how to construct this kind of command? The files don't need to communicate with each other. ls | wc -l seems to work (very slowly), so it must be possible.
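For context on why those attempts fail: expanding the glob (or the `ls` output) onto a single command line runs into the kernel's per-exec argument limit, and `-exec ... \;` forks one sed per file. A rough sketch of checking that limit, and of a batch-friendly variant (assuming a find that supports `-exec ... {} +`, which POSIX requires; this is not from the original post):

# Not from the original post: a sketch of the limit the glob/ls attempts exceed.
getconf ARG_MAX    # per-exec argument-space limit, in bytes

# '-exec ... {} +' packs as many file names as fit into each sed invocation,
# instead of forking one sed per file as '\;' does.
find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} +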

Den*_*son 20

Try this:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

It will pass only one file name to each invocation of sed, which takes care of the "too many args for sed" problem. The -P option should allow multiple processes to be forked at once. If 0 doesn't work (it is supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the count.

  • Possibly, it needs to be `find . -name \*.txt -print0` to avoid having the shell expand the glob and try to allocate space for 10 million arguments to *find*. (3 upvotes)
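If forking one sed per file turns out to be the bottleneck, a middle-ground variation on the command above (my own sketch, not part of the original answer) is to let xargs hand each sed a batch of names while still running several in parallel; -n and -P are standard GNU xargs options:

# Sketch only: roughly 1000 file names per sed invocation, 4 seds in parallel.
# Tune -n and -P for your core count and disk.
find . -name '*.txt' -print0 | xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g'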

Pet*_*r.O 8

I have tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I have now included a quad-core run of the 'find |xargs' method (still with no 'sed'; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done  
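If the array approach were used for the real job rather than this timing test, the same loop could hand sed slices of the array so that each invocation gets a manageable batch. A rough sketch (mine, not part of the benchmark; it assumes the `names` array built in Step 1 and GNU sed for -i):

# Sketch: process the pre-built 'names' array in chunks of 1000 per sed call.
batch=1000
for (( ix=0, cnt=${#names[@]}; ix<cnt; ix+=batch )); do
    sed -i -e 's/blah/blee/g' "${names[@]:ix:batch}"
done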

Here is a summary of how the supplied answers fared when run against the test data described above. These results cover only the basic overheads; that is, 'sed' was not actually called. The sed step will almost certainly be the most time-consuming part, but I thought it would be interesting to see how the bare methods compare.

Dennis's 'find |xargs' method, using a single core, took *4 hours 21 minutes* longer than the bash array method on a no-sed run... However, the multi-core advantage offered by 'find' should outweigh the time difference shown here once sed is actually being called to process the files...

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet
                               |       It started processing almost immediately (because of xargs I suppose),  
                               |       but it runs **significantly slower** than the only other working answer  
                               |       (again, probably because of xargs), but if the multi-core feature works
                               |       (and I would think that it does), then it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
              cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 Gib
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+----------------------------------------------------- 