Is piping, shifting, or parameter expansion more efficient?

Lev*_*ike 27 performance pipe shell-script cut

I'm trying to find the most efficient way to iterate through certain values that are a consistent number of positions away from each other in a space-separated list of words (I don't want to use an array). For example,

list="1 ant bat 5 cat dingo 6 emu fish 9 gecko hare 15 i j"

So I want to be able to just iterate through list and only access 1, 5, 6, 9 and 15.

EDIT: I should have made it clear that the values I'm trying to get from the list don't have to be different in format from the rest of the list. What makes them special is solely their position in the list (in this case, positions 1, 4, 7, ...). So the list could be 1 2 3 5 9 8 6 90 84 9 3 2 15 75 55 but I'd still want the same numbers. I also want to be able to do it without knowing the length of the list.

The methods I've thought of so far are:

Method 1

set $list
found=false
find=9
count=1
while [ $count -le $# ]; do
    if [ "${@:count:1}" -eq $find ]; then
        found=true
        break
    fi
    count=`expr $count + 3`
done

Method 2

set $list
found=false
find=9
while [ $# -ne 0 ]; do
    if [ $1 -eq $find ]; then
        found=true
        break
    fi
    shift 3
done

Method 3

I'm pretty sure piping makes this the worst option, but out of curiosity I was trying to find a method that doesn't use set.

found=false
find=9
count=1
num=`echo $list | cut -d ' ' -f$count`
while [ -n "$num" ]; do
    if [ $num -eq $find ]; then
        found=true
        break
    fi
    count=`expr $count + 3`
    num=`echo $list | cut -d ' ' -f$count`
done

So what would be most efficient, or am I missing a simpler method?

ilk*_*chu 36

  • First rule of software optimization: Don't.

    Until you know the speed of the program is an issue, there's no need to think about how fast it is. If your list is about that length or just ~100-1000 items long, you probably won't even notice how long it takes. There's a chance you're spending more time thinking about the optimization than what the difference would be.

  • Second rule: Measure.

    That's the sure way to find out, and the one that gives answers for your system. Especially with shells, there are so many, and they aren't all identical. An answer for one shell might not apply for yours.

    In larger programs, profiling goes here too. The slowest part might not be the one you think it is.

  • Third, the first rule of shell script optimization: Don't use the shell.

    Yeah, really. Many shells aren't made to be fast (since launching external programs doesn't have to be), and they might even parse the lines of the source code again each time.

    Use something like awk or Perl instead. In a trivial micro-benchmark I did, awk was dozens of times faster than any common shell in running a simple loop (without I/O).

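    However, if you do use the shell, use the shell's builtin functions instead of external commands. Here, you're using expr, which isn't builtin in any shell I found on my system, but which can be replaced with standard arithmetic expansion. E.g. i=$((i+1)) instead of i=$(expr $i + 1) to increment i. Your use of cut in the last example might also be replaceable with standard parameter expansions.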

    See also: Why is using a shell loop to process text considered bad practice?
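
As a minimal sketch of the builtin replacements mentioned in the third point, the snippet below does the same increment and the same first-word extraction both ways. The variable names (count_expr, first, first_cut) are illustrative only and do not come from the question.

```shell
# Builtin arithmetic expansion: handled by the shell itself.
count=1
count=$((count + 3))

# Same result through the external expr utility, which starts a new process.
count_expr=1
count_expr=$(expr "$count_expr" + 3)

# First word of a space-separated list via parameter expansion
# (removes the longest suffix that starts with a space).
list="1 ant bat 5 cat dingo"
first=${list%% *}

# Same word via a pipeline into cut, which needs a subshell and an external process.
first_cut=$(printf '%s\n' "$list" | cut -d ' ' -f 1)

echo "$count $count_expr $first $first_cut"   # prints: 4 4 1 1
```

Both pairs give identical results; whether the difference in cost matters is something only a measurement on your own system can tell.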

Steps #1 and #2 should apply to your question.

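  • #0, quote your expansions :-) (12 upvotes)
  • It's not that `awk` loops are necessarily better or worse than shell loops. It's that the shell is really good at _running commands_ and at directing input and output to and from processes, and frankly rather clunky at everything else; while tools like `awk` are fantastic at processing text data, because that is what shells and tools like `awk` (respectively) were made for in the first place. (8 upvotes)
  • @DopeGhoti, shells do seem to be objectively slower, though. Some very simple while loops seem to be more than 25 times slower in `dash` than in `gawk`, and `dash` was the fastest shell I tested... (2 upvotes)
  • Regarding *"more time thinking"*: this neglects an important factor. How often does it run, and for how many users? If a program wastes 1 second, and the programmer could fix that by thinking for 30 minutes, the thinking may be a waste of time if only one user will ever run it once. On the other hand, with a million users that is a million seconds, or about 11 days of user time. If the code wastes a minute for each of a million users, that is roughly **2 years** of user time. (2 upvotes)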

Dop*_*oti 18

Pretty simple with awk. This will get you the value of every third field (fields 1, 4, 7, ...) for input of any length:

$ awk -F' ' '{for( i=1;i<=NF;i+=3) { printf( "%s%s", $i, OFS ) }; printf( "\n" ) }' <<< $list
1 5 6 9 15

This works by leveraging built-in awk variables such as NF (the number of fields in the record), and using a simple for loop to iterate along the fields and pick out the ones you want, without needing to know ahead of time how many there will be.
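
The question's methods actually search for one value (find=9) and set a found flag, rather than printing the fields. A rough sketch of that same search with the NF loop could look like the following; the found/find names are taken from the question, and passing the target with -v and reporting the result through awk's exit status are choices made for this illustration.

```shell
list="1 ant bat 5 cat dingo 6 emu fish 9 gecko hare 15 i j"
find=9

# awk exits with status 0 if $find appears at field 1, 4, 7, ...; 1 otherwise.
if printf '%s\n' "$list" | awk -v find="$find" '
    { for (i = 1; i <= NF; i += 3) if ($i == find) { found = 1; exit } }
    END { exit !found }'
then
    found=true
else
    found=false
fi

echo "$found"   # prints: true
```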

Or, if you do indeed just want those specific fields as specified in your example:

$ awk -F' ' '{ print $1, $4, $7, $10, $13 }' <<< $list
1 5 6 9 15

As for the question about efficiency, the simplest route would be to test this or each of your other methods and use time to show how long it takes; you could also use tools like strace to see how the system calls flow. Usage of time looks like:

$ time ./script.sh

real    0m0.025s
user    0m0.004s
sys     0m0.008s

You can compare that output between varying methods to see which is the most efficient in terms of time; other tools can be used for other efficiency metrics.
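
A single run over a list this short finishes in a few milliseconds, which is hard to compare reliably. One possible approach, sketched here under assumptions, is to wrap a candidate in a function and repeat it many times before timing the whole script. The function name search_shift, the counters, and the repeat count of 1000 are all made up for this example.

```shell
# Hypothetical harness: repeat one candidate so the total run time is measurable.
list="1 ant bat 5 cat dingo 6 emu fish 9 gecko hare 15 i j"

# Candidate under test: look for $1 at positions 1, 4, 7, ... of $list.
# Assumes the list elements contain no whitespace or glob characters.
search_shift() {
    target=$1
    set -- $list                      # split the list into positional parameters
    while [ "$#" -gt 0 ]; do
        [ "$1" = "$target" ] && return 0
        # Never ask shift for more parameters than remain.
        if [ "$#" -gt 3 ]; then shift 3; else set --; fi
    done
    return 1
}

runs=0
hits=0
while [ "$runs" -lt 1000 ]; do
    if search_shift 9; then hits=$((hits + 1)); fi
    runs=$((runs + 1))
done
echo "runs=$runs hits=$hits"   # prints: runs=1000 hits=1000
```

Saved under a name such as bench.sh (hypothetical), it can be timed with `time sh bench.sh`; swapping in another candidate function and timing again gives a like-for-like comparison on your own shell and system.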

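  • @DopeGhoti Actually it does. `<<<` adds a newline at the end. This is similar to how `$()` removes a newline from the end. That is because lines are terminated by newlines. `<<<` feeds an expression in as a line, so it must be terminated by a newline. `"$()"` takes lines and provides them as an argument, so it makes sense to convert by removing the terminating newline. (5 upvotes)
  • @LeviUzodike awk is a much underappreciated tool. It makes all sorts of seemingly complex problems easy to solve. Especially when you are trying to write a complex regex for something like sed, you can often save hours by instead writing it procedurally in awk. Learning it will pay large dividends. (3 upvotes)
  • @LeviUzodike Regarding `echo` vs `<<<`, "identical" would be too strong a word. You could say that `stuff <<< "$list"` is nearly identical to `printf "%s\n" "$list" | stuff`. Regarding `echo` vs `printf`, I direct you to [this answer](https://unix.stackexchange.com/a/65819/189744) (2 upvotes)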

Gil*_*il' 14

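I'll only give some general advice in this answer, not benchmarks. Benchmarks are the only way to reliably answer questions about performance. But since you don't say how much data you're manipulating or how often you perform this operation, there's no way to do a useful benchmark. What's more efficient for 10 items and what's more efficient for 1000000 items is often not the same.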

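As a general rule of thumb, invoking external commands is more expensive than doing something with pure shell constructs, as long as the pure shell code doesn't involve a loop. On the other hand, a shell loop that iterates over a large string or a large number of strings is likely to be slower than one invocation of a special-purpose tool. For example, your loop invoking cut could well be noticeably slow in practice, but if you find a way to do the whole thing with a single cut invocation, that's likely to be faster than doing the same thing with string manipulation in the shell.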

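Do note that the cutoff point can vary a lot between systems. It can depend on the kernel, on how the kernel's scheduler is configured, on the filesystem containing the external executables, on how much CPU vs memory pressure there is at the moment, and on many other factors.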
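
To illustrate what "a single cut invocation" might look like when the list length isn't known in advance, here is a rough sketch that first builds the field list 1,4,7,... in the shell and then calls cut once. The variable names fields and values are invented for the example, and it assumes the elements are separated by single spaces and contain no glob characters.

```shell
list="1 ant bat 5 cat dingo 6 emu fish 9 gecko hare 15 i j"

# Count the words by splitting into positional parameters
# (this overwrites the script's own "$@").
set -- $list

# Build the comma-separated field list 1,4,7,... up to the number of words.
fields=1
i=4
while [ "$i" -le "$#" ]; do
    fields="$fields,$i"
    i=$((i + 3))
done

# One cut process extracts every wanted field at once.
values=$(printf '%s\n' "$list" | cut -d ' ' -f "$fields")
echo "$values"   # prints: 1 5 6 9 15
```

The loop here only does builtin arithmetic and string concatenation, so it starts no processes; whether this beats a pure-shell approach for a given list size is still something to measure.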

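Don't call expr to perform arithmetic if you're at all concerned about performance. In fact, don't call expr to perform arithmetic at all. Shells have built-in arithmetic, which is clearer and faster than invoking expr.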

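You seem to be using bash, since you're using bash constructs that don't exist in sh. So why on earth would you not use an array? An array is the most natural solution, and it's likely to be the fastest, too. Note that array indices start at 0.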

list=(1 2 3 5 9 8 6 90 84 9 3 2 15 75 55)
for ((count = 0; count < ${#list[@]}; count += 3)); do
  echo "${list[$count]}"
done

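If you use sh, your script may be faster if your system has dash or ksh as sh rather than bash. With sh you don't get named arrays, but you still get one array, that of the positional parameters, which you can set with set. To access an element at a position that is not known until runtime, you need to use eval (take care to quote things properly!).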

# List elements must not contain whitespace or ?*\[
list='1 2 3 5 9 8 6 90 84 9 3 2 15 75 55'
set $list
count=1
while [ $count -le $# ]; do
  eval "value=\${$count}"
  echo "$value"
  count=$((count+1))
done

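If you only ever want to access the array once, going from left to right (skipping some values), you can use shift instead of variable indices.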

# List elements must not contain whitespace or ?*\[
list='1 2 3 5 9 8 6 90 84 9 3 2 15 75 55'
set $list
while [ $# -ge 1 ]; do
  echo "$1"
  shift && shift && shift
done

Which approach is faster depends on the shell and on the number of elements.

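Another possibility is to use string processing. It has the advantage of not using the positional parameters, so you can use them for something else. It'll be slower for large amounts of data, but that's unlikely to make a noticeable difference for small amounts of data.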

# List elements must be separated by a single space (not arbitrary whitespace)
list='1 2 3 5 9 8 6 90 84 9 3 2 15 75 55'
while [ -n "$list" ]; do
  echo "${list%% *}"
  case "$list" in *\ *\ *\ *) :;; *) break;; esac
  list="${list#* * * }"
done

  • @Joe Actually, no. `shift 3` would fail if there were too few remaining arguments. You'd need something like `if [ $# -gt 3 ]; then shift 3; else set --; fi` (2 upvotes)