外推 - 以awk为基础

use*_*360 6 linux shell awk

我需要以下帮助:我有一个数据文件(以"\ t"表格分​​隔的列),就像这样 data.dat

    # y1    y2      y3      y4
    17.1685 21.6875 20.2393 26.3158 
Run Code Online (Sandbox Code Playgroud)

对于线性拟合,这些是4个点的x值.四个y值是常数:0, 200, 400, 600.

我可以创建点对的线性拟合(x,y):(x1,y1)=(17.1685,0), (x2,y2)=(21.6875,200), (x3,y3)=(20.2393,400), (x4,y4)=(26.3158,600).

现在我想对其中的三个巴黎进行线性拟合, (x1,y1), (x2,y2), (x3,y3) and (x2,y2), (x3,y3), (x4,y4) and (x1,y1), (x3,y3), (x4,y4) and (x1,y1), (x2,y2), (x4,y4).

如果我有三个线性拟合点,我想知道外推点的x值在这三个拟合点之外的值.

我到目前为止这个awk代码:

#!/usr/bin/awk -f

BEGIN{
  z[1] = 0;
  z[2] = 200;
  z[3] = 400;
  z[4] = 600;
}

{
  split($0,str,"\t");
  n = 0.0;

  for(i=1; i<=NF; i++)
  {
    centr[i] = str[i];
    n += 1.0;
  # printf("%d\t%f\t%.1f\t",i,centr[i],z[i]);
  }
  # print "";

  if (n > 2)
  {
    lsq(n,z,centr);
  }
}

function lsq(n,x,y)
{
  sx  = 0.0
  sy  = 0.0
  sxx = 0.0
  syy = 0.0
  sxy = 0.0
  eps = 0.0

  for (i=1;i<=n;i++)
  {
    sx  += x[i]
    sy  += y[i]
    sxx += x[i]*x[i]
    sxy += x[i]*y[i]
    syy += y[i]*y[i]
  }

  if ( (n==0) || ((n*sxx-sx*sx)==0) )
  {
    next;
  }
#   print "number of data points = " n;
  a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
  b = (n*sxy-sx*sy)/(n*sxx-sx*sx)

  for(i=1;i<=n;i++)
  {
    ycalc[i] = a+b*x[i]
    dy[i]    = y[i]-ycalc[i]
    eps     += dy[i]*dy[i]
  }

  print "# Intercept =\t"a"
  print "# Slope     =\t"b"

  for (i=1;i<=n;i++)
  {
    printf("%8g %8g %8g \n",x[i],y[i],ycalc[i])
  }

} # function lsq()
Run Code Online (Sandbox Code Playgroud)

所以,

    If we extrapolate to the place of 4th 
    0   17.1685   <--(x1,y1)
    200 21.6875   <--(x2,y2)
    400 20.2393   <--(x3,y3)
    600 22.7692 <<< (x4 = 600,y1 = 22.7692)

    If we extrapolate to the place of 3th
    0   17.1685   <--(x1,y1)
    200 21.6875   <--(x2,y2)
    400 23.6867 <<< (x3 = 400,y3 = 23.6867)
    600 26.3158   <--(x4,y4)

    0   17.1685
    200 19.35266 <<<
    400 20.2393
    600 26.3158

    0   18.1192 <<<
    200 21.6875
    400 20.2393
    600 26.3158
Run Code Online (Sandbox Code Playgroud)

我目前的输出如下:

$> ./prog.awk data.dat
# Intercept =   17.4537
# Slope     =   0.0129968   
       0  17.1685  17.4537 
     200  21.6875  20.0531 
     400  20.2393  22.6525 
     600  26.3158  25.2518 
Run Code Online (Sandbox Code Playgroud)

Jon*_*ler 4

假设函数中的核心计算lsq没问题(它看起来是正确的,但我还没有仔细检查过),那么这将为您提供最适合输入数据集的最小平方和线的斜率和截距(参数 x ,y,n)。我不确定我是否理解该函数的尾部。

对于“取三个点并计算第四个”问题,最简单的方法是生成 4 个子集(从逻辑上讲,通过在四个调用中的每一个调用中从四个点中删除一个点),然后重做计算。

您需要调用另一个函数,该函数从lsq另一个 y 值获取线数据(斜率、截距)并内插(外推)该值。这是一个直接的计算 ( x = m * y + c),但您需要确定y您传入的 3 个值中缺少哪个值。

您可以通过从“平方和”、“总和”和“乘积之和”值中一次删除一个值,重新计算斜率、截距,然后计算缺失值来“优化”(意思是“复杂化”)该方案再次指点。

(我还会观察到,通常情况下,x 坐标的固定值为 0、200、400、600,y 坐标为读取的值。但是,这只是方向问题,因此不是至关重要的。)


这是至少看似合理的工作代码。由于awk自动在空白处进行分割,因此无需专门在制表符上进行分割;读取循环考虑到了这一点。

代码需要认真重构;其中有大量的重复——然而,我也有一份我应该做的工作。

#!/usr/bin/awk -f
BEGIN{
  z[1] = 0;
  z[2] = 200;
  z[3] = 400;
  z[4] = 600;
}

{
  for (i = 1; i <= NF; i++)
  {
    centr[i] = $i
  }

  if (NF > 2)
  {
    lsq(NF, z, centr);
  }
}

function lsq(n, x, y)
{
  if (n == 0) return

  sx  = 0.0
  sy  = 0.0
  sxx = 0.0
  syy = 0.0
  sxy = 0.0

  for (i = 1; i <= n; i++)
  {
    print "x[" i "] = " x[i] ", y[" i "] = " y[i]
    sx  += x[i]
    sy  += y[i]
    sxx += x[i]*x[i]
    sxy += x[i]*y[i]
    syy += y[i]*y[i]
  }

  if ((n*sxx - sx*sx) == 0) return

#   print "number of data points = " n;
  a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
  b = (n*sxy-sx*sy)/(n*sxx-sx*sx)

  for (i = 1; i <= n; i++)
  {
    ycalc[i] = a+b*x[i]
  }

  print "# Intercept = " a
  print "# Slope     = " b
  print "Line: x = " a " + " b " * y"

  for (i = 1; i <= n; i++)
  {
    printf("x = %8g, yo = %8g, yc = %8g\n", x[i], y[i], ycalc[i])
  }

  print ""
  print "Different subsets\n"

  for (drop = 1; drop <= n; drop++)
  {
    print "Subset " drop
    sx = sy = sxx = sxy = syy = 0
    j = 1
    for (i = 1; i <= n; i++)
    {
      if (i == drop) continue
      print "x[" j "] = " x[i] ", y[" j "] = " y[i]
      sx  += x[i]
      sy  += y[i]
      sxx += x[i]*x[i]
      sxy += x[i]*y[i]
      syy += y[i]*y[i]
      j++
    }
    if (((n-1)*sxx - sx*sx) == 0) continue
    a = (sxx*sy-sxy*sx)/((n-1)*sxx-sx*sx)
    b = ((n-1)*sxy-sx*sy)/((n-1)*sxx-sx*sx)
    print "Line: x = " a " + " b " * y"

    xt = x[drop]
    yt = a + b * xt;
    print "Interpolate: x = " xt ", y = " yt
  }
}
Run Code Online (Sandbox Code Playgroud)

由于awk它没有提供从函数传回多个值的简单方法,也没有提供数组以外的结构(有时是关联的),因此它可能不是完成此任务的最佳语言。另一方面,它可以完成这项工作。您也许可以将最小二乘计算捆绑在一个函数中,该函数返回包含斜率和截距的数组,然后使用它。轮到你探索选项了。

给定显示的脚本lsq.awk和输入文件lsq.data,我得到显示的输出:

$ cat lsq.data
17.1685 21.6875 20.2393 26.3158
$ awk -f lsq.awk lsq.data
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
x[4] = 600, y[4] = 26.3158
# Intercept = 17.4537
# Slope     = 0.0129968
Line: x = 17.4537 + 0.0129968 * y
x =        0, yo =  17.1685, yc =  17.4537
x =      200, yo =  21.6875, yc =  20.0531
x =      400, yo =  20.2393, yc =  22.6525
x =      600, yo =  26.3158, yc =  25.2518

Different subsets

Subset 1
x[1] = 200, y[1] = 21.6875
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 18.1192 + 0.0115708 * y
Interpolate: x = 0, y = 18.1192
Subset 2
x[1] = 0, y[1] = 17.1685
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 16.5198 + 0.0141643 * y
Interpolate: x = 200, y = 19.3526
Subset 3
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 600, y[3] = 26.3158
Line: x = 17.7985 + 0.0147205 * y
Interpolate: x = 400, y = 23.6867
Subset 4
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
Line: x = 18.163 + 0.007677 * y
Interpolate: x = 600, y = 22.7692
$
Run Code Online (Sandbox Code Playgroud)

编辑:在答案的先前版本中,子集乘以而n不是(n-1)。修改后的输出中的值似乎与您的期望一致。剩下的问题是表现性的,而不是计算性的。