Simple GLSL convolution shader is atrociously slow

Question

use*_*704 20 opengl-es glsl filter convolution opengl-es-2.0

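I'm trying to implement a 2D outline shader in OpenGL ES 2.0 on iOS. It is insanely slow, as in 5 fps slow. I've tracked the problem down to the texture2D() calls, but without those no convolution shader is possible. I tried using lowp instead of mediump; that gains another 5 fps, but everything renders black, and it is still unusable.

Here is my fragment shader.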

    varying mediump vec4 colorVarying;
    varying mediump vec2 texCoord;

    uniform bool enableTexture;
    uniform sampler2D texture;

    uniform mediump float k;

    void main() {

        const mediump float step_w = 3.0/128.0;
        const mediump float step_h = 3.0/128.0;
        const mediump vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
        const mediump vec4 one = vec4(1.0, 1.0, 1.0, 1.0);

        mediump vec2 offset[9];
        mediump float kernel[9];
        offset[0] = vec2(-step_w, step_h);
        offset[1] = vec2(-step_w, 0.0);
        offset[2] = vec2(-step_w, -step_h);
        offset[3] = vec2(0.0, step_h);
        offset[4] = vec2(0.0, 0.0);
        offset[5] = vec2(0.0, -step_h);
        offset[6] = vec2(step_w, step_h);
        offset[7] = vec2(step_w, 0.0);
        offset[8] = vec2(step_w, -step_h);

        kernel[0] = kernel[2] = kernel[6] = kernel[8] = 1.0/k;
        kernel[1] = kernel[3] = kernel[5] = kernel[7] = 2.0/k;
        kernel[4] = -16.0/k;  

        if (enableTexture) {
            mediump vec4 sum = vec4(0.0);
            for (int i=0;i<9;i++) {
                mediump vec4 tmp = texture2D(texture, texCoord + offset[i]);
                sum += tmp * kernel[i];
            }

            gl_FragColor = (sum * b) + ((one-sum) * texture2D(texture, texCoord));
        } else {
            gl_FragColor = colorVarying;
        }
    }

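This is unoptimized and not finalized, but I need to improve performance before I continue. I tried replacing the texture2D() call in the loop with a solid vec4, and then it runs without a problem, despite everything else that is going on.

How can I optimize this? I know it is possible, because I've seen far more involved 3D effects run without trouble. I can't see why this would cause any problem at all.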

Answer 1

Bra*_*son 46

I've done this exact thing myself, and I see several things that could be optimized here.

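First, I'd remove the enableTexture conditional and split the shader into two programs, one for the true state and one for the false state. Conditionals are very expensive in iOS fragment shaders, particularly ones that contain texture reads.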
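As a minimal sketch of that split, the fragment shader of the untextured program could be reduced to the else branch of the original. It reuses the colorVarying name from the question; the host code is assumed to bind this program whenever enableTexture would have been false, and the textured program otherwise.

```glsl
// Fragment shader for the untextured program (sketch).
// No sampler and no branch: every fragment takes the same path.
varying mediump vec4 colorVarying;

void main() {
    gl_FragColor = colorVarying;
}
```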

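Second, you have nine dependent texture reads here. These are texture reads where the texture coordinates are calculated in the fragment shader. Dependent texture reads are very expensive on the PowerVR GPUs in iOS devices, because they prevent the hardware from optimizing texture reads through caching and similar techniques. Because you are sampling at fixed offsets for the 8 surrounding pixels and the central one, these calculations should be moved up into the vertex shader. This also means the calculations are no longer performed for every pixel, just once per vertex, and hardware interpolation handles the rest.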

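Third, for() loops have not been handled all that well by the iOS shader compiler to date, so I tend to avoid them where I can.

As I mentioned, I've written convolution shaders like this in my open source iOS GPUImage framework. For a generic convolution filter, I use the following vertex shader: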

 attribute vec4 position;
 attribute vec4 inputTextureCoordinate;

 uniform highp float texelWidth; 
 uniform highp float texelHeight; 

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     gl_Position = position;

     vec2 widthStep = vec2(texelWidth, 0.0);
     vec2 heightStep = vec2(0.0, texelHeight);
     vec2 widthHeightStep = vec2(texelWidth, texelHeight);
     vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight);

     textureCoordinate = inputTextureCoordinate.xy;
     leftTextureCoordinate = inputTextureCoordinate.xy - widthStep;
     rightTextureCoordinate = inputTextureCoordinate.xy + widthStep;

     topTextureCoordinate = inputTextureCoordinate.xy - heightStep;
     topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep;
     topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep;

     bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep;
     bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep;
     bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep;
 }

and the following fragment shader:

 precision highp float;

 uniform sampler2D inputImageTexture;

 uniform mediump mat3 convolutionMatrix;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     mediump vec4 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate);
     mediump vec4 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate);
     mediump vec4 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate);
     mediump vec4 centerColor = texture2D(inputImageTexture, textureCoordinate);
     mediump vec4 leftColor = texture2D(inputImageTexture, leftTextureCoordinate);
     mediump vec4 rightColor = texture2D(inputImageTexture, rightTextureCoordinate);
     mediump vec4 topColor = texture2D(inputImageTexture, topTextureCoordinate);
     mediump vec4 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate);
     mediump vec4 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate);

     mediump vec4 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2];
     resultColor += leftColor * convolutionMatrix[1][0] + centerColor * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2];
     resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2];

     gl_FragColor = resultColor;
 }

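The texelWidth and texelHeight uniforms are the inverse of the width and height of the input image, and the convolutionMatrix uniform specifies the weights for the various samples in the convolution.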
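To make the mapping concrete, here is a sketch of how the kernel from the question (before its division by k) could be laid out for convolutionMatrix. The constant name outlineKernel is hypothetical; in practice the nine values would more likely be uploaded to the uniform from the host, in the same column-major order.

```glsl
// Sketch: the question's 3x3 kernel written as a constant, before dividing by k.
// mat3(...) fills column by column, so outlineKernel[0] holds the first three
// values, which the fragment shader above applies to topLeft, top and topRight.
const mediump mat3 outlineKernel = mat3(
    1.0,   2.0, 1.0,   // outlineKernel[0]: topLeft, top, topRight
    2.0, -16.0, 2.0,   // outlineKernel[1]: left, center, right
    1.0,   2.0, 1.0);  // outlineKernel[2]: bottomLeft, bottom, bottomRight
```

Because this particular kernel is symmetric, transposing it changes nothing, but for an asymmetric kernel the ordering matters.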

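On an iPhone 4, this runs in 4-8 ms for a 640x480 frame of camera video, which is fast enough for 60 FPS rendering at that image size. If you only need something like edge detection, you can simplify the above: convert the image to luminance in a pre-pass, then sample from only one color channel. That is even faster, at about 2 ms per frame on the same device.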
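A luminance pre-pass of that kind might look like the sketch below. It reuses the inputImageTexture and textureCoordinate names from the shaders above. The Rec. 709 luma weights and the lumaWeights name are assumptions chosen for illustration; the answer does not say which weights it uses.

```glsl
// Pre-pass fragment shader (sketch): write luminance into every color channel.
precision mediump float;

uniform sampler2D inputImageTexture;
varying vec2 textureCoordinate;

// Rec. 709 luma weights; an assumed choice, not taken from the answer.
const vec3 lumaWeights = vec3(0.2126, 0.7152, 0.0722);

void main() {
    vec4 color = texture2D(inputImageTexture, textureCoordinate);
    float luminance = dot(color.rgb, lumaWeights);
    gl_FragColor = vec4(vec3(luminance), color.a);
}
```

The edge-detection pass can then read a single channel per sample, for example texture2D(inputImageTexture, leftTextureCoordinate).r, and combine floats instead of vec4 values.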

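@StevenLu - Once you go beyond about 9 texture reads in a single pass on many of these GPUs, there is a surprisingly sharp drop-off in performance. Splitting this into two passes can have a nonlinear effect on performance compared with the number of samples in a single pass. I've tested this, and running it in a single pass is much faster than separating the kernel, even for this few samples. (2 upvotes)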

Answer 2

小智 6

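The only way I know of to reduce the time this shader takes is to reduce the number of texture fetches. Since the shader samples the texture at equally spaced points around the center pixel and combines them linearly, you can reduce the number of fetches by using the GL_LINEAR mode available for texture sampling.

Basically, instead of sampling at every texel, sample between a pair of texels to get a linearly weighted sum directly.

Let us call the samples at offsets (-step_w, -step_h) and (-step_w, 0) x0 and x1 respectively. Then your sum is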

sum = x0*k0 + x1*k1

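Now, if you sample between these two texels, at a distance of k1/(k0+k1) from x0 and therefore k0/(k0+k1) from x1 (distances measured in units of the texel spacing, so the texel with the larger weight is the closer one), the GPU performs the linear weighting during the fetch and gives you

y = x0*k0/(k0+k1) + x1*k1/(k0+k1)

The sum can therefore be computed as

sum = y*(k0 + k1)

from a single fetch!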

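If you repeat this for the other neighboring pixels, you end up with 4 texture fetches covering the 8 neighboring offsets, plus one extra fetch for the center pixel: 5 fetches instead of 9.

This link explains it in more detail.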
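Applied to the kernel from the question (corners 1, edges 2, center -16, all divided by k), the idea could be sketched as the fragment shader below. It is a sketch under several assumptions: the texture uses GL_LINEAR filtering, fragments line up with texel centers, and the sampling step is exactly one texel (the question uses a step of 3.0/128.0, for which the samples are not adjacent texels and the trick does not apply directly). The varyings pairA to pairD are hypothetical names; following the first answer, they would be computed in the vertex shader from texelWidth and texelHeight. Each corner (weight 1) is paired with an adjacent edge (weight 2), so each sample point sits 2/3 of the way from the corner toward the edge.

```glsl
// Fragment shader sketch: kernel [1 2 1; 2 -16 2; 1 2 1] / k in five fetches.
// Assumed vertex shader outputs (hypothetical names):
//   pairA = textureCoordinate + vec2(-texelWidth,        texelHeight / 3.0);
//   pairB = textureCoordinate + vec2( texelWidth / 3.0,  texelHeight);
//   pairC = textureCoordinate + vec2( texelWidth,       -texelHeight / 3.0);
//   pairD = textureCoordinate + vec2(-texelWidth / 3.0, -texelHeight);
precision highp float;

uniform sampler2D inputImageTexture;
uniform float k;  // kernel divisor, as in the question

varying vec2 textureCoordinate;
varying vec2 pairA;
varying vec2 pairB;
varying vec2 pairC;
varying vec2 pairD;

const vec4 b = vec4(0.0, 0.0, 0.0, 1.0);  // outline color from the question

void main() {
    vec4 center = texture2D(inputImageTexture, textureCoordinate);

    // With linear filtering, each fetch returns (corner + 2.0 * edge) / 3.0.
    vec4 pairSum = texture2D(inputImageTexture, pairA)
                 + texture2D(inputImageTexture, pairB)
                 + texture2D(inputImageTexture, pairC)
                 + texture2D(inputImageTexture, pairD);

    // Multiplying by k0 + k1 = 3.0 restores corner + 2.0 * edge for each pair.
    vec4 sum = (3.0 * pairSum - 16.0 * center) / k;

    // Same final blend as the question's shader.
    gl_FragColor = sum * b + (vec4(1.0) - sum) * center;
}
```

Texture filtering hardware interpolates with limited sub-texel precision, so the result may differ slightly from the nine-fetch version.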

Archived:	13 years, 5 months ago
Views:	12578
Last activity:	13 years, 5 months ago