Strange performance behaviour of an SSAO algorithm using OpenGL and GLSL

Question

I am working on an SSAO (Screen Space Ambient Occlusion) algorithm using the Oriented-Hemisphere rendering technique.

I) The algorithm

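The algorithm requires the following inputs:

1 array containing precomputed samples (loaded before the main loop -> in my example I use 64 samples oriented along the z axis).
1 noise texture containing normalized rotation vectors, also oriented along the z axis (this texture is generated once).
2 textures from the GBuffer: 'PositionSampler' and 'NormalSampler', containing the positions and normal vectors in view space.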
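
For readers who have not built such a kernel before, the following is a minimal sketch (in Python, purely illustrative) of one common way to precompute samples in a hemisphere around the z axis. It is not the code used in this question, which does not show how its kernel is generated. The function name, the seed and the bias toward the origin are all assumptions, and any scaling by SSAORadius is left out.

```python
import math
import random

def make_hemisphere_kernel(count=64, seed=0):
    """Return `count` sample vectors inside the unit hemisphere around +z.

    Hypothetical helper: the name, seed and length bias are illustrative only.
    """
    rng = random.Random(seed)
    kernel = []
    while len(kernel) < count:
        # Random direction with z >= 0, i.e. in the hemisphere around the z axis.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        z = rng.uniform(0.0, 1.0)
        length = math.sqrt(x * x + y * y + z * z)
        if length < 1e-6:
            continue  # avoid dividing by (almost) zero
        # Bias sample lengths toward the origin: later samples reach further out.
        t = len(kernel) / count
        scale = 0.1 + 0.9 * t * t
        kernel.append((x / length * scale, y / length * scale, z / length * scale))
    return kernel
```

The resulting list would be uploaded once to the SSAOKernel uniform array before the main loop.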

Here is the fragment shader source code I use:

#version 400

/*
** Output color value.
*/
layout (location = 0) out vec4 FragColor;

/*
** Vertex inputs.
*/
in VertexData_VS
{
    vec2 TexCoords;

} VertexData_IN;

/*
** Projection Matrix (view space -> clip space).
*/
uniform mat4 ProjMatrix;

/*
** GBuffer samplers.
*/
uniform sampler2D PositionSampler;
uniform sampler2D NormalSampler;

/*
** Noise sampler.
*/
uniform sampler2D NoiseSampler;

/*
** Noise texture viewport.
*/
uniform vec2 NoiseTexOffset;

/*
** Ambient light intensity.
*/
uniform vec4 AmbientIntensity;

/*
** SSAO kernel + size.
*/
uniform vec3 SSAOKernel[64];
uniform uint SSAOKernelSize;
uniform float SSAORadius;

/*
** Computes Orientation matrix.
*/
mat3 GetOrientationMatrix(vec3 normal, vec3 rotation)
{
    vec3 tangent = normalize(rotation - normal * dot(rotation, normal)); //Gram-Schmidt process
    vec3 bitangent = cross(normal, tangent);

    return (mat3(tangent, bitangent, normal)); //Orientation according to the normal
}

/*
** Fragment shader entry point.
*/
void main(void)
{
    float OcclusionFactor = 0.0f;

    vec3 gNormal_CS = normalize(texture(
        NormalSampler, VertexData_IN.TexCoords).xyz * 2.0f - 1.0f); //Normal vector in view space from GBuffer
    vec3 rotationVec = normalize(texture(NoiseSampler,
        VertexData_IN.TexCoords * NoiseTexOffset).xyz * 2.0f - 1.0f); //Rotation vector required for Gram-Schmidt process

    vec3 Origin_VS = texture(PositionSampler, VertexData_IN.TexCoords).xyz; //Origin vertex in view space from GBuffer
    mat3 OrientMatrix = GetOrientationMatrix(gNormal_CS, rotationVec);

    for (int idx = 0; idx < SSAOKernelSize; idx++) //For each sample (64 iterations)
    {
        vec4 Sample_VS = vec4(Origin_VS + OrientMatrix * SSAOKernel[idx], 1.0f); //Sample translated in view space

        vec4 Sample_HS = ProjMatrix * Sample_VS; //Sample in homogeneous clip space
        vec3 Sample_CS = Sample_HS.xyz / Sample_HS.w; //Perspective divide (normalized device coordinates)
        vec2 texOffset = Sample_CS.xy * 0.5f + 0.5f; //Recover sample texture coordinates

        vec3 SampleDepth_VS = texture(PositionSampler, texOffset).xyz; //Sample depth in view space

        if (Sample_VS.z < SampleDepth_VS.z)
            if (length(Sample_VS.xyz - SampleDepth_VS) <= SSAORadius)
                OcclusionFactor += 1.0f; //Occlusion accumulation
    }
    OcclusionFactor = 1.0f - (OcclusionFactor / float(SSAOKernelSize));

    FragColor = vec4(OcclusionFactor);
    FragColor *= AmbientIntensity;
}

Here is the result (without a blur pass):

Up to this point everything seems to be correct.

II) Performance

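I noticed very strange performance behaviour in the NSight Debugger:

If I move the camera closer to the dragon, performance drops sharply.

In my opinion this should not happen, because the SSAO algorithm works in screen space and does not depend on the number of primitives in the dragon.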

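Here are 3 screenshots taken at 3 different camera positions (in all 3 cases the same algorithm runs the pixel shader on all 1024*768 pixels):

a) GPU idle: 40% (pixels affected: 100%)

b) GPU idle: 25% (pixels affected: 100%)

c) GPU idle: 2%! (pixels affected: 100%)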

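My rendering engine uses exactly 2 render passes in this example:

the Material Pass (fills the position and normal samplers)
the Ambient Pass (fills the SSAO texture)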

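I thought the problem came from the combined cost of these two passes, but that is not the case: I added a condition in my client code so that the Material Pass is not computed at all when the camera is stationary. So when I took the 3 screenshots above, only the Ambient Pass was executed, and the performance drop has nothing to do with the Material Pass. Another argument: if I remove the dragon mesh (leaving a scene with only the plane), the result is the same. The closer my camera gets to the plane, the larger the performance drop!

To me this behaviour is illogical! As I said above, in all 3 cases every pixel executes exactly the same pixel shader code!

I have also noticed another strange behaviour when I change a small piece of code directly in the fragment shader:

If I replace the line: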

FragColor = vec4(OcclusionFactor);

with the line:

FragColor = vec4(1.0f, 1.0f, 1.0f, 1.0f);

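the performance drop disappears!

This means that if the SSAO code is executed correctly (I tried placing some breakpoints during execution to check this) and I do not use OcclusionFactor at the end to fill the final output colour, there is no performance drop!

I think we can conclude that the problem does not come from the shader code before the line "FragColor = vec4(OcclusionFactor);"... I think.

How can this behaviour be explained?

I have tried many combinations of code in both the client code and the fragment shader, but I cannot find a solution to this problem! I am really lost.

Thank you very much for your help!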

Answer 1

dar*_*ari 5

The short answer is cache efficiency.

To understand this, let's look at the following lines of the inner loop:

    vec4 Sample_VS = vec4(Origin_VS + OrientMatrix * SSAOKernel[idx], 1.0f); //Sample translated in view space

    vec4 Sample_HS = ProjMatrix * Sample_VS; //Sample in homogeneous clip space
    vec3 Sample_CS = Sample_HS.xyz / Sample_HS.w; //Perspective divide (normalized device coordinates)
    vec2 texOffset = Sample_CS.xy * 0.5f + 0.5f; //Recover sample texture coordinates

    vec3 SampleDepth_VS = texture(PositionSampler, texOffset).xyz; //Sample depth in view space

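What you are doing here is:

translating the original point in view space
transforming it to clip space
sampling the texture

So how does that relate to cache efficiency?

Caches work well when neighbouring pixels are accessed. For example, if you use a Gaussian blur, you only access neighbours, which are very likely already loaded in the cache.

So let's say your object is far away. Then the pixels sampled in clip space are also very close to the origin -> high locality -> good cache performance.

If the camera is very close to your object, the generated sample points are farther apart (in clip space) and you get a random memory access pattern. This greatly reduces your performance even though you are not actually doing more operations.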
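
To make the locality argument concrete, here is a small sketch (in Python, purely illustrative) that estimates how large the sampling hemisphere appears on screen as the camera approaches the surface. The 60-degree vertical field of view and the radius of 0.5 view-space units are assumptions; only the 768-pixel height comes from the question.

```python
import math

def projected_radius_px(radius, distance, fovy_deg=60.0, height_px=768):
    """Approximate on-screen radius, in pixels, of a view-space sphere.

    Assumes a symmetric perspective projection; fovy_deg is a hypothetical value.
    """
    f = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)  # focal length term of the projection
    return radius * f / distance * (height_px / 2.0)

# Same kernel radius, three camera distances (far, medium, near).
for distance in (20.0, 5.0, 1.0):
    print(distance, round(projected_radius_px(0.5, distance), 1))
```

The footprint is inversely proportional to the distance, so moving the camera 20 times closer spreads the same 64 fetches over a region 20 times wider in each direction, which is where the texture cache stops helping.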

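Edit:

To improve performance, you can reconstruct the view-space position from the depth buffer of the previous pass.

If you are using a 32-bit depth buffer, this reduces the amount of data needed for one sample from 12 bytes to 4 bytes.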
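
As a rough upper bound on what that saves, the sketch below counts the bytes requested per frame for the question's 1024*768 target and 64 samples. It ignores caching entirely, and the 12-byte figure assumes a position texture storing three 32-bit floats per texel; the function name is hypothetical.

```python
def bytes_requested_per_frame(width, height, samples_per_pixel, bytes_per_sample):
    """Texel bytes requested by the SSAO loop in one frame, before any cache hits."""
    return width * height * samples_per_pixel * bytes_per_sample

MIB = 2 ** 20
position_texture = bytes_requested_per_frame(1024, 768, 64, 12)  # xyz as 3 x 32-bit float
depth_texture = bytes_requested_per_frame(1024, 768, 64, 4)      # single 32-bit depth value

print(position_texture // MIB, "MiB vs", depth_texture // MIB, "MiB")
```

That is 576 MiB versus 192 MiB of requested texel data per frame, so each cache line fetched from the depth texture covers three times as many samples.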

The position reconstruction looks like this:

uniform sampler2D depthTexture; //depth buffer of the previous pass
uniform mat4 invProj; //inverse projection matrix

vec4 reconstruct_vs_pos(vec2 tc){
  float depth = texture(depthTexture,tc).x;
  vec4 p = vec4(vec3(tc, depth) * 2.0f - 1.0f, 1.0f); //transformed to unit cube [-1,1]^3
  vec4 p_vs = invProj * p; //back to view space (still homogeneous)
  return p_vs / p_vs.w;
}
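
A quick way to sanity-check this kind of reconstruction is to replay it on the CPU: project a known view-space point, convert it to texture coordinates and depth, then unproject it again. The sketch below does that in Python. It assumes a gluPerspective-style projection matrix and the default OpenGL depth range of [0,1]; all helper names are hypothetical.

```python
import math

def perspective(fovy_deg, aspect, near, far):
    """Row-major OpenGL-style perspective matrix (gluPerspective convention)."""
    f = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)
    c = (far + near) / (near - far)
    d = 2.0 * far * near / (near - far)
    return [[f / aspect, 0.0, 0.0, 0.0],
            [0.0, f, 0.0, 0.0],
            [0.0, 0.0, c, d],
            [0.0, 0.0, -1.0, 0.0]]

def inverse_perspective(p):
    """Closed-form inverse, valid only for matrices built by perspective()."""
    a, b, c, d = p[0][0], p[1][1], p[2][2], p[2][3]
    return [[1.0 / a, 0.0, 0.0, 0.0],
            [0.0, 1.0 / b, 0.0, 0.0],
            [0.0, 0.0, 0.0, -1.0],
            [0.0, 0.0, 1.0 / d, c / d]]

def mul(m, v):
    """4x4 matrix times 4-component vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]

def to_texcoord_and_depth(proj, pos_vs):
    """What the GBuffer pass effectively stores: (u, v, depth), each in [0,1]."""
    clip = mul(proj, [pos_vs[0], pos_vs[1], pos_vs[2], 1.0])
    return [0.5 * (clip[i] / clip[3]) + 0.5 for i in range(3)]

def reconstruct_vs_pos(inv_proj, u, v, depth):
    """CPU mirror of the shader function: [0,1] -> [-1,1], unproject, divide by w."""
    p = mul(inv_proj, [2.0 * u - 1.0, 2.0 * v - 1.0, 2.0 * depth - 1.0, 1.0])
    return [p[i] / p[3] for i in range(3)]

proj = perspective(60.0, 1024.0 / 768.0, 0.1, 100.0)
u, v, depth = to_texcoord_and_depth(proj, [0.3, -0.2, -5.0])
print(reconstruct_vs_pos(inverse_perspective(proj), u, v, depth))
```

If the round trip does not return the original point, the usual suspects are the sign in the [0,1] to [-1,1] remap or a depth range that differs from the default.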
