cgroups memory 16GB limit

Question

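I am trying to use cgroups to limit the memory usage of user processes on servers with a large amount of RAM (128 GB or more). What we want to achieve is to reserve about 6GB of RAM for the OS and root processes and leave the rest to users. We want to make sure we always have free memory, and we don't want the servers to swap aggressively.

This works fine if the limit is set low enough (< 16GB). User processes are correctly assigned to the right cgroup by cgred, and once the limit is reached, the OOM killer terminates the memory-hungry processes.

The problem arises when we set the limit higher. Then, if a process uses more than 16GB of RAM, the server starts swapping, even though memory usage is still well below the limit and plenty of RAM is available.

Is there any setting or some sort of maximum that would limit the amount of memory we can grant access to under cgroups?

Here is more info:

I use the following code to simulate user processes eating memory. The code keeps track of the allocated memory in a linked list so that the memory is actually used and reachable from within the program, rather than merely reserved with malloc (with the pointer overwritten each time).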

/* Contents of gragram.c */

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

/* List node that keeps every allocated block reachable */
struct testlink {
    void *ram;
    struct testlink *next;
};

int main (int argc, char *argv[]) {

    int block=8192;
    char buf[block];
    void *ram=NULL;
    FILE *frandom;
    int nbproc,i;
    pid_t pID;
    struct testlink *pstart, *pcurr, *pnew;

    if (argc < 2) {
        // nbproc = 1 by default
        nbproc=1;
    } else {
        if (sscanf(argv[1], "%d", &nbproc) != 1) {
            /* it is an error */
            printf("Failed to set number of child processes\n");
            return -1;
        }
    }

    // open /dev/urandom for reading
    frandom = fopen("/dev/urandom", "r");
    if ( frandom == NULL ) {
        printf("I can't open /dev/urandom, giving up\n");
        return -1;
    }

    // fill the template buffer with one block of random data
    if ( fread(buf, block, 1, frandom) != 1 ) {
        // we could not read a full block, give up
        printf("Error reading from urandom\n");
        return -1;
    }
    fclose (frandom);

    // pID == 0 => child, pID < 0 => error, pID > 0 => parent
    for (i=1; i<nbproc; i++) {
        pID = fork();
        // break out of the loop if a child
        if (pID == 0)
            break;
        // exit if fork fails
        if (pID < 0) {
            printf("fork() failed, dying \n");
            return -1;
        }
    }

    // head of the linked list
    pstart = (struct testlink*)malloc(sizeof(struct testlink));
    if (pstart == NULL) {
        printf("can't allocate memory\n");
        return -1;
    }
    pstart->ram=NULL;
    pstart->next=NULL;
    pcurr = pstart;

    // allocate and touch memory until malloc fails or we get killed
    while ( 1==1 ) {
        ram = (void *)malloc(block);
        if (ram == NULL) {
            printf("can't allocate memory\n");
            return -1;
        }

        // copy data into the block so its pages are really used
        memcpy(ram, buf, block);

        // store allocated blocks of ram in a linked list
        // so no one thinks we are not using them
        pcurr->ram = ram;
        pnew = (struct testlink*)malloc(sizeof(struct testlink));
        if (pnew == NULL) {
            printf("can't allocate memory\n");
            return -1;
        }
        pnew->ram=NULL;
        pnew->next=NULL;
        pcurr->next=pnew;
        pcurr=pnew;
    }

    return 0;

}


So far I have tried setting the following tunables:

vm.overcommit_memory
vm.overcommit_ratio
vm.swappiness
vm.dirty_ratio
vm.dirty_background_ratio
vm.vfs_cache_pressure

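None of these sysctl settings seems to have any effect. The server starts swapping once my code above crosses the 16GB barrier, even with swappiness set to 0, overcommit disabled, and so on. I even tried turning swap off, to no avail. Even without swap, kswapd is still triggered and performance degrades.

Finally, here is the relevant content of the cgconfig.conf file: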

mount {
  cpuset  = /cgroup/computenodes;
  cpu = /cgroup/computenodes;
  memory  = /cgroup/computenodes;
}


#limit = 120G
group computenodes {
# set memory.memsw the same so users can't use swap
  memory {
    memory.limit_in_bytes = 120G;
    memory.memsw.limit_in_bytes = 120G;
    memory.swappiness = 0;
#    memory.use_hierarchy = 1;
  }

# No alternate memory nodes if the system is not NUMA
# On computenodes use all available cores
    cpuset {
        cpuset.mems="0";
        cpuset.cpus="0-47";
    }
}


Finally, we are using CentOS 6 with kernel 2.6.32.

Thanks

Answer 1

Mat*_*Ife 5

**Note: undeleted for posterity**

Your problem is here:

# No alternate memory nodes if the system is not NUMA
# On computenodes use all available cores
    cpuset {
        cpuset.mems="0";
        cpuset.cpus="0-47";
    }
}


You are only ever using one memory node (cpuset.mems="0"). You need to set this to use all memory nodes.
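To see how many memory nodes the machine really has before editing cpuset.mems, something like the following rough C sketch could help. It assumes the kernel exposes the online node list at /sys/devices/system/node/online in the usual list syntax (for example `0-7`), the same syntax cpuset.mems takes. The helper names count_ids_in_list and count_online_nodes are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: count the IDs in a kernel-style list such as
 * "0-7" or "0,2-3". Returns -1 if the list is malformed. */
int count_ids_in_list(const char *list) {
    int count = 0;
    const char *p = list;

    assert(list != NULL);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi;

        if (end == p || lo < 0)
            return -1;              /* no number where one was expected */
        hi = lo;
        p = end;
        if (*p == '-') {            /* a range "lo-hi" */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
            p = end;
        }
        count += (int)(hi - lo + 1);
        if (*p == ',')
            p++;                    /* move on to the next list element */
        else if (*p != '\0' && *p != '\n')
            return -1;              /* unexpected character */
    }
    return count;
}

/* Assumed path; returns the number of online NUMA nodes or -1 on error. */
int count_online_nodes(void) {
    char line[256];
    FILE *f = fopen("/sys/devices/system/node/online", "r");

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return count_ids_in_list(line);
}
```

If count_online_nodes() reports more than one node while cpuset.mems is "0", the cgroup is confined to a fraction of the installed RAM.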

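I also think the following still applies, and you will keep seeing problems unless you are aware of it. So I am leaving it here for posterity.

This issue basically boils down to the hardware being used. The kernel has a heuristic to determine the value of the zone_reclaim_mode switch, which alters how the kernel deals with memory pressure on a NUMA system.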

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1   = Zone reclaim on
2   = Zone reclaim writes dirty pages out
4   = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

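Since the value is a bitmask, it can help to decode it rather than read it as a plain number. Below is a minimal sketch in C that uses only the three bit values from the documentation quoted above. The function name describe_zone_reclaim_mode and the short flag labels are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit values as listed in the documentation quoted above. */
#define ZONE_RECLAIM_ON    1   /* zone reclaim on */
#define ZONE_RECLAIM_WRITE 2   /* zone reclaim writes dirty pages out */
#define ZONE_RECLAIM_SWAP  4   /* zone reclaim swaps pages */

/* Append a flag name to out, separated by '+' from earlier flags. */
static void append_flag(char *out, size_t outlen, const char *name) {
    size_t used = strlen(out);
    snprintf(out + used, outlen - used, "%s%s", used ? "+" : "", name);
}

/* Hypothetical helper: write a short description of mode into out and
 * return how many of the three documented bits are set. Bits other
 * than 1, 2 and 4 are ignored. */
int describe_zone_reclaim_mode(int mode, char *out, size_t outlen) {
    int flags = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    if (mode & ZONE_RECLAIM_ON)    { append_flag(out, outlen, "reclaim"); flags++; }
    if (mode & ZONE_RECLAIM_WRITE) { append_flag(out, outlen, "write");   flags++; }
    if (mode & ZONE_RECLAIM_SWAP)  { append_flag(out, outlen, "swap");    flags++; }
    if (flags == 0)
        snprintf(out, outlen, "off");
    return flags;
}
```

For instance, a value of 5 decodes to "reclaim+swap": zone reclaim is on and is allowed to swap pages.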

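To give you an idea of what is going on: memory is divided into zones, which is particularly useful on NUMA systems where RAM is tied to specific CPUs. On such hosts, memory locality can be an important factor in performance. For example, if memory banks 1 and 2 are assigned to physical CPU 0, CPU 1 can still access them, but at the cost of locking that RAM away from CPU 0, which degrades performance.

On Linux, the zoning reflects the NUMA layout of the physical machine. In this case, each zone is 16GB in size.

What currently happens with zone reclaim enabled is that the kernel opts to reclaim within the full zone (16GB) by writing dirty pages to disk, evicting file cache and swapping memory out, rather than allowing the process to allocate memory in another zone (which would affect performance on that CPU). This is why you notice swapping after 16GB.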
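One way to confirm this is to watch free memory per node while the test runs. The C sketch below assumes the per-node files /sys/devices/system/node/nodeN/meminfo exist and contain lines of the form `Node 0 MemFree:  1024 kB`; both the path and the line format are assumptions about your kernel, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: extract MemFree (in kB) from one line of a
 * per-node meminfo file. Returns 0 if the line matched, -1 otherwise. */
int parse_node_memfree(const char *line, int *node,
                       unsigned long long *free_kb) {
    assert(line != NULL && node != NULL && free_kb != NULL);
    return sscanf(line, "Node %d MemFree: %llu kB", node, free_kb) == 2
               ? 0 : -1;
}

/* Look up MemFree for a single node; the path is an assumption.
 * Returns 0 on success, -1 if the file or the line is missing. */
int node_memfree_kb(int node, unsigned long long *free_kb) {
    char path[64], line[256];
    int parsed_node, found = -1;
    FILE *f;

    snprintf(path, sizeof path,
             "/sys/devices/system/node/node%d/meminfo", node);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_node_memfree(line, &parsed_node, free_kb) == 0) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

If one node drops to almost no free memory while the others stay largely empty at the moment swapping starts, that matches the behaviour described above.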

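If you switch this value off, the kernel's behaviour should change: instead of aggressively reclaiming zone data, it will allocate from another node.

Try switching zone_reclaim_mode off by running sysctl -w vm.zone_reclaim_mode=0 on your system, then re-run your test.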
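To verify that the setting took effect, the current value can be read back from /proc/sys/vm/zone_reclaim_mode, which is where the vm.zone_reclaim_mode sysctl lives. A small sketch in C, with the function names read_int_file and zone_reclaim_is_off chosen only for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_int_file(const char *path, int *value) {
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if zone reclaim is off, 0 if it is on,
 * and -1 if the value could not be read. */
int zone_reclaim_is_off(void) {
    int mode;

    if (read_int_file("/proc/sys/vm/zone_reclaim_mode", &mode) != 0)
        return -1;
    return mode == 0;
}
```

Note that sysctl -w only changes the running system; the value does not survive a reboot unless it is also set in the sysctl configuration.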

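Be aware that long-running, high-memory processes on a configuration like this with zone_reclaim_mode off will become increasingly expensive over time.

If you allow many different processes, running on many different CPUs and all using lots of memory, to allocate from any node that has free pages, you can effectively reduce the performance of the host to something like that of a machine with only 1 physical CPU.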

Archived:	11 years, 11 months ago
Views:	1010
Last recorded:	11 years, 11 months ago