Byr*_*ron 38 c++ memory multithreading openmp fragmentation
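When allocating and freeing randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. Its memory consumption grows from 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what looks like a memory leak is actually an amplified effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 KB (from 1 MB), the effect gets weaker.
Can heavy concurrency amplify fragmentation that much? Or is this more likely to be a bug in the heap?
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are freed until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All of this work is done for each thread spawned by OpenMP.
This memory allocation scheme lets us expect a memory consumption of about 260 MB per thread (including some bookkeeping data).
As this is something you might want to test yourself, you can download the sample program with a simple makefile from dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows: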
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>   // uint64_t (replaces a hand-written typedef that clashes with the system one)
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000; // alloc's per thread
    const int NUM_THREADS = 4; // how many threads?
    const int NUM_ITERS = NUM_THREADS;// how many overall repetions
    const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false; // debug output

    // pre store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256; // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE; // use up to x MB chunks
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;   // cumulative bytes requested by this iteration
        uint64_t currAllocBytes = 0;    // bytes currently held
        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];
            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));

            totalAllocBytes += allocSize;
            currAllocBytes += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1); // terminates the process here with status 1, so main's return is never reached
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}
From what I have seen so far, the hardware matters a lot. The test might need adjustments if it is run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
This yields rather impressive fragmentation or leakage behaviour. The expected memory consumption with 4 threads is 1090 MB, which grew to 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
seh*_*ehe 22
OK, I took the bait.
This is on a system with
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
4x5666.59 bogomips
Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux
gcc version 4.4.5
total used free shared buffers cached
Mem: 8127172 4220560 3906612 0 374328 2748796
-/+ buffers/cache: 1097436 7029736
Swap: 0 0 0
I ran
time ./ompmemtest
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 1 about to release all memory: 257.339 MB
Id 2 about to release all memory: 257.043 MB
Id 1 done, total alloc'ed -1570.42MB
Id 2 done, total alloc'ed -1569.96MB
real 0m13.429s
user 0m44.619s
sys 0m6.000s
Nothing spectacular. Here is the simultaneous output of vmstat -S M 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
0 0 0 3892 364 2669 0 0 24 0 701 1487 2 1 97 0
4 0 0 3421 364 2669 0 0 0 0 1317 1953 53 7 40 0
4 0 0 2858 364 2669 0 0 0 0 2715 5030 79 16 5 0
4 0 0 2861 364 2669 0 0 0 0 6164 12637 76 15 9 0
4 0 0 2853 364 2669 0 0 0 0 4845 8617 77 13 10 0
4 0 0 2848 364 2669 0 0 0 0 3782 7084 79 13 8 0
5 0 0 2842 364 2669 0 0 0 0 3723 6120 81 12 7 0
4 0 0 2835 364 2669 0 0 0 0 3477 4943 84 9 7 0
4 0 0 2834 364 2669 0 0 0 0 3273 4950 81 10 9 0
5 0 0 2828 364 2669 0 0 0 0 3226 4812 84 11 6 0
4 0 0 2823 364 2669 0 0 0 0 3250 4889 83 10 7 0
4 0 0 2826 364 2669 0 0 0 0 3023 4353 85 10 6 0
4 0 0 2817 364 2669 0 0 0 0 3176 4284 83 10 7 0
4 0 0 2823 364 2669 0 0 0 0 3008 4063 84 10 6 0
0 0 0 3893 364 2669 0 0 0 0 4023 4228 64 10 26 0
Does that information mean anything to you?
Now for the real fun, add a little spice
time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
real 0m11.663s
user 0m44.255s
sys 0m1.028s
Looks faster, doesn't it?
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
4 0 0 3562 364 2684 0 0 0 0 1041 1676 28 7 64 0
4 2 0 2806 364 2684 0 0 0 172 1641 1843 84 14 1 0
4 0 0 2758 364 2685 0 0 0 0 1520 1009 98 2 1 0
4 0 0 2747 364 2685 0 0 0 0 1504 859 98 2 0 0
5 0 0 2745 364 2685 0 0 0 0 1575 1073 98 2 0 0
5 0 0 2739 364 2685 0 0 0 0 1415 743 99 1 0 0
4 0 0 2738 364 2685 0 0 0 0 1526 981 99 2 0 0
4 0 0 2731 364 2685 0 0 0 684 1536 927 98 2 0 0
4 0 0 2730 364 2685 0 0 0 0 1584 1010 99 1 0 0
5 0 0 2730 364 2685 0 0 0 0 1461 917 99 2 0 0
4 0 0 2729 364 2685 0 0 0 0 1561 1036 99 1 0 0
4 0 0 2729 364 2685 0 0 0 0 1406 756 100 1 0 0
0 0 0 3819 364 2685 0 0 0 4 1159 1476 26 3 71 0
That is in case you want to compare the vmstat outputs.
Valgrind --tool massif
This is the head of the output from ms_print after valgrind --tool=massif ./ompmemtest (using the default malloc):
--------------------------------------------------------------------------------
Command: ./ompmemtest
Massif arguments: (none)
ms_print arguments: massif.out.beforetcmalloc
--------------------------------------------------------------------------------
GB
1.009^ :
| ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@:::
| # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
| # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
| :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
| :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::
| :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
| ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
0 +----------------------------------------------------------------------->Gi
0 264.0
Number of snapshots: 63
Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses mid-race to heap profiling with google-perftools
gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest
time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
Starting tracking the heap
Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB
Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)
real 0m11.981s
user 0m44.455s
sys 0m1.124s
Re: the comments: I updated the program
--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp 2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
void runParallelAllocTest()
{
// constants
- const int NUM_ALLOCATIONS = 5000; // alloc's per thread
- const int NUM_THREADS = 4; // how many threads?
+ const int NUM_ALLOCATIONS = 55000; // alloc's per thread
+ const int NUM_THREADS = 8; // how many threads?
const int NUM_ITERS = NUM_THREADS;// how many overall repetions
const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)
It ran for over 5m3s. Close to the end, a screenshot of htop shows that the reserved set (RES) is indeed slightly higher, heading towards 2.3g:
1 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Tasks: 125 total, 2 running
2 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Load average: 8.09 5.24 2.37
3 [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%] Uptime: 01:54:22
4 [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]
Mem[||||||||||||||||||||||||||||||| 3055/7936MB]
Swp[ 0/0MB]
PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
4330 sehe 8 20 0 2635M 2286M 908 R 368. 28.8 15:35.01 ./ompmemtest
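Comparing with a tcmalloc run: 4m12s, and similar top stats with only minor differences. The big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process). The RES set is quite similar, if you ask me. More important to note is that parallelism has increased; all cores are now maxed out. This is evidently due to the reduced need to lock for heap operations when using tcmalloc:
If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.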
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 172 total, 2 running
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Load average: 7.39 2.92 1.11
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Uptime: 11:12:25
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||||||||||||||||||||||||||| 3278/7936MB]
Swp[ 0/0MB]
PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
14391 sehe 8 20 0 2251M 2179M 1148 R 379. 27.5 8:08.92 ./ompmemtest