Tags: c++, multithreading, valgrind, openmp, intrinsics
I am reading an array in parallel using OpenMP. Below is a minimal reproducible example:
#include <cstdint>
#include <cstdlib>
#include <immintrin.h>
#include <iostream>
#include <memory>
#include <omp.h>

int main(int argc, char* argv[]){
    // align to cache line, which is 512 bits or 64 bytes
    size_t actualSize = 2048;
    uint8_t* array = static_cast<uint8_t *>(aligned_alloc(64, actualSize));
    for(size_t i = 0; i < actualSize; i++){
        // initialize values
        array[i] = rand() % 256;
    }
    __m256i sum_v = _mm256_setzero_si256();
    #pragma omp parallel for
    for (size_t i = 0; i < actualSize; i+=32){
        __m256i v1 = _mm256_load_si256((const __m256i *) array+i);
        // I understand that there is a race condition here, but I'm just
        // concerned with the memory leaks
        sum_v = _mm256_add_epi8(sum_v, v1);
    }
    // just to keep the compiler from optimizing out sum_v
    uint8_t result = _mm256_extract_epi8(sum_v, 0);
    std::cout << "result: " << result << std::endl;
    free(array);
    return 0;
}
This is an attempt to measure memory bandwidth on my machine; I will eventually time this loop for different values of actualSize.
I compile this with g++ -Wall -g -std=c++1y -march=native -mtune=native -fopenmp -O3 minimal-memleaks.cpp. When I run the program with valgrind ./a.out, I get a memory leak, part of which is copied below:
==7688== Thread 8:
==7688== Invalid read of size 32
==7688== at 0x108D30: _mm256_add_epi8 (avx2intrin.h:107)
==7688== by 0x108D30: main._omp_fn.0 (minimal-memleaks.cpp:25)
==7688== by 0x51DB95D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==7688== by 0x5FA66DA: start_thread (pthread_create.c:463)
==7688== by 0x551588E: clone (clone.S:95)
==7688== Address 0x61e1980 is 29,280 bytes inside an unallocated block of size 4,077,760 in arena "client"
The full output is available here: https://pastebin.com/qr0W9FGD
I can't see why. At first I thought the loop was reading past the 2048 bytes I allocated, but my math says it shouldn't: I read in blocks of 32 bytes, and the loop stops once i reaches 2048. I also thought the main thread might be exiting before the worker threads, but my research suggests the main thread does not continue past the #pragma omp parallel for region until the threads it spawned have finished. Is that incorrect?
Thank you for any help you can provide.
This is not a memory leak. You're reading memory out of bounds.
for (size_t i = 0; i < actualSize; i+=32){
    __m256i v1 = _mm256_load_si256((const __m256i *) array+i);
You're running off the end of the array here. actualSize is the size of your allocated array in bytes.
__m256i is a data type that's 32 bytes long.
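(As a side check, and not part of the original answer, here's a one-line sketch you can drop into a source file to confirm that size on your toolchain:)
#include <immintrin.h>
static_assert(sizeof(__m256i) == 32, "__m256i is one 256-bit AVX register, i.e. 32 bytes");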
(const __m256i *) array
This converts the pointer into a pointer to a 32-byte object.
The way pointer addition works in C++ is that adding one to a pointer advances it to the next object, so
(const __m256i *) array + 1
is where the next 32-byte object is, which is 32 bytes after array.
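To make that concrete, here is a small standalone sketch (not part of the question's program) comparing the two kinds of pointer arithmetic; the printed addresses are illustrative:
#include <cstdint>
#include <immintrin.h>
#include <iostream>

int main(){
    alignas(64) uint8_t array[2048] = {};
    // uint8_t* arithmetic moves in 1-byte steps
    std::cout << (void*)(array + 1) << "\n";                  // base + 1 byte
    // __m256i* arithmetic moves in 32-byte steps
    std::cout << (void*)((const __m256i*)array + 1) << "\n";  // base + 32 bytes
    // So (const __m256i*)array + i is really byte offset 32*i, not byte offset i.
    return 0;
}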
So work out which addresses your for loop ends up reading: (const __m256i *) array + i is byte offset 32*i, so the iteration with i == 64 already starts reading at offset 2048, just past the end of the allocation, and the last iteration (i == 2016) reads at offset 64512. The loop runs far off the end of your array into never-never land, and valgrind barks at you because of that.
Your for loop should probably be:
for (size_t i = 0; i < actualSize/32; ++i){
    __m256i v1 = _mm256_load_si256((const __m256i *) array + i);
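If you'd rather keep i counting in bytes so the bound stays actualSize, another option is to do the addition on the uint8_t* first and only then cast. A minimal sketch of that variant (assuming the rest of the loop body is unchanged from the question):
#pragma omp parallel for
for (size_t i = 0; i < actualSize; i += 32){
    // array + i advances i bytes; cast the already-offset pointer to a vector pointer
    __m256i v1 = _mm256_load_si256((const __m256i *)(array + i));
    sum_v = _mm256_add_epi8(sum_v, v1);   // still has the race the question mentions
}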