Lock-free fetch_add much slower than mutex?

Ken*_*eth -1 c++ mutex lock-free

My test shows that a lock-free atomic fetch_add is much slower than a mutex. I am running 10 parallel threads, each adding 100 million to a shared counter, using relaxed memory order. Did I do the atomic wrong?

output from test:

fetch_add elapsed time: 4.14048 secs
fetch_add elapsed time: 5.37776 secs
fetch_add elapsed time: 5.37771 secs
fetch_add elapsed time: 5.37791 secs
fetch_add elapsed time: 5.37865 secs
fetch_add elapsed time: 7.85113 secs
fetch_add elapsed time: 7.93542 secs
fetch_add elapsed time: 7.96275 secs
fetch_add elapsed time: 7.97306 secs
Result from fetch_add:1000000000
Lock elapsed time: 0.214009 secs
Lock elapsed time: 0.418662 secs
Lock elapsed time: 0.633993 secs
Lock elapsed time: 0.836704 secs
Lock elapsed time: 1.04225 secs
Lock elapsed time: 1.24322 secs
Lock elapsed time: 1.44627 secs
Lock elapsed time: 1.64898 secs
Lock elapsed time: 1.8539 secs
Lock elapsed time: 2.05816 secs
Result from lock:1000000000

Testing code:

#include <atomic>
#include <cstdlib>
#include <time.h>       /* time */
#include <iostream>
#include <thread>
#include <mutex>
#include <chrono> 

const int BILLION = 1000000000L;
std::mutex my_mutex;
int global_val = 0;
std::atomic<unsigned long> global_a_val(0);

void LockFree(int length) {
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int i = 0; i < length; i++)
    global_a_val.fetch_add(1, std::memory_order_relaxed);
  clock_gettime(CLOCK_MONOTONIC, &end);

  auto diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
  std::cout<<"fetch_add elapsed time: "<< (double)diff/BILLION <<" secs\n";
}

void Lock(int length) {
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  {
    const std::lock_guard<std::mutex> my_lock(my_mutex);
    for (int i = 0; i < length; i++) {
      auto pre = ++global_val;
    }
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  auto diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
  std::cout<<"Lock elapsed time: "<< (double)diff/BILLION <<" secs\n";
}

int main() {
  const int thread_count = 10;
  const int count = 100000000;
  std::thread t[thread_count];
  for (int i = 0; i < thread_count; i++) {
    t[i] = std::thread(LockFree, count);
  }

  for (int i = 0; i < thread_count; i++) {
    t[i].join();
  }
  std::cout<<"Result from fetch_add:"<<global_a_val<<"\n";
  
  std::thread t2[thread_count];
  for (int i = 0; i < thread_count; i++) {
    t2[i] = std::thread(Lock, count);
  }

  for (int i = 0; i < thread_count; i++) {
    t2[i].join();
  }
  std::cout<<"Result from lock:"<<global_val<<"\n";
}

Afs*_*hin 6

The mutex version looks faster only because you lock the mutex once. In this code:

  {
    const std::lock_guard<std::mutex> my_lock(my_mutex);
    for (int i = 0; i < length; i++) {
      auto pre = ++global_val;
    }
  }

you acquire the mutex a single time and then do plain, non-atomic increments inside the loop, whereas fetch_add performs an atomic read-modify-write on every iteration. A plain add is naturally faster than an atomic add, so this comparison is not fair. To compare them correctly, the lock must be taken inside the loop:

  {   
    for (int i = 0; i < length; i++) {
      const std::lock_guard<std::mutex> my_lock(my_mutex);
      auto pre = ++global_val;
    }
  }
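As a side note, here is a minimal self-contained sketch of that fair comparison (the names `atomic_worker`, `mutex_worker`, and `run` are mine, not from the question). Both variants now pay their synchronization cost on every single iteration, which is what makes their timings comparable:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Shared counters for the two variants.
std::atomic<unsigned long> atomic_counter{0};
unsigned long mutex_counter = 0;
std::mutex counter_mutex;

// Atomic variant: one atomic read-modify-write per iteration.
void atomic_worker(int iters) {
  for (int i = 0; i < iters; ++i)
    atomic_counter.fetch_add(1, std::memory_order_relaxed);
}

// Mutex variant: lock/unlock on every iteration, so each increment
// pays the full lock cost, just as fetch_add pays the RMW cost.
void mutex_worker(int iters) {
  for (int i = 0; i < iters; ++i) {
    std::lock_guard<std::mutex> lock(counter_mutex);
    ++mutex_counter;
  }
}

// Run `nthreads` threads of `fn`, each doing `iters` increments.
template <typename F>
void run(F fn, int nthreads, int iters) {
  std::vector<std::thread> threads;
  for (int i = 0; i < nthreads; ++i) threads.emplace_back(fn, iters);
  for (auto& t : threads) t.join();
}
```

With `run(atomic_worker, 10, N)` versus `run(mutex_worker, 10, N)`, both totals come out to `10 * N`, and the timing difference now reflects the per-operation synchronization cost rather than loop structure.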

For reference, your original code gives the following times on my VM:

fetch_add elapsed time: 11.5873 secs
fetch_add elapsed time: 12.0809 secs
fetch_add elapsed time: 12.2435 secs
fetch_add elapsed time: 12.5315 secs
fetch_add elapsed time: 12.6408 secs
fetch_add elapsed time: 12.7246 secs
fetch_add elapsed time: 12.8112 secs
fetch_add elapsed time: 12.813 secs
fetch_add elapsed time: 12.8739 secs
fetch_add elapsed time: 12.9213 secs
Result from fetch_add:1000000000
Lock elapsed time: 0.259349 secs
Lock elapsed time: 0.51883 secs
Lock elapsed time: 0.779507 secs
Lock elapsed time: 1.03916 secs
Lock elapsed time: 1.3016 secs
Lock elapsed time: 1.56413 secs
Lock elapsed time: 1.82421 secs
Lock elapsed time: 2.08337 secs
Lock elapsed time: 2.34295 secs
Lock elapsed time: 2.60248 secs
Result from lock:1000000000

but after the change I mentioned, the results look like this:

fetch_add elapsed time: 11.4242 secs
fetch_add elapsed time: 11.9451 secs
fetch_add elapsed time: 11.9901 secs
fetch_add elapsed time: 12.0678 secs
fetch_add elapsed time: 12.1317 secs
fetch_add elapsed time: 12.505 secs
fetch_add elapsed time: 12.5591 secs
fetch_add elapsed time: 12.7394 secs
fetch_add elapsed time: 12.8966 secs
fetch_add elapsed time: 12.9199 secs
Result from fetch_add:1000000000
Lock elapsed time: 70.9624 secs
Lock elapsed time: 70.9993 secs
Lock elapsed time: 71.0727 secs
Lock elapsed time: 71.0932 secs
Lock elapsed time: 71.0966 secs
Lock elapsed time: 71.1221 secs
Lock elapsed time: 71.143 secs
Lock elapsed time: 71.1462 secs
Lock elapsed time: 71.158 secs
Lock elapsed time: 71.1588 secs
Result from lock:1000000000
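If the goal is raw throughput rather than a fair micro-benchmark, the usual fix is to avoid contending on the shared counter at all: accumulate in a thread-local variable and publish it with a single fetch_add at the end. This is a sketch of that idea, not code from the question (the names `local_then_publish` and `run_counters` are mine):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<unsigned long> total{0};

// Each thread counts privately (no sharing, no cache-line
// ping-pong) and touches the shared atomic exactly once.
void local_then_publish(int iters) {
  unsigned long local = 0;
  for (int i = 0; i < iters; ++i)
    ++local;  // plain increment; a compiler may even fold this loop away
  total.fetch_add(local, std::memory_order_relaxed);  // one RMW per thread
}

// Spawn `nthreads` threads of `iters` increments each and
// return the combined count.
unsigned long run_counters(int nthreads, int iters) {
  std::vector<std::thread> threads;
  for (int i = 0; i < nthreads; ++i)
    threads.emplace_back(local_then_publish, iters);
  for (auto& t : threads) t.join();
  return total.load();
}
```

With 10 threads of 100 million increments this performs 10 atomic operations in total instead of 1 billion, so it should run orders of magnitude faster than either contended version above.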