I wanted to use NTT for fast squaring (see Fast bignum square computation), but the result is slow even for really big numbers (more than 12000 bits).
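For context, this is roughly what I mean by NTT-based squaring. It is only a minimal sketch and not my actual code: the modulus 998244353, the primitive root 3, the 8-bit digit size, and the names `mulmod`, `powmod`, `ntt` and `sqr_digits` are all assumptions picked for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed parameters for this sketch: P = 119*2^23 + 1 is prime, 3 is a primitive root.
static const uint32_t P = 998244353u;
static const uint32_t G = 3u;

static uint32_t mulmod(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b % P);
}

static uint32_t powmod(uint32_t a, uint32_t e) {
    uint32_t r = 1;
    while (e) {
        if (e & 1) r = mulmod(r, a);
        a = mulmod(a, a);
        e >>= 1;
    }
    return r;
}

// In-place iterative NTT; a.size() must be a power of two not larger than 2^23.
static void ntt(std::vector<uint32_t>& a, bool inverse) {
    const size_t n = a.size();
    // Bit-reversal permutation.
    for (size_t i = 1, j = 0; i < n; ++i) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly passes.
    for (size_t len = 2; len <= n; len <<= 1) {
        uint32_t w = powmod(G, (P - 1) / (uint32_t)len);
        if (inverse) w = powmod(w, P - 2);   // modular inverse of the root
        for (size_t i = 0; i < n; i += len) {
            uint32_t wn = 1;
            for (size_t k = 0; k < len / 2; ++k) {
                uint32_t u = a[i + k];
                uint32_t v = mulmod(a[i + k + len / 2], wn);
                uint32_t s = u + v;
                if (s >= P) s -= P;          // modular add without division
                uint32_t d = (u >= v) ? (u - v) : (u + P - v);
                a[i + k] = s;
                a[i + k + len / 2] = d;
                wn = mulmod(wn, w);
            }
        }
    }
    if (inverse) {
        uint32_t ninv = powmod((uint32_t)n, P - 2);
        for (auto& x : a) x = mulmod(x, ninv);
    }
}

// Squares a number given as little-endian base-256 digits.
// Returns little-endian base-256 digits (padded with zeros to a power of two).
std::vector<uint32_t> sqr_digits(const std::vector<uint32_t>& x) {
    if (x.empty()) return {};
    // Each convolution term is at most x.size() * 255^2 and must stay below P.
    assert(x.size() < 15000);
    size_t n = 1;
    while (n < 2 * x.size()) n <<= 1;
    std::vector<uint32_t> a(x);
    a.resize(n, 0);
    ntt(a, false);
    for (auto& v : a) v = mulmod(v, v);      // pointwise square
    ntt(a, true);
    uint64_t carry = 0;                      // carry propagation back to base 256
    for (auto& v : a) {
        carry += v;
        v = (uint32_t)(carry & 0xFF);
        carry >>= 8;
    }
    return a;
}
```

For example, `sqr_digits({0xFF, 0xFF})` squares 65535 and yields the digits `{0x01, 0x00, 0xFE, 0xFF}`, that is 0xFFFE0001. The `%` inside `mulmod` and the compare-and-subtract in the butterfly are the kind of modular operations I am asking about.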
So my question is:
Is there a way to optimize my NTT transform?
I do not mean speeding it up through parallelism (threads); this is the low-level layer only.
Is there a way to speed up my modular arithmetic?
This is my (already optimized) source code in C++ for NTT (it's complete …