Mic*_*eyn 16 c++ algorithm prime-factoring
需要考虑的是以下函数,该函数可用于(相对快速地)将64位无符号整数分解为其素因子.注意,因子分解不是概率性的(即,它是精确的).在现代硬件上,该算法已经足够快,可以在几秒钟内找到一个数字是素数或几乎没有很大的因子.
问题:可以对所提出的算法进行任何改进,同时保持它是单线程的,这样它可以更快地考虑(任意)非常大的无符号64位整数,最好不使用概率方法(例如,Miller-Rabin)确定素性?
// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;
// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
// Num has to be at least 2 to contain "prime" factors
if (num<2)
return;
ulong workingNum=num;
ulong nextOffset=2; // Will be used to skip multiples of 3, later
// Factor out factors of 2
while (workingNum%2==0)
{
factors.push_back(2);
workingNum/=2;
}
// Factor out factors of 3
while (workingNum%3==0)
{
factors.push_back(3);
workingNum/=3;
}
// If all of the factors were 2s and 3s, done...
if (workingNum==1)
return;
// sqrtNum is the (inclusive) upper bound of our search for factors
ulong sqrtNum=(ulong) sqrt(double(workingNum+0.5));
// Factor out potential factors that are greate than or equal to 5
// The variable n represents the next potential factor to be tested
for (ulong n=5;n<=sqrtNum;)
{
// Is n a factor of the current working number?
if (workingNum%n==0)
{
// n is a factor, so add it to the list of factors
factors.push_back(n);
// Divide current working number by n, to get remaining number to factor
workingNum/=n;
// Check if we've found all factors
if (workingNum==1)
return;
// Recalculate the new upper bound for remaining factors
sqrtNum=(ulong) sqrt(double(workingNum+0.5));
// Recheck if n is a factor of the new working number,
// in case workingNum contains multiple factors of n
continue;
}
// n is not or is no longer a factor, try the next odd number
// that is not a multiple of 3
n+=nextOffset;
// Adjust nextOffset to be an offset from n to the next non-multiple of 3
nextOffset=(nextOffset==2UL ? 4UL : 2UL);
}
// Current workingNum is prime, add it as a factor
factors.push_back(workingNum);
}
Run Code Online (Sandbox Code Playgroud)
谢谢
编辑:我添加了更多评论.向量通过引用传入的原因是允许向量在调用之间重用,并避免动态分配.向量未在函数中清空的原因是允许将当前"num"因子附加到向量中已有的因子的奇怪要求.
函数本身并不漂亮,可以重构,但问题是如何使算法更快.所以,请不要提出如何使函数更漂亮,可读或C++的建议.这是孩子的游戏.改进这种算法,使其能够更快地找到(已证实的)因素是困难的部分.
更新:到目前为止Potatoswatter有一些很好的解决方案,请务必查看底部附近的MMX解决方案.
Pot*_*ter 19
将这种方法与(预先生成的)筛子进行比较.模数很昂贵,因此这两种方法基本上都做了两件事:生成潜在因素,并执行模运算.两个程序都应该以比模数采用更少的周期合理地生成新的候选因子,因此任一程序都是模数约束的.
给定的方法过滤掉所有整数的恒定比例,即2和3的倍数,或75%.四分之一(给定)数字用作模运算符的参数.我称之为跳过滤镜.
另一方面,筛子仅使用素数作为模运算符的参数,并且连续素数之间的平均差异由素数定理控制为1/ln(N).例如,e ^ 20不到5亿,因此超过5亿的人有5%的机会成为素数.如果考虑所有高达2 ^ 32的数字,5%是一个很好的经验法则.
因此,div筛选器的操作时间将比跳过滤器少5倍.要考虑的下一个因素是筛子产生质数的速度,即从内存或磁盘读取它们.如果取一个素数比4 div秒快,则筛子更快.根据我的表div,我的Core2上的吞吐量最多每12个周期一个.这些将是严格的划分问题,所以让我们保守地预算每个素数50个周期.对于2.5 GHz处理器,这是20纳秒.
在20 ns内,50 MB /秒的硬盘驱动器可以读取大约一个字节.简单的解决方案是每个素数使用4个字节,因此驱动器会更慢.但是,我们可以更聪明.如果我们想按顺序编码所有素数,我们可以只编码它们的差异.同样,预期的差异是1/ln(N).而且,它们都是均匀的,这节省了额外的一点.它们永远不会为零,这使得多字节编码的扩展免费.因此,每个素数使用一个字节,最多512个差异可以存储在一个字节中,根据维基百科的文章,这可以达到303371455241 .
因此,根据硬盘驱动器,存储的素数列表在验证素数时的速度应大致相等.如果它可以存储在RAM中(它是203 MB,因此后续运行可能会达到磁盘缓存),那么问题就完全消失了,因为FSB速度通常与处理器速度相差小于FSB宽度(以字节为单位) - 即,FSB每个周期可以传输多个素数.然后改进的因素是除法运算的减少,即五次.下面的实验结果证实了这一点.
当然,那就是多线程.可以将素数或跳过过滤候选者的范围分配给不同的线程,使得任一方法都令人尴尬地平行.没有优化不涉及增加并行分频器电路的数量,除非你以某种方式消除模数.
这是一个这样的程序.这是模板化的,所以你可以添加bignums.
/*
* multibyte_sieve.cpp
* Generate a table of primes, and use it to factorize numbers.
*
* Created by David Krauss on 10/12/10.
*
*/
#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;
char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };
template< typename It >
unsigned decode_gap( It &stream ) {
unsigned gap = static_cast< unsigned char >( * stream ++ );
if ( gap ) return 2 * gap; // only this path is tested
gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
return gap + decode_gap( stream ); // shallow recursion
}
template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
unsigned len = 0, bytes[4];
gap /= 2;
do {
bytes[ len ++ ] = gap % encoding_base;
gap /= encoding_base;
} while ( gap );
while ( -- len ) { // loop not tested
* stream ++ = 0;
* stream ++ = bytes[ len + 1 ];
}
* stream ++ = bytes[ 0 ];
}
template< size_t lim >
void generate_primes() {
auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
bitset< lim / 2 > &sieve = * sieve_p;
ofstream out_f( primes_filename, ios::out | ios::binary );
ostreambuf_iterator< char > out( out_f );
size_t count = 0;
size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
for ( ; x != last; ++ x ) {
if ( sieve[ x ] ) continue;
size_t n = x * 2 + 1; // translate index to number
for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
encode_gap( out, ( x - prev ) * 2 );
prev = x;
}
for ( ; x != lim / 2; ++ x ) {
if ( sieve[ x ] ) continue;
encode_gap( out, ( x - prev ) * 2 );
prev = x;
}
cout << prev * 2 + 1 << endl;
}
template< typename I >
void factorize( I n ) {
ifstream in_f( primes_filename, ios::in | ios::binary );
if ( ! in_f ) {
cerr << "Could not open primes file.\n"
"Please generate it with 'g' command.\n";
return;
}
while ( n % 2 == 0 ) {
n /= 2;
cout << "2 ";
}
unsigned long factor = 1;
for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
factor += decode_gap( in );
while ( n % factor == 0 ) {
n /= factor;
cout << factor << " ";
}
if ( n == 1 ) goto finish;
}
cout << n;
finish:
cout << endl;
}
int main( int argc, char *argv[] ) {
if ( argc != 2 ) goto print_help;
unsigned long n;
if ( argv[1][0] == 'g' ) {
generate_primes< (1ul<< 32) >();
} else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit )
factorize( n );
} else goto print_help;
return 0;
print_help:
cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
"\t" << argv[0] << " g -- generate primes file in current directory.\n";
}
Run Code Online (Sandbox Code Playgroud)
2.2 GHz MacBook Pro的性能:
dkrauss$ time ./multibyte_sieve g
4294967291
real 2m8.845s
user 1m15.177s
sys 0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279
real 0m5.405s
user 0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real 0m25.147s
user 0m24.170s
sys 0m0.096s
Run Code Online (Sandbox Code Playgroud)
我的另一个答案是相当长的,与此完全不同,所以这里有别的东西.
这个程序不是仅过滤掉前两个素数的倍数,或者每个字节编码所有相关素数,而是过滤掉所有适合8位的质数的倍数,特别是2到211.所以不要过去33%的数字,这大约10%传递给分部运营商.
素数保存在三个SSE寄存器中,其运行计数器的模数保持在另外三个.如果具有计数器的任何素数的模数等于零,则计数器不能是素数.此外,如果任何模数等于1,则(计数器+ 2)不能是素数等,直到(计数器+30).偶数会被忽略,并且会跳过+3,+ 6和+5等偏移量.矢量处理允许一次更新16个字节大小的变量.
抛出一个厨房水槽完全微观优化(但没有更多的平台特定于内联指令),我得到了1.78倍的性能提升(24秒对13.4秒我的笔记本电脑).如果使用bignum库(即使是非常快的库),优势也更大.请参阅下面的更具可读性的预优化版本.
/*
* factorize_sse.cpp
* Filter out multiples of the first 47 primes while factorizing a number.
*
* Created by David Krauss on 10/14/10.
*
*/
#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;
inline void remove_factor( unsigned long &n, unsigned long factor ) __attribute__((always_inline));
void remove_factor( unsigned long &n, unsigned long factor ) {
while ( n % factor == 0 ) {
n /= factor;
cout << factor << " ";
}
}
int main( int argc, char *argv[] ) {
unsigned long n;
if ( argc != 2
|| ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
cerr << "Usage: " << argv[0] << " <number>\n";
return 1;
}
int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127,
131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 };
for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
remove_factor( n, * p );
}
//int histo[8] = {}, total = 0;
enum { bias = 15 - 128 };
__m128i const prime1 = _mm_set_epi8( 21, 21, 21, 22, 22, 26, 26, 17, 19, 23, 29, 31, 37, 41, 43, 47 ),
prime2 = _mm_set_epi8( 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127 ),
prime3 = _mm_set_epi8( 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 ),
vbias = _mm_set1_epi8( bias ),
v3 = _mm_set1_epi8( 3+bias ), v5 = _mm_set1_epi8( 5+bias ), v6 = _mm_set1_epi8( 6+bias ), v8 = _mm_set1_epi8( 8+bias ),
v9 = _mm_set1_epi8( 9+bias ), v11 = _mm_set1_epi8( 11+bias ), v14 = _mm_set1_epi8( 14+bias ), v15 = _mm_set1_epi8( 15+bias );
__m128i mod1 = _mm_add_epi8( _mm_set_epi8( 3, 10, 17, 5, 16, 6, 19, 8, 9, 11, 14, 15, 18, 20, 21, 23 ), vbias ),
mod2 = _mm_add_epi8( _mm_set_epi8( 26, 29, 30, 33, 35, 36, 39, 41, 44, 48, 50, 51, 53, 54, 56, 63 ), vbias ),
mod3 = _mm_add_epi8( _mm_set_epi8( 65, 68, 69, 74, 75, 78, 81, 83, 86, 89, 90, 95, 96, 98, 99, 105 ), vbias );
for ( unsigned long factor = 1, limit = sqrtl( n ); factor <= limit + 30; factor += 30 ) {
if ( n == 1 ) goto done;
// up to 2^32, distribution of number candidates produced (0 up to 7) is
// 0.010841 0.0785208 0.222928 0.31905 0.246109 0.101023 0.0200728 0.00145546
unsigned candidates[8], *cand_pen = candidates;
* cand_pen = 6;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v3 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v3 ), _mm_cmpeq_epi8( mod3, v3 ) ) ) );
* cand_pen = 10;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v5 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v5 ), _mm_cmpeq_epi8( mod3, v5 ) ) ) );
* cand_pen = 12;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v6 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v6 ), _mm_cmpeq_epi8( mod3, v6 ) ) ) );
* cand_pen = 16;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v8 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v8 ), _mm_cmpeq_epi8( mod3, v8 ) ) ) );
* cand_pen = 18;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v9 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v9 ), _mm_cmpeq_epi8( mod3, v9 ) ) ) );
* cand_pen = 22;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v11 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v11 ), _mm_cmpeq_epi8( mod3, v11 ) ) ) );
* cand_pen = 28;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v14 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v14 ), _mm_cmpeq_epi8( mod3, v14 ) ) ) );
* cand_pen = 30;
cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v15 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v15 ), _mm_cmpeq_epi8( mod3, v15 ) ) ) );
/*++ total;
++ histo[ cand_pen - candidates ];
cout << "( ";
while ( cand_pen != candidates ) cout << factor + * -- cand_pen << " ";
cout << ")" << endl; */
mod1 = _mm_sub_epi8( mod1, _mm_set1_epi8( 15 ) ); // update residuals
__m128i mask1 = _mm_cmplt_epi8( mod1, _mm_set1_epi8( 1+bias ) );
mask1 = _mm_and_si128( mask1, prime1 ); // residual goes to zero or negative?
mod1 = _mm_add_epi8( mask1, mod1 ); // combine reset into zero or negative
mod2 = _mm_sub_epi8( mod2, _mm_set1_epi8( 15 ) );
__m128i mask2 = _mm_cmplt_epi8( mod2, _mm_set1_epi8( 1+bias ) );
mask2 = _mm_and_si128( mask2, prime2 );
mod2 = _mm_add_epi8( mask2, mod2 );
mod3 = _mm_sub_epi8( mod3, _mm_set1_epi8( 15 ) );
__m128i mask3 = _mm_cmplt_epi8( mod3, _mm_set1_epi8( 1+bias ) );
mask3 = _mm_and_si128( mask3, prime3 );
mod3 = _mm_add_epi8( mask3, mod3 );
if ( cand_pen - candidates == 0 ) continue;
remove_factor( n, factor + candidates[ 0 ] );
if ( cand_pen - candidates == 1 ) continue;
remove_factor( n, factor + candidates[ 1 ] );
if ( cand_pen - candidates == 2 ) continue;
remove_factor( n, factor + candidates[ 2 ] );
if ( cand_pen - candidates == 3 ) continue;
remove_factor( n, factor + candidates[ 3 ] );
if ( cand_pen - candidates == 4 ) continue;
remove_factor( n, factor + candidates[ 4 ] );
if ( cand_pen - candidates == 5 ) continue;
remove_factor( n, factor + candidates[ 5 ] );
if ( cand_pen - candidates == 6 ) continue;
remove_factor( n, factor + candidates[ 6 ] );
}
cout << n;
done:
/*cout << endl;
for ( int hx = 0; hx < 8; ++ hx ) cout << (float) histo[hx] / total << " ";*/
cout << endl;
}
Run Code Online (Sandbox Code Playgroud)
.
dkrauss$ /usr/local/bin/g++ main.cpp -o factorize_sse -O3 --profile-use
dkrauss$ time ./factorize_sse 18446743721522234449
4294967231 4294967279
real 0m13.437s
user 0m13.393s
sys 0m0.011s
Run Code Online (Sandbox Code Playgroud)
以下是上述的初稿.包括优化
remove_factor.remove_factor对非分支数组写入的条件性,不可预测的调用.remove_factorcall 完全展开最终循环,并确保函数始终内联.
可读版本:
/*
* factorize_sse.cpp
* Filter out multiples of the first 17 primes while factorizing a number.
*
* Created by David Krauss on 10/14/10.
*
*/
#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;
int main( int argc, char *argv[] ) {
unsigned long n;
if ( argc != 2
|| ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
cerr << "Usage: " << argv[0] << " <number>\n";
return 1;
}
int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 };
for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
while ( n % * p == 0 ) {
n /= * p;
cout << * p << " ";
}
}
if ( n != 1 ) {
__m128i mod = _mm_set_epi8( 1, 2, 3, 5, 6, 8, 9, 11, 14, 15, 18, 20, 21, 23, 26, 29 );
__m128i const prime = _mm_set_epi8( 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 ),
one = _mm_set1_epi8( 1 );
for ( unsigned long factor = 1, limit = sqrtl( n ); factor < limit; ) {
factor += 2;
__m128i mask = _mm_cmpeq_epi8( mod, one ); // residual going to zero?
mod = _mm_sub_epi8( mod, one ); // update other residuals
if ( _mm_movemask_epi8( mask ) ) {
mask = _mm_and_si128( mask, prime ); // reset cycle if going to zero
mod = _mm_or_si128( mask, mod ); // combine reset into zeroed position
} else while ( n % factor == 0 ) {
n /= factor;
cout << factor << " ";
if ( n == 1 ) goto done;
}
}
cout << n;
}
done:
cout << endl;
}
Run Code Online (Sandbox Code Playgroud)
Fermat的分解方法简单快捷,只要在它变得过长而变慢之前就停止它,就可以找到成对的大质因子.然而,在我对随机数的测试中,这种情况太罕见,看不到任何改进.
...不使用概率方法(例如,Miller-Rabin)来确定素数
通过均匀分布,75%的输入将需要十亿次循环迭代,因此,即使您得到一个不确定的答案并且必须恢复到试验部门,也应该首先在不太确定的技术上花费一百万次操作.
我发现Pollard的Rho方法的布伦特变体非常好,但编码和理解更复杂.我见过的最好的例子来自这个论坛的讨论.该方法依赖于运气,但经常有助于获得成功.
Miller-Rabin素性测试实际上是确定性的,大约10 ^ 15,这可以省去无果搜索的麻烦.
我尝试了几十种变体,并根据以下因素来确定int64值:
请注意,Pollard的Rho发现不一定是素数的因子,因此可以使用递归来计算这些因素.