C#:Atkin筛选的实施

Svi*_*ish 7 c# algorithm primes sieve-of-atkin

我想知道是否有人在这里有一个很好的实施他们想要分享的阿特金筛选.

我正在尝试实现它,但不能完全包围它.这是我到目前为止所拥有的.

public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
                for (ulong k = n*n; k <= limit; k *= k)
                    isPrime[k] = false;

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }


    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();


        foreach (var p in primes)
            yield return p;
    }


    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
Run Code Online (Sandbox Code Playgroud)

我几乎只是试图"翻译" 维基百科上列出的伪代码,但它无法正常工作.所以要么我误解了某些东西,要么只是做错了什么.或者最有可能两者......

列出我用作测试的前500个素数,我的实现在40号(或41?)处失败.

指数值不同[40]
预期:179
但是:175

你是否能够找到我的错误,你是否有可以共享的实现,或者两者兼而有之?


我使用的确切测试看起来像这样:

public abstract class AtkinTests
{
    [Test]
    public void GetEnumerator_FirstFiveHundredNumbers_AreCorrect()
    {
        var sequence = new Atkin(2000000);
        var actual = sequence.Take(500).ToArray();
        var expected = First500;

        CollectionAssert.AreEqual(expected, actual);
    }

    private static readonly ulong[] First500 = new ulong[]
        {
            2, 3, 5, 7, 11, 13, 17, ...
        };
}
Run Code Online (Sandbox Code Playgroud)

ito*_*son 10

这段代码:

for (ulong k = n*n; k <= limit; k *= k)
  isPrime[k] = false;
Run Code Online (Sandbox Code Playgroud)

似乎不是这个伪代码的忠实翻译:

is_prime(k) ? false, k ? {n², 2n², 3n², ..., limit}
Run Code Online (Sandbox Code Playgroud)

您的代码看起来会运行n*n,n ^ 4,n ^ 8等,即每次平方而不是每次添加n平方.试试这个:

ulong nSquared = n * n;
for (ulong k = nSquared; k <= limit; k += nSquared)
  isPrime[k] = false;
Run Code Online (Sandbox Code Playgroud)


Gor*_*ood 5

Aaron Mugatroyd最后一个答案来自于Atkin的Sieve(SoA)的翻译Python源代码,但是它可以在几个方面得到改进,因为它错过了一些重要的优化,如下所示:

  1. 他的回答并没有使用完整的modulo 60原版Atkin和Bernstein版本的Sieve,而是稍微改进了维基百科文章中伪代码的变化,因此使用大约0.36的数值筛网范围组合切换/剔除操作; 下面我的代码使用合理有效的非页面段伪代码,根据我在Atkin筛选中的评论中的评论,该筛选使用约为数值范围的0.26倍的因子来将完成的工作量减少到大约一倍.七分之二.

  2. 他的代码通过仅具有奇数表示来减少缓冲区大小,而我的代码进一步用bit pack来消除可被3和5整除的数字的任何表示以及可被"仅有几率"暗示的可被2整除的数字; 这将内存需求减少了近一半(至8/15),并且有助于更好地利用CPU缓存,以便由于减少平均内存访问时间而进一步提高速度.

  3. 我的代码使用快速查找表(LUT)弹出计数技术计算素数的数量,与使用他使用的逐位技术的大约一秒相比,几乎没有时间计数; 但是,在此示例代码中,即使从代码的定时部分中取出的时间很短.

  4. 最后,我的代码优化了位操作操作,每个内部循环的代码最少.例如,它不使用连续的右移一个来产生奇数表示索引,实际上通过将所有内部循环写为常数模(等位位置)操作来实现小位移.同样,Aaron的翻译代码在操作中是非常低效的,例如在素数正方形自由剔除中,它将素数的平方加到索引然后检查奇数结果,而不是仅仅添加正方形的两倍,以便不需要检查; 然后,在内循环中执行剔除操作之前,通过将数字右移1(除以2)使得检查更加冗余,就像对所有循环一样.这个低效的代码赢得了' 使用这种"大筛网缓冲阵列"技术,大范围的执行时间差异很大,因为每次操作的大部分时间都用于RAM存储器访问(大约37个CPU时钟周期或更多,范围为10亿) ,但是对于适合CPU缓存的较小范围,执行时间要比它需要的慢得多; 换句话说,它为每个操作的执行速度设置了太高的最低限制.

代码如下:

//Sieve of Atkin based on full non page segmented modulo 60 implementation...

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace NonPagedSoA {
  //implements the non-paged Sieve of Atkin (full modulo 60 version)...
  class SoA : IEnumerable<ulong> {
    private ushort[] buf = null;
    private long cnt = 0;
    private long opcnt = 0;
    private static byte[] modPRMS = { 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 49, 53, 59, 61 };
    private static ushort[] modLUT;
    private static byte[] cntLUT;
    //initialize the private LUT's...
    static SoA() {
      modLUT = new ushort[60];
      for (int i = 0, m = 0; i < modLUT.Length; ++i) {
        if ((i & 1) != 0 || (i + 7) % 3 == 0 || (i + 7) % 5 == 0) modLUT[i] = 0;
        else modLUT[i] = (ushort)(1 << (m++));
      }
      cntLUT = new byte[65536];
      for (int i = 0; i < cntLUT.Length; ++i) {
        var c = 0;
        for (int j = i; j > 0; j >>= 1) c += j & 1;
        cntLUT[i] = (byte)c;
      }
    }
    //initialization and all the work producing the prime bit array done in the constructor...
    public SoA(ulong range) {
      this.opcnt = 0;
      if (range < 7) {
        if (range > 1) {
          cnt = 1;
          if (range > 2) this.cnt += (long)(range - 1) / 2;
        }
        this.buf = new ushort[0];
      }
      else {
        this.cnt = 3;
        var nrng = range - 7; var lmtw = nrng / 60;
        //initialize sufficient wheels to non-prime
        this.buf = new ushort[lmtw + 1];

        //Put in candidate primes:
        //for the 4 * x ^ 2 + y ^ 2 quadratic solution toggles - all x odd y...
        ulong n = 6; // equivalent to 13 - 7 = 6...
        for (uint x = 1, y = 3; n <= nrng; n += (x << 3) + 4, ++x, y = 1) {
          var cb = n; if (x <= 1) n -= 8; //cancel the effect of skipping the first one...
          for (uint i = 0; i < 15 && cb <= range; cb += (y << 2) + 4, y += 2, ++i) {
            var cbd = cb / 60; var cm = modLUT[cb % 60];
            if (cm != 0)
              for (uint c = (uint)cbd, my = y + 15; c < buf.Length; c += my, my += 30) {
                buf[c] ^= cm; // ++this.opcnt;
              }
          }
        }
        //for the 3 * x ^ 2 + y ^ 2 quadratic solution toggles - x odd y even...
        n = 0; // equivalent to 7 - 7 = 0...
        for (uint x = 1, y = 2; n <= nrng; n += ((x + x + x) << 2) + 12, x += 2, y = 2) {
          var cb = n;
          for (var i = 0; i < 15 && cb <= range; cb += (y << 2) + 4, y += 2, ++i) {
            var cbd = cb / 60; var cm = modLUT[cb % 60];
            if (cm != 0)
              for (uint c = (uint)cbd, my = y + 15; c < buf.Length; c += my, my += 30) {
                buf[c] ^= cm; // ++this.opcnt;
              }
          }
        }
        //for the 3 * x ^ 2 - y ^ 2 quadratic solution toggles all x and opposite y = x - 1...
        n = 4; // equivalent to 11 - 7 = 4...
        for (uint x = 2, y = x - 1; n <= nrng; n += (ulong)(x << 2) + 4, y = x, ++x) {
          var cb = n; int i = 0;
          for ( ; y > 1 && i < 15 && cb <= nrng; cb += (ulong)(y << 2) - 4, y -= 2, ++i) {
            var cbd = cb / 60; var cm = modLUT[cb % 60];
            if (cm != 0) {
              uint c = (uint)cbd, my = y;
              for ( ; my >= 30 && c < buf.Length; c += my - 15, my -= 30) {
                buf[c] ^= cm; // ++this.opcnt;
              }
              if (my > 0 && c < buf.Length) { buf[c] ^= cm; /* ++this.opcnt; */ }
            }
          }
          if (y == 1 && i < 15) {
            var cbd = cb / 60; var cm = modLUT[cb % 60];
            if ((cm & 0x4822) != 0 && cbd < (ulong)buf.Length) { buf[cbd] ^= cm; /* ++this.opcnt; */ }
          }
        }

        //Eliminate squares of base primes, only for those on the wheel:
        for (uint i = 0, w = 0, pd = 0, pn = 0, msk = 1; w < this.buf.Length ; ++i) {
          uint p = pd + modPRMS[pn];
          ulong sqr = (ulong)p * (ulong)p; //to handle ranges above UInt32.MaxValue
          if (sqr > range) break;
          if ((this.buf[w] & msk) != 0) { //found base prime, square free it...
            ulong s = sqr - 7;
            for (int j = 0; s <= nrng && j < modPRMS.Length; s = sqr * modPRMS[j] - 7, ++j) {
              var cd = s / 60; var cm = (ushort)(modLUT[s % 60] ^ 0xFFFF);
              //may need ulong loop index for ranges larger than two billion
              //but buf length only good to about 2^31 * 60 = 120 million anyway,
              //even with large array setting and half that with 32-bit...
              for (ulong c = cd; c < (ulong)this.buf.Length; c += sqr) {
                this.buf[c] &= cm; // ++this.opcnt;
              }
            }
          }
          if (msk >= 0x8000) { msk = 1; pn = 0; ++w; pd += 60; }
          else { msk <<= 1; ++pn; }
        }

        //clear any overflow primes in the excess space in the last wheel/word:
        var ndx = nrng % 60; //clear any primes beyond the range
        for (; modLUT[ndx] == 0; --ndx) ;
        this.buf[lmtw] &= (ushort)((modLUT[ndx] << 1) - 1);
      }
    }

    //uses a fast pop count Look Up Table to return the total number of primes...
    public long Count {
      get {
        long cnt = this.cnt;
        for (int i = 0; i < this.buf.Length; ++i) cnt += cntLUT[this.buf[i]];
        return cnt;
      }
    }

    //returns the number of toggle/cull operations used to sieve the prime bit array...
    public long Ops {
      get {
        return this.opcnt;
      }
    }

    //generate the enumeration of primes...
    public IEnumerator<ulong> GetEnumerator() {
      yield return 2; yield return 3; yield return 5;
      ulong pd = 0;
      for (uint i = 0, w = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
        if ((this.buf[w] & msk) != 0) //found a prime bit...
          yield return pd + modPRMS[pn]; //add it to the list
        if (msk >= 0x8000) { msk = 1; pn = 0; ++w; pd += 60; }
        else { msk <<= 1; ++pn; }
      }
    }

    //required for the above enumeration...
    IEnumerator IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
    }
  }

  class Program {
    static void Main(string[] args) {
      Console.WriteLine("This program calculates primes by a simple full version of the Sieve of Atkin.\r\n");

      const ulong n = 1000000000;

      var elpsd = -DateTime.Now.Ticks;

      var gen = new SoA(n);

      elpsd += DateTime.Now.Ticks;

      Console.WriteLine("{0} primes found to {1} using {2} operations in {3} milliseconds.", gen.Count, n, gen.Ops, elpsd / 10000);

      //Output prime list for testing...
      //Console.WriteLine();
      //foreach (var p in gen) {
      //  Console.Write(p + " ");
      //}
      //Console.WriteLine();

//Test options showing what one can do with the enumeration, although more slowly...
//      Console.WriteLine("\r\nThere are {0} primes with the last one {1} and the sum {2}.",gen.Count(),gen.Last(),gen.Sum(x => (long)x));

      Console.Write("\r\nPress any key to exit:");
      Console.ReadKey(true);
      Console.WriteLine();
    }
  }
}
Run Code Online (Sandbox Code Playgroud)

此代码的运行速度大约是Aaron代码的两倍(在i7-2700K(3.5 GHz)上使用64位或32位模式约为2.7秒,缓冲区大约为16.5兆字节,大约为25.58亿次组合切换/素数无方块剔除操作(可以通过取消注释"++ this.opcnt"语句来显示),筛选范围为10亿,相比之下,5.4/6.2秒(32位/ 64位)的代码没有计数时间和几乎两次内存使用,使用大约0.359亿次联合切换/剔除操作,筛分高达10亿.

Although it is faster than his most optimized naive odds-only implementation of the non-paged Sieve of Eratosthenes (SoE), that does not make the Sieve of Atkin faster than the Sieve of Eratosthenes, as if one applies similar techniques as used in the above SoA implementation to the SoE plus uses maximal wheel factorization, the SoE will about the same speed as this.

分析: 尽管完全优化的SoE的操作数量与筛选范围为10亿的SoA的操作数量大致相同,但是这些非分页实现的主要瓶颈是一旦筛选缓冲区大小超过内存访问CPU高速缓存大小(在一个时钟周期访问时为32千字节L1高速缓存,在大约4个时钟周期访问时间为256千字节L2高速缓存,在我的i7大约20个时钟周期访问时间时为8兆字节L3高速缓存),之后内存访问可能超过百个时钟周期.

现在,当人们将算法调整到页面分割时,两者都有大约8倍的内存访问速度提升因子,因此可以筛选出无法满足可用内存的范围.然而,由于剔除扫描的巨大进步迅速增长到数百倍,由于难以实现算法的"素数无方块"部分,因此筛选范围开始变得非常大,因此SoE继续超过SoA.页面缓冲区的大小.同样,也许更严重的是,为每个页面缓冲区的最低表示形式计算'x'的每个值的新起点以及另一个相当的'y'的值,它将获得非常大的内存和/或计算密集度.随着范围的扩大,分页SoA的效率大幅下降,与SoE相比.

EDIT_ADD: The odds-only SoE as used by Aaron Murgatroyd uses about 1.026 billion cull operations for a sieve range of one billion so about four times as many operations as the SoA and thus should run about four times slower, but the SoA even as implemented here has a more complex inner loop and especially due to a much higher proportion of the odds-only SoE culls have a much shorter stride in the culling scans than the strides of the SoA the naive odds-only SoE has much better average memory access times in spite of the sieve buffer greatly exceeding the CPU cache sizes (better use of cache associativity). This explains why the above SoA is only about twice as fast as the odds-only SoE even though it would theoretically seem to be doing only one quarter of the work.

如果使用与上述SoA相同的使用恒模数内环的类似算法并实施相同的2/3/5车轮分解,则SoE会将剔除操作的数量减少到大约0.405亿次操作,因此仅增加约50%操作比SoA,理论上运行速度稍微慢于SoA,但可能以大约相同的速度运行,因为对于这种"天真"的大内存缓冲区使用,平均步幅仍然比SoA平均小一些.将车轮因子分解增加到2/3/5/7车轮意味着对于10亿的剔除范围,SoE剔除操作减少到大约0.314,并且可能使该版本的SoE以相同的速度运行.

进一步使用车轮分解可以通过预先剔除2/3/5/7/11/13/17/19素因子的筛子阵列(模式复制),几乎不花费执行时间来减少筛选范围为10亿的剔除操作总数约为2.51亿,而且即使对于这些大型内存缓冲版本,SoE也会比SoA的运行速度更快或速度更快,SoE仍然有很多比上面更少的代码复杂性.

因此,可以看出,SoE的操作次数可以从幼稚或甚至仅赔率或2/3/5轮分解版本大大减少,使得操作的数量与SoA的操作数量大致相同.同时,由于内部循环不太复杂和内存访问效率更高,每个操作的时间实际上可能更少. END_EDIT_ADD

EDIT_ADD2: 我在这里使用类似的常量模/位位置技术为最内层循环添加代码,如上面的SoA,根据上面链接的答案下面的伪代码.尽管应用了高轮分解和预剔除,但是剔除操作的总数实际上小于SoA的组合切换/剔除操作直到筛分范围,因此代码比上述SoA复杂得多.大约20亿.代码如下:

EDIT_FINAL更正了以下代码以及与之相关的评论END_EDIT_FINAL

//Sieve of Eratosthenes based on maximum wheel factorization and pre-culling implementation...

using System;
using System.Collections;
using System.Collections.Generic;

namespace NonPagedSoE {
  //implements the non-paged Sieve of Eratosthenes (full modulo 210 version with preculling)...
  class SoE : IEnumerable<ulong> {
    private ushort[] buf = null;
    private long cnt = 0;
    private long opcnt = 0;
    private static byte[] basePRMS = { 2, 3, 5, 7, 11, 13, 17, 19 };
    private static byte[] modPRMS = { 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, //positions + 23
                                      97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157, 163,
                                      167, 169, 173, 179, 181 ,187 ,191 ,193, 197, 199, 209, 211, 221, 223, 227, 229 };
    private static byte[] gapsPRMS = { 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2, 6, 4, 6, 8,
                                       4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6, 2, 6, 6, 4,
                                       2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10, 2, 4, 2, 4 };
    private static ulong[] modLUT;
    private static byte[] cntLUT;
    //initialize the private LUT's...
    static SoE() {
      modLUT = new ulong[210];
      for (int i = 0, m = 0; i < modLUT.Length; ++i) {
        if ((i & 1) != 0 || (i + 23) % 3 == 0 || (i + 23) % 5 == 0 || (i + 23) % 7 == 0) modLUT[i] = 0;
        else modLUT[i] = 1UL << (m++);
      }
      cntLUT = new byte[65536];
      for (int i = 0; i < cntLUT.Length; ++i) {
        var c = 0;
        for (int j = i ^ 0xFFFF; j > 0; j >>= 1) c += j & 1; //reverse logic; 0 is prime; 1 is composite
        cntLUT[i] = (byte)c;
      }
    }
    //initialization and all the work producing the prime bit array done in the constructor...
    public SoE(ulong range) {
      this.opcnt = 0;
      if (range < 23) {
        if (range > 1) {
          for (int i = 0; i < modPRMS.Length; ++i) if (modPRMS[i] <= range) this.cnt++; else break;
        }
        this.buf = new ushort[0];
      }
      else {
        this.cnt = 8;
        var nrng = range - 23; var lmtw = nrng / 210; var lmtwt3 = lmtw * 3; 
        //initialize sufficient wheels to prime
        this.buf = new ushort[lmtwt3 + 3]; //initial state of all zero's is all potential prime.

        //initialize array to account for preculling the primes of 11, 13, 17, and 19;
        //(2, 3, 5, and 7 already eliminated by the bit packing to residues).
        for (int pn = modPRMS.Length - 4; pn < modPRMS.Length; ++pn) {
          uint p = modPRMS[pn] - 210u; ulong pt3 = p * 3;
          ulong s = p * p - 23;
          ulong xrng = Math.Min(9699709, nrng); // only do for the repeating master pattern size
          ulong nwrds = (ulong)Math.Min(138567, this.buf.Length);
          for (int j = 0; s <= xrng && j < modPRMS.Length; s += p * gapsPRMS[(pn + j++) % 48]) {
            var sm = modLUT[s % 210];
            var si = (sm < (1UL << 16)) ? 0UL : ((sm < (1UL << 32)) ? 1UL : 2UL);
            var cd = s / 210 * 3 + si; var cm = (ushort)(sm >> (int)(si << 4));
            for (ulong c = cd; c < nwrds; c += pt3) { //tight culling loop for size of master pattern
              this.buf[c] |= cm; // ++this.opcnt; //reverse logic; mark composites with ones.
            }
          }
        }
        //Now copy the master pattern so it repeats across the main buffer, allow for overflow...
        for (long i = 138567; i < this.buf.Length; i += 138567)
          if (i + 138567 <= this.buf.Length)
            Array.Copy(this.buf, 0, this.buf, i, 138567);
          else Array.Copy(this.buf, 0, this.buf, i, this.buf.Length - i);

        //Eliminate all composites which are factors of base primes, only for those on the wheel:
        for (uint i = 0, w = 0, wi = 0, pd = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
          uint p = pd + modPRMS[pn];
          ulong sqr = (ulong)p * (ulong)p;
          if (sqr > range) break;
          if ((this.buf[w] & msk) == 0) { //found base prime, mark its composites...
            ulong s = sqr - 23; ulong pt3 = p * 3;
            for (int j = 0; s <= nrng && j < modPRMS.Length; s += p * gapsPRMS[(pn + j++) % 48]) {
              var sm = modLUT[s % 210];
              var si = (sm < (1UL << 16)) ? 0UL : ((sm < (1UL << 32)) ? 1UL : 2UL);
              var cd = s / 210 * 3 + si; var cm = (ushort)(sm >> (int)(si << 4));
              for (ulong c = cd; c < (ulong)this.buf.Length; c += pt3) { //tight culling loop
                this.buf[c] |= cm; // ++this.opcnt; //reverse logic; mark composites with ones.
              }
            }
          }
          ++pn;
          if (msk >= 0x8000) { msk = 1; ++w; ++wi; if (wi == 3) { wi = 0; pn = 0; pd += 210; } }
          else msk <<= 1;
        }

        //clear any overflow primes in the excess space in the last wheel/word:
        var ndx = nrng % 210; //clear any primes beyond the range
        for (; modLUT[ndx] == 0; --ndx) ;
        var cmsk = (~(modLUT[ndx] - 1)) << 1; //force all bits above to be composite ones.
        this.buf[lmtwt3++] |= (ushort)cmsk;
        this.buf[lmtwt3++] |= (ushort)(cmsk >> 16);
        this.buf[lmtwt3] |= (ushort)(cmsk >> 32);
      }
    }

    //uses a fast pop count Look Up Table to return the total number of primes...
    public long Count {
      get {
        long cnt = this.cnt;
        for (int i = 0; i < this.buf.Length; ++i) cnt += cntLUT[this.buf[i]];
        return cnt;
      }
    }

    //returns the number of cull operations used to sieve the prime bit array...
    public long Ops {
      get {
        return this.opcnt;
      }
    }

    //generate the enumeration of primes...
    public IEnumerator<ulong> GetEnumerator() {
      yield return 2; yield return 3; yield return 5; yield return 7;
      yield return 11; yield return 13; yield return 17; yield return 19;
      ulong pd = 0;
      for (uint i = 0, w = 0, wi = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
        if ((this.buf[w] & msk) == 0) //found a prime bit...
          yield return pd + modPRMS[pn];
        ++pn;
        if (msk >= 0x8000) { msk = 1; ++w; ++wi; if (wi == 3) { wi = 0; pn = 0; pd += 210; } }
        else msk <<= 1;
      }
    }

    //required for the above enumeration...
    IEnumerator IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
    }
  }

  class Program {
    static void Main(string[] args) {
      Console.WriteLine("This program calculates primes by a simple maximually wheel factorized version of the Sieve of Eratosthenes.\r\n");

      const ulong n = 1000000000;

      var elpsd = -DateTime.Now.Ticks;

      var gen = new SoE(n);

      elpsd += DateTime.Now.Ticks;

      Console.WriteLine("{0} primes found to {1} using {2} operations in {3} milliseconds.", gen.Count, n, gen.Ops, elpsd / 10000);

//      Console.WriteLine();
//      foreach (var p in gen) {
//        Console.Write(p + " ");
//      }
//      Console.WriteLine();

      //      Console.WriteLine("\r\nThere are {0} primes with the last one {1} and the sum {2}.",gen.Count(),gen.Last(),gen.Sum(x => (long)x));

      Console.Write("\r\nPress any key to exit:");
      Console.ReadKey(true);
      Console.WriteLine();
    }
  }
}
Run Code Online (Sandbox Code Playgroud)

这个代码实际运行速度比上面的SoA快了几个百分点,因为操作稍微少一点,这个大型数组大小的主要瓶颈是10亿个范围内的存储器访问时间,大概是40到100个CPU时钟周期取决于CPU和内存规格; 这意味着代码优化(除了减少操作总数之外)是无效的,因为大部分时间花在等待内存访问上.无论如何,使用巨大的内存缓冲区并不是筛选大范围的最有效方法,使用具有相同最大轮分解的页面分割的SoE最多可提高8倍(这也为此铺平了道路多处理).

正是在实现页面分割和多处理时,与SoE相比,SoA实际上不足以超过40亿的范围,因为由于SoA的渐近复杂度降低而导致的任何增益都会迅速被与页面处理相关的页面处理开销因素所吞噬.素数广场免费处理和计算更大数量的页面起始地址; 或者,人们通过将标记存储在RAM存储器中以大量的存储器消耗和访问这些标记存储结构的低效率来克服这一点. END_EDIT_ADD2

简而言之,与全轮分解SoE相比,SoA实际上并不是一个实用的筛子,因为正如渐近复杂度的增益开始使性能接近完全优化的SoE,它开始失去效率,因为关于相对存储器访问时间和页面分割复杂性以及通常更复杂和难以编写的实际实现的细节.在我看来,与SoE相比,它更像是一个有趣的智力概念和心理练习,而不是一个实用的筛子.

Some day I will adapt these techniques to a multi-threaded page segmented Sieve of Eratosthenes to be about as fast in C# as Atkin and Bernstein's "primegen" implementation of the SoA in 'C' and will blow it out of the water for large ranges above about four billion even single threaded, with an extra boost in speed of up to about four when multi-threading on my i7 (eight cores including Hyper Threading).