Prime generator

For fun (and no particular better reason), I have a long tradition of writing programs that generate prime numbers: all primes up to some limit.

Here, I do it in C++ and try to squeeze performance out of what I do.

Some background on this is on my home page.

What is a prime number?

A positive integer is prime if it has exactly two positive divisors.

  • Any number n is divisible by 1 and by n itself.

  • If any other divisors exist on top of these, the number isn't prime.

  • We need to have two different divisors. This rules out 1, which has only one positive divisor, namely, itself. So 1 isn't prime.

First program

That definition straightforwardly leads to a first stupid_prime program:

#include <functional>  // for std::function

// ui is unsigned long or unsigned long long
template<typename ui> void driven(ui max_prime,
                                  std::function<void(ui)> prime_receiver) {
  for(ui maybe_prime = 2; maybe_prime <= max_prime; ++maybe_prime) {
    bool is_prime = true;
    for(ui divisor = 2; is_prime && divisor < maybe_prime; ++divisor)
      is_prime = 0 != maybe_prime % divisor;
    if(is_prime)
      prime_receiver(maybe_prime);
  }
}
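
For illustration only, a minimal caller could look like this; main and the printing lambda are my additions here, not part of the generator or the driver:

#include <cstdio>

int main() {
  // Print all primes up to 50, one per line, in the same format as below.
  driven<unsigned long>(50, [](unsigned long p) { std::printf("%lu\n", p); });
}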

This, of course, works:

In [1]:
import primes_driver as pd
pd.run("stupid_prime", "50")
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47

But this is a slow algorithm. While it features a small (constant) memory footprint, its runtime is quadratic.

It's not worth spending much time on this algorithm, but for the record, here is a (crude) performance estimate:

In [2]:
stupid_prime_data = pd.report("stupid_prime");
Result from 3 runs on one particular (consumer class) laptop:
The memory usage is approximately 7.31e+04 kB.
The time is approximately 3.98e-10 * N**2 + 0.04.
In [3]:
pd.runtime_plot(stupid_prime_data)

How the performance estimates are done

A prime number generation algorithm is run with increasing max_prime values: 1e4, 3e4, 1e5, 3e5, 1e6, 3e6 and so on. Resource usage of each run is recorded and collected.

The runs terminate when their combined (wall-clock) runtime exceeds a fixed amount of time, presently a mere 30 s. Afterwards, a few "usual suspects" (candidate formulas) are fitted to that resource usage data, in particular, to the total time used (user + system) and the memory consumption. The best fit is chosen and displayed as a formula. The max_prime variable is abbreviated as N in those formulas.

So runtime performance is not properly analyzed, only measured. The resulting formulas should be taken with a decent lump of salt.

Tests

I'm usually a fan of automated testing. Here, I restrict myself to doing end-result testing only.

The prime numbers are generated in a straightforward "one prime number per line" format (regular expression (\d+\n)*). So, when driven with the same max_prime input parameter, different prime number generators should all generate identical outputs.

This (alone) is checked: A SHA256 checksum of each entire output stream is taken and compared with previous results.

Test divisors to the root

Back to prime number generation. We want it to become faster.

A first observation: If a * b = c (all assumed positive), then a and b cannot both exceed sqrt(c).

A second observation: If one divides c by an increasing sequence of numbers, e.g., a = 1, 2, 3, ..., then a surpasses sqrt(c) exactly at the point where c / a becomes smaller than a.

Based on these observations, a faster algorithm is this:

template<typename ui> void gen_primes(ui max_prime,
                                  std::function<void(ui)> prime_receiver) {
  for(ui maybe_prime = 2; maybe_prime <= max_prime; ++maybe_prime) {
    bool is_prime = true;
    for(ui divisor = 2; is_prime; ++divisor) {

      // Division and remainder are really the same calculation.
      // We hope the compiler sees that, too, and optimizes it:
      ui q = maybe_prime / divisor;
      ui rem = maybe_prime % divisor;

      if (q < divisor) {
        // The divisor is already above the square root of maybe_prime,
        // so we are done:
        break;
      }
      is_prime = 0 != rem;
    }

    if(is_prime)
      prime_receiver(maybe_prime);
  }
}

In passing, I would have liked to use something like std::div instead of writing the above comment, but it is provided for signed integral types only.

In [4]:
div_by_all_upto_root_data = pd.report("div_by_all_upto_root")
Result from 7 runs on one particular (consumer class) laptop:
The memory usage is approximately 9.31e+04 kB.
The time is approximately 7.71e-09 * N**1.5/log(N) + 0.0305.

Runtime is down from O(N**2) to O(N**1.5 / log(N)). However, this needs to be taken with rather a grain of salt, as this is not the result of a thorough analysis of the algorithm, but only of data fitting on a handful of measurements taken on my laptop.

That said, here is a double-log plot of the time the algorithm actually took, vs. the time predicted by the approximation formula:

In [5]:
pd.runtime_plot(div_by_all_upto_root_data)

Wheel

How to speed up further?

The next obvious remark: We do not need to test all numbers. Just testing the odd ones is enough. Beyond 2, even numbers cannot be prime.

This reasoning can be extended straightforwardly. Instead of multiples of 2, disregard multiples of the first few primes.

To give an example, let us assume we want to generate a stream of those numbers not divisible by any of the initial primes 2, 3, and 5.

The pattern of these numbers repeats, with a period of 2 * 3 * 5 = 30, in this sense: One can decide whether a number n is divisible by any of those three if one knows n % (2 * 3 * 5) = n % 30.

For actual computational purposes, one can store the numbers <= 30 not divisible by 2, 3, and 5, and obtain any larger such numbers by adding appropriate multiples of 30.

This is called the "wheel" trick. Each addition of 30 is thought to start a new revolution of the wheel.

The gist of my implementation: It stores the low non-divisibles (e.g., {1, 7, 11, 13, 17, 19, 23, 29}) in some std::vector<ui> data_, and the period (e.g., 30), in ui period_.
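
Purely as an illustration (the helper below and its exact construction are my own sketch, not taken from the actual code), these members could be filled like this:

#include <vector>

// Sketch only: build the wheel data for a given list of initial primes.
// period_ becomes the product of those primes, data_ the numbers in
// [1, period_] that none of them divides.
template <typename ui>
void build_wheel(const std::vector<ui>& initial_primes,
                 std::vector<ui>& data_, ui& period_) {
  period_ = 1;
  for (ui p : initial_primes)
    period_ *= p;

  data_.clear();
  for (ui r = 1; r <= period_; ++r) {
    bool keep = true;
    for (ui p : initial_primes)
      if (r % p == 0) { keep = false; break; }
    if (keep)
      data_.push_back(r);  // {1, 7, 11, 13, 17, 19, 23, 29} for {2, 3, 5}
  }
}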

To consume those numbers, a std::function<bool(ui)> receiver is provided that knows what to do with them. Its return value has the semantics "keep going", so false is a signal for stopping the wheel.

The actual original implementation had some additional sophistication, but the gist of the wheel is code like this:

for(ui base{0}; true; base += period_) {
  for(ui offset: data_) {
    ui next = base + offset;

    if(upto < next)
      return;

    if( ! receiver(next))
      return;
  }
}

(I later rewrote the wheel to iterators, so this code is gone.)

Such a wheel can be used at two levels: For the generator of maybe_prime candidates, and for the list of divisors used to test maybe_prime.

This results in the wheel_generator prime number generation algorithm.

Here is an example using the initial primes list {2, 3, 5, 7, 11, 13}:

In [6]:
wheel_data = pd.report("wheel_generator")
Result from 8 runs on one particular (consumer class) laptop:
The memory usage is approximately 9.72e+04 kB.
The time is approximately 2.06e-09 * N**1.5/log(N) + 0.0353.
In [7]:
pd.runtime_plot(wheel_data)

How big to make the wheel?

The wheel data size increases (p - 1)-fold when a new prime p is added to the wheel logic. For example, extending the {2, 3, 5} wheel (period 30, 8 stored numbers) by 7 yields a period of 210 with 8 * 6 = 48 stored numbers.

The larger the wheel gets, the more RAM is needed and the less effective the CPU caches are. For this (and other) reasons, it is not wise to grow the wheel without limit.

To experiment, I introduced a configuration environment variable WHEEL_INITIAL_PRIMES_UPTO. On my laptop, increasing this from 19 to 23 already increases the runtime.

Going one step further, using initial primes up to 29, leads to "catastrophic failure": The wheel would then hold 1 * 2 * 4 * 6 * 10 * 12 * 16 * 18 * 22 * 28, i.e., about 1.0e9 stored numbers, or roughly 8 GB at 8 bytes each. The machine starts to swap heavily, becomes next to unusable, and prime generation performance drops to dismal levels. (Lesson learned: Remove swap space. The machine then remains usable, and the prime generator simply crashes fast and hard.)

Here is a parameter study (maximum_mem_size in kB and time in seconds):

In [8]:
pd.wheel_generation_parameter_study(max_prime = 10000000)
Out[8]:
   algo           maximum_mem_size       time
0  div_upto_root          108600.0  15.156600
1  wheel upto 2           108600.0   5.790098
2  wheel upto 3           108600.0   5.772750
3  wheel upto 5           108600.0   5.783847
4  wheel upto 7           108600.0   5.026148
5  wheel upto 11          108600.0   4.493480
6  wheel upto 13          108600.0   4.201395
7  wheel upto 17          108600.0   3.999258
8  wheel upto 19          108600.0   3.709438
9  wheel upto 23          542948.0   4.162888

Serious sieving

Now we have wheels. How to filter further? Two ideas bring further improvements:

  • We only need to filter multiples of primes.

  • We do not need to filter stuff that has already been filtered by the wheel.

To this end, a sieve class is designed. We want one object of this class for each prime p below the root of max_prime that wasn't already used in wheel construction.

The sieve should filter out all n * p, where n is any number produced by the wheel.

(A multiple n * p is produced by the wheel if and only if n * p is not a multiple of any of the wheel's initial primes. As p is a prime and not among those initial primes, this is the case exactly if n itself is not a multiple of any of the initial primes, which in turn is the case exactly if n is produced by the wheel.)

We also need an iterator that produces all numbers generated by the wheel, a wheel_iterator. This can be coded easily.
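
For instance, a minimal sketch (not the actual implementation), assuming the data_/period_ representation from above:

#include <cstddef>
#include <vector>

// Sketch only: an endless iterator over base + offset, walking through the
// wheel's stored numbers and bumping the base by one period per revolution.
template <typename ui> class wheel_iterator {
  const std::vector<ui>* data_;  // wheel numbers, e.g. {1, 7, 11, 13, ...}
  ui period_;                    // e.g. 30
  ui base_;                      // current multiple of the period
  std::size_t index_;            // position within *data_

public:
  wheel_iterator(const std::vector<ui>& data, ui period)
      : data_(&data), period_(period), base_(0), index_(0) {}

  ui operator*() const { return base_ + (*data_)[index_]; }

  wheel_iterator& operator++() {
    if (++index_ == data_->size()) {  // one revolution completed
      index_ = 0;
      base_ += period_;
    }
    return *this;
  }
};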

Here is a sieve implementation. For the intended algorithm, it needs to report which number it wants to have removed next.

template <typename ui, class InputIt> class sieve {
private:
  const ui p_;       // the prime this sieve is responsible for
  InputIt base_;     // iterator over the wheel's numbers n
  ui next_removal_;  // the next multiple n * p to be filtered out

public:
  explicit sieve(InputIt base, ui p) : p_(p), base_(base) {
    next_removal_ = *base_ * p_;
  }

  ui next_removal() const {return next_removal_;}

  sieve<ui, InputIt>& operator++() {
    ++base_;
    next_removal_ = *base_ * p_;
    return *this;
  }
};

To mix what these various sieves have to say, a priority queue can be used conveniently:

typedef sieve<ui, wheel_iterator<ui>> sv;

// Sort by which sieve filters next:
struct lowest_first {
  bool operator()(const sv* lhs, const sv* rhs) {
    return lhs->next_removal() > rhs->next_removal();
  }
};
std::priority_queue<sv*, std::vector<sv*>, lowest_first> queue;

This queue gets filled with the sieves. For the utmost in performance, plain raw pointers are used. (There is a std::vector<std::unique_ptr<sv>> container in the same scope which will take care of sieve cleanup.)
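
Filling could look roughly like this. This is a sketch only: the owning vector, the name primes_below_root, and the wheel arguments are assumptions, and std::make_unique needs <memory>:

std::vector<std::unique_ptr<sv>> sieves;  // owns the sieves

// One sieve per prime p below sqrt(max_prime) that is not a wheel prime;
// how that list of primes is obtained is left out here.
for (ui p : primes_below_root) {
  sieves.push_back(
      std::make_unique<sv>(wheel_iterator<ui>(wheel_data, wheel_period), p));
  queue.push(sieves.back().get());  // the queue only sees raw pointers
}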

The basic algorithm step involves incrementing the iterator at the top of the queue, so that its next_removal() increases, and then reshuffling the queue. As I could not find a "replace top with new value" method, I implemented this functionality as a pop followed by a push. (There goes "utmost in performance". However, a planned future algorithm will make this inconsequential.)

auto step_queue = [&queue]() -> ui {
  auto active_sieve = queue.top();
  queue.pop();
  ++(*active_sieve);
  queue.push(active_sieve);
  return queue.top()->next_removal();
};

With that, the inner loop of this prime generation algorithm becomes:

ui next_to_filter = queue.top()->next_removal();
for(ui maybe_prime: wheel.range(2, max_prime)) {
  if(maybe_prime < next_to_filter) {
    prime_receiver(maybe_prime);        
  } else {
    while(next_to_filter < maybe_prime)
      next_to_filter = step_queue();

    if(next_to_filter == maybe_prime)
      do
        next_to_filter = step_queue();
      while(next_to_filter == maybe_prime);
    else
      prime_receiver(maybe_prime);

  }
}

Inspired by our previous measurements, initial primes up to 19 are used in the wheel.

Here is the result:

In [9]:
import os
os.environ["WHEEL_INITIAL_PRIMES_UPTO"] = "19"
sieve_data = pd.report("sieve_generator")
Result from 10 runs on one particular (consumer class) laptop:
The memory usage is approximately 1.09e+05 kB.
The time is approximately 1.33e-09 * N*log(N) + 0.0215.
In [10]:
pd.runtime_plot(sieve_data)

So, we apparently have improved performance from O(N**1.5/log(N)) to O(N*log(N)).

Our performance measurement (which squeezes as many runs as possible into 30 seconds) has, for the first time, allowed an algorithm to generate all primes up to 3e8.

(To be continued.)