Wow, I'd forgotten that feeling when a small code change makes for a huge speedup.
I'm implementing big-number arithmetic again and am at multiplication. Aiming merely to see it work, I started with the naive gradeschool algorithm: scan left to right across the multiplicand's digits for each digit of the multiplier. Each pass collects its partial product in a freshly initialized large integer, which then gets shifted to the correct column and added to an accumulator.
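For concreteness, here is a minimal sketch of that naive approach, under my own assumptions (a small fixed digit count, little-endian arrays of 32-bit digits, and a hypothetical `bn_mul_naive` helper; none of this is the post's original code). Note the per-pass temporary `partial` that must be built, implicitly shifted by `i` digits, and then added in, which is exactly the overhead the improved version below removes:

```c
#include <stdint.h>
#include <string.h>

#define BN_DWORD_LEN 4  /* illustrative size, an assumption for this sketch */

/* Naive gradeschool multiply: for each multiplier digit b[i], build the
   shifted partial product a * b[i] in a temporary, then add it into accum.
   Digits are little-endian 32-bit words; accum holds 2*BN_DWORD_LEN digits. */
static void bn_mul_naive(const uint32_t a[BN_DWORD_LEN],
                         const uint32_t b[BN_DWORD_LEN],
                         uint32_t accum[2 * BN_DWORD_LEN]) {
    memset(accum, 0, 2 * BN_DWORD_LEN * sizeof accum[0]);
    for (size_t i = 0; i < BN_DWORD_LEN; ++i) {
        /* partial = a * b[i], written starting at column i (the "shift") */
        uint32_t partial[2 * BN_DWORD_LEN] = {0};
        uint32_t carry = 0;
        for (size_t j = 0; j < BN_DWORD_LEN; ++j) {
            uint64_t t = (uint64_t)a[j] * b[i] + carry;
            partial[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        partial[i + BN_DWORD_LEN] = carry;
        /* add the shifted partial product into the accumulator */
        carry = 0;
        for (size_t k = 0; k < 2 * BN_DWORD_LEN; ++k) {
            uint64_t t = (uint64_t)accum[k] + partial[k] + carry;
            accum[k] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
    }
}
```

Each outer pass here touches a full double-width temporary plus a full double-width add, which is the extra work the fused version avoids.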
But Knuth's The Art of Computer Programming offers a huge speedup. I have difficulty following the presentation in that book; keeping track of its non-descriptive single-letter variable names and subscripts takes more working memory than I have. Knuth's algorithm turns out to be the same gradeschool algorithm, but it cleverly accumulates the sum of the intermediate products directly in one array of digits. That eliminated the allocation and shift of an intermediate big number and sped things up by over 3x!
/* accum must hold 2*BN_DWORD_LEN digits and start zeroed */
for (size_t i = 0; i < BN_DWORD_LEN; ++i) {
    uint32_t carry = 0;
    size_t j;
    for (j = 0; j < BN_DWORD_LEN; ++j) {
        /* 32x32 -> 64-bit multiply; adding accum[i+j] and carry cannot overflow 64 bits */
        uint64_t tmp = (uint64_t)a->digits[j] * b->digits[i] + accum[i + j] + carry;
        carry = (uint32_t)(tmp >> 32);
        accum[i + j] = (uint32_t)tmp;
    }
    accum[i + j] = carry;  /* j == BN_DWORD_LEN here: top digit of this pass */
}
The exact book is The Art of Computer Programming, 3rd Edition, Volume 2: Seminumerical Algorithms, 4.3.1b.