From a user's perspective, not much has changed between versions 0.1 and 0.2. The API is unchanged, and unless there is a bug, the output is the same too. This of course raises the question of what actually changed between the two versions.
The short answer is "performance".
Of course that's not the only change. Apart from the performance work, the code quality and the test coverage improved greatly as well. The performance changes also led to the introduction of feature flags that can be enabled to improve performance at the cost of increased binary size.
This blog is about the biggest change with respect to performance, namely the rewrite of the base conversion code. To understand why it was rewritten, you need to know what version 0.1 did in order to convert between different bases.
But first things first. The base conversion is the part of PasswordMaker Pro (and therefore also of passwordmaker-rs) that maps the hash value generated from the user's input onto the list of characters that the user allows in the generated password. PasswordMaker Pro chose to do this in a mathematically correct way, meaning that the most significant digit of the converted number determines the first character of the generated password (or password part, if the mapped hash is shorter than the desired password length). This has two consequences: the conversion has to run to completion before even the first password character is known, and all digits of the converted number have to be stored, including those that are never used.
As a reminder, the usual method to convert a number from base A to base B, using arithmetic in base A, is to repeatedly divide it by base B, noting the remainders, until the quotient is zero. Reading the remainders in reverse order yields the desired number. As an example, let's convert the number 12345 to hexadecimal.
$$\begin{array}{ccccc}\frac{12345}{16}& =& 771& +& \frac{9}{16}\\ \frac{771}{16}& =& 48& +& \frac{3}{16}\\ \frac{48}{16}& =& 3& +& \frac{0}{16}\\ \frac{3}{16}& =& 0& +& \frac{3}{16}\end{array}$$The remainders, read from bottom to top, are 0x3039. However, if we had only cared about the first two digits of that number, we would still have had to complete the whole conversion. In the case of cryptographic hashes the numbers are much larger, of course; the smallest hashes supported by PasswordMaker Pro are 16 bytes long.
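To make the problem concrete, here is a minimal sketch of that classic remainder-based conversion in Rust. The function name is made up for illustration, and `u128` stands in for the larger hash types; the point is that the digits come out least significant first, so the whole conversion has to finish (and every digit has to be stored) before the first password character is known:

```rust
// Classic base conversion by repeated division. Digits are produced
// least significant first, so they must all be computed and stored,
// then reversed, before the most significant digit is available.
fn to_digits_lsd_first(mut n: u128, base: u128) -> Vec<u128> {
    let mut digits = Vec::new();
    loop {
        digits.push(n % base); // note the remainder
        n /= base;             // continue with the quotient
        if n == 0 {
            break;
        }
    }
    digits.reverse(); // most significant digit first
    digits
}

fn main() {
    // 12345 in hexadecimal is 0x3039.
    assert_eq!(to_digits_lsd_first(12345, 16), vec![3, 0, 3, 9]);
    println!("{:?}", to_digits_lsd_first(12345, 16));
}
```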
By default, PasswordMaker Pro uses 94 characters for password generation. With a 16-byte hash this yields 20 characters per password-part. Typical user passwords are probably about half that length. For larger hashes an even lower fraction of the generated digits is used: 20 bytes yield 25 characters, and 32 bytes yield 40 per password-part. In addition, the converted number's digits need to be stored, and in the worst case this can mean storing 256 digits. Since the digits of the base conversion are used as indices into a user-supplied array, they can, depending on user input, go up to <usize>::MAX. In other words, the 256-digit case would take 2 kibibytes of memory, which is a significant portion of the L1 cache even on modern CPUs.
While this obviously is not a big issue for typical uses of passwordmaker-rs, it still annoyed me that the library does work that is then thrown away. So I decided to try to optimize this code. This left me with the most important question though: how? Knowing how it should not work makes the goals for an improvement rather obvious: the digits should be generated starting from the most significant one, the conversion should stop as soon as enough digits have been generated, and digits that are never used should not be computed or stored at all.
The idea for the alternative algorithm comes from the inverse operation of the base conversion presented above. To convert from base A to base B using base B arithmetic, one starts at the most significant digit, multiplies it by the base, adds the next digit, multiplies the result by the base, adds the next digit, and so on. For the example shown above, this reads:
$$(((\left(3\right)\cdot 16)+0)\cdot 16+3)\cdot 16+9=12345$$
The same formula can be rewritten by expanding the multiplications ($b$ denotes the base, ${d}_{i}$ denotes the i'th digit, where ${d}_{0}$ is the least significant digit.):
$$\sum _{i=0}^{N}{b}^{i}\cdot {d}_{i}$$For the example this would read:
$$3\cdot {16}^{3}+0\cdot {16}^{2}+3\cdot {16}^{1}+9\cdot {16}^{0}=12345$$Based on this formula, it's straightforward to formulate the desired algorithm:
1. Find the highest power of the target base that fits into the input.
2. Divide the input by that power; the quotient is the next (initially the most significant) digit.
3. Continue with the remainder as the new input.
4. Divide the power by the base.
5. Repeat from step 2 until enough digits have been generated, or the power reaches zero.
Let's go through our example again, and convert 12345 to hexadecimal using this algorithm. The highest power of 16 that fits into 12345 is ${16}^{3}=4096$:
$$\begin{array}{ccccc}\frac{12345}{4096}& =& 3& +& \frac{57}{4096}\\ \frac{57}{256}& =& 0& +& \frac{57}{256}\\ \frac{57}{16}& =& 3& +& \frac{9}{16}\\ \frac{9}{1}& =& 9& +& \frac{0}{1}\end{array}$$The quotients, read from top to bottom, are the digits 3, 0, 3 and 9.
As expected, we have reached the same result as above, 0x3039, but this time starting at the most significant digit.
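The steps above can be sketched in Rust as follows. This is an illustration only, not the actual passwordmaker-rs code; the function name is made up, `u128` stands in for the hash types, and the `max_digits` parameter shows the key benefit, namely that the conversion can stop as soon as enough digits have been produced:

```rust
// Most-significant-digit-first conversion: find the highest power of
// `base` that fits into `n`, then emit one digit per power, stopping
// early once `max_digits` digits have been produced.
fn to_digits_msd_first(mut n: u128, base: u128, max_digits: usize) -> Vec<u128> {
    // Linear search for the highest power of `base` that fits into `n`.
    // `power <= n / base` guarantees `power * base <= n`, so no overflow.
    let mut power: u128 = 1;
    while power <= n / base {
        power *= base;
    }
    let mut digits = Vec::new();
    while power > 0 && digits.len() < max_digits {
        digits.push(n / power); // quotient is the next digit
        n %= power;             // remainder is the new input
        power /= base;          // next lower power
    }
    digits
}

fn main() {
    // Full conversion of 12345 to hexadecimal: 0x3039.
    assert_eq!(to_digits_msd_first(12345, 16, 4), vec![3, 0, 3, 9]);
    // Only the first two digits needed? Then only two are computed.
    assert_eq!(to_digits_msd_first(12345, 16, 2), vec![3, 0]);
    println!("ok");
}
```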
For the hashes that fit into u128, the implementation is straightforward, using the arithmetic defined for this data type. It would be tempting to use the num_bigint crate for the bigger hashes, but that crate uses heap allocations, because it cannot make assumptions about the size of the data stored in the BigInt type. The size of the hashes used in passwordmaker-rs is known at compile time, however, so it is quite tempting to use a stack-allocated Sized type instead.
To make this possible, arithmetic for numbers of 20 and 32 bytes has been implemented, using a positional notation with a base of ${2}^{32}$. Multiplication uses the schoolbook method, and division has been implemented following Donald E. Knuth, The Art of Computer Programming, Vol. 2, Section 4.3, Algorithm D. This is the same algorithm that BigInt uses.
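To give an idea of what such limb-based arithmetic looks like, here are two of its simplest building blocks, sketched from scratch rather than taken from passwordmaker-rs: multiplying a fixed-size number by a single limb (one column of the schoolbook method), and dividing it by a single limb. The numbers are stored as little-endian base-${2}^{32}$ limbs:

```rust
// Multiply a little-endian base-2^32 number in place by a single limb.
// Returns the final carry; a non-zero carry means the result overflowed.
fn mul_small(limbs: &mut [u32], factor: u32) -> u32 {
    let mut carry: u64 = 0;
    for limb in limbs.iter_mut() {
        let product = (*limb as u64) * (factor as u64) + carry;
        *limb = product as u32; // low 32 bits stay in this position
        carry = product >> 32;  // high 32 bits carry into the next limb
    }
    carry as u32
}

// Divide a little-endian base-2^32 number in place by a single limb,
// most significant limb first. Returns the remainder.
fn div_rem_small(limbs: &mut [u32], divisor: u32) -> u32 {
    let mut rem: u64 = 0;
    for limb in limbs.iter_mut().rev() {
        let cur = (rem << 32) | (*limb as u64);
        *limb = (cur / divisor as u64) as u32;
        rem = cur % divisor as u64;
    }
    rem as u32
}

fn main() {
    let mut n = [0xFFFF_FFFFu32, 0]; // the value 2^32 - 1
    assert_eq!(mul_small(&mut n, 2), 0);
    assert_eq!(n, [0xFFFF_FFFE, 1]); // 2 * (2^32 - 1)
    let mut m = [12345u32];
    assert_eq!(div_rem_small(&mut m, 16), 9); // 12345 = 771 * 16 + 9
    assert_eq!(m, [771]);
    println!("ok");
}
```

Dividing by a full multi-limb divisor, as the base conversion needs, is considerably more involved; that is where Algorithm D with its normalization and error-correction steps comes in.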
While the above works, it is not particularly fast. Having to find the highest power of the base that fits into the input takes time, and dividing the divisor by the base between each iteration is not optimal either.
The search for the highest power of the base that fits into the input can be sped up by increasing the power quadratically instead of linearly, and only switching back to linear search for the last few steps.
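A sketch of that idea, again with a made-up function name and `u128` as a stand-in: grow the candidate power by squaring it while that still fits, then finish with single multiplications by the base.

```rust
// Find the highest power of `base` that fits into `n`.
// Quadratic growth (squaring) first, then a few linear steps.
fn highest_fitting_power(n: u128, base: u128) -> u128 {
    if n < base {
        return 1; // base^0 is the only power that fits
    }
    let mut power = base;
    // Quadratic phase: power, power^2, power^4, ... while it fits.
    while let Some(squared) = power.checked_mul(power) {
        if squared > n {
            break;
        }
        power = squared;
    }
    // Linear phase: single multiplications by the base for the rest.
    while let Some(next) = power.checked_mul(base) {
        if next > n {
            break;
        }
        power = next;
    }
    power
}

fn main() {
    assert_eq!(highest_fitting_power(12345, 16), 4096); // 16^3
    assert_eq!(highest_fitting_power(15, 16), 1);       // 16^0
    println!("ok");
}
```

For the exponents that actually occur here (at most 31 for base 16 in a u128), the linear tail after the squaring phase is only a handful of steps.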
The search can be skipped altogether, though. The chance that the highest power of the base that fits into the input's data type is also the largest one that fits the input value is rather high, and even when it is not, any leading zeros can simply be skipped. This means that instead of searching at runtime, one can precompute the highest fitting powers for the bases users are likely to need, trading a bit of binary size for quite a significant gain in performance. Of course it's not feasible to precompute this for all possible values of usize, but it is for the values users are expected to use. The runtime search remains as a fallback, in case a base that has not been precomputed is required. In passwordmaker-rs, the number of precomputed values can be tweaked with feature flags.
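A toy version of this lookup-with-fallback scheme might look like the following. The function names, the choice of bases (16 and the default 94), and the use of `u128` are all my own illustration; the real crate covers larger hash types and drives the table size via feature flags:

```rust
// Runtime fallback: find the highest power of `base` that fits into
// u128, together with its exponent.
fn runtime_search(base: u128) -> (u128, u32) {
    let mut power: u128 = 1;
    let mut exponent = 0;
    while let Some(next) = power.checked_mul(base) {
        power = next;
        exponent += 1;
    }
    (power, exponent)
}

// Precomputed values for expected bases, runtime search for the rest.
// In a real build the table would be consts generated by a build script
// or const fn, and its extent would be controlled by feature flags.
fn highest_power_in_type(base: u128) -> (u128, u32) {
    match base {
        16 => (16u128.pow(31), 31), // highest power of 16 in a u128
        94 => (94u128.pow(19), 19), // highest power of 94 in a u128
        _ => runtime_search(base),
    }
}

fn main() {
    // The precomputed entries agree with the runtime search.
    assert_eq!(highest_power_in_type(16), runtime_search(16));
    assert_eq!(highest_power_in_type(94), runtime_search(94));
    println!("ok");
}
```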
Under the (quite justified) assumption that multiplication is faster than division, the algorithm for base conversion can be modified:
1. Load the precomputed divisor: the highest power of the base that fits into the data type.
2. Divide the input by the divisor; the quotient is the most significant digit.
3. Replace the input by the remainder of that division.
4. Determine the number of remaining digits from the divisor.
5. Divide the divisor by the base; it stays fixed from now on.
6. Repeat until the predetermined number of digits has been generated:
7. Divide the input by the divisor; the quotient is the next digit.
8. Replace the input by the remainder, multiplied by the base.
The first division of the divisor by the base (step 5) is required to avoid overflow.
Even though this modification means that the number of digits in the division (step 7) does not decrease over time, the overall performance improved a lot on the tested hardware.
Let's go through the example of converting 12345 to hexadecimal one last time, with this modified algorithm. For simplicity, let's assume our data type can store decimal values up to 5 digits, so the largest representable value is 99999. The precomputed divisor is the highest power of 16 that fits into the data type, ${16}^{4}=65536$, which corresponds to 5 hexadecimal digits. The first digit comes from dividing by the full divisor; after that, the divisor is divided by the base once (step 5) and stays fixed at 4096, while each remainder is multiplied by 16:
$$\begin{array}{ccccc}\frac{12345}{65536}& =& 0& +& \frac{12345}{65536}\\ \frac{12345}{4096}& =& 3& +& \frac{57}{4096}\\ \frac{57\cdot 16}{4096}& =& 0& +& \frac{912}{4096}\\ \frac{912\cdot 16}{4096}& =& 3& +& \frac{2304}{4096}\\ \frac{2304\cdot 16}{4096}& =& 9& +& \frac{0}{4096}\end{array}$$
Again, we have reached our desired value of 0x03039. The leading zero can easily be skipped, for instance with the std::iter::Iterator::skip_while method.
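Putting the pieces together, a sketch of the multiplication-based variant could look like this. As before, this is an illustration under my own naming, with `u128` standing in for the fixed-size hash arithmetic and a runtime search standing in for the precomputed divisor:

```rust
// Multiplication-based conversion: the divisor is divided by the base
// exactly once up front (to avoid overflow), then stays fixed, and the
// remainder is multiplied by the base after every digit.
fn to_digits_fixed_divisor(mut n: u128, base: u128) -> Vec<u128> {
    // Highest power of `base` that fits into u128, and the resulting
    // digit count (stand-in for a precomputed value).
    let mut power: u128 = 1;
    let mut num_digits = 1;
    while let Some(next) = power.checked_mul(base) {
        power = next;
        num_digits += 1;
    }
    // First digit, using the full power.
    let mut digits = vec![n / power];
    n %= power;
    // Divide the divisor by the base once. From now on the remainder is
    // always < divisor, so remainder * base < power can never overflow.
    let divisor = power / base;
    for _ in 1..num_digits {
        digits.push(n / divisor);
        n = (n % divisor) * base;
    }
    digits
}

fn main() {
    let digits = to_digits_fixed_divisor(12345, 16);
    // Leading zeros are skipped afterwards, e.g. with skip_while.
    let trimmed: Vec<u128> = digits.into_iter().skip_while(|&d| d == 0).collect();
    assert_eq!(trimmed, vec![3, 0, 3, 9]);
    println!("ok");
}
```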
One thing worth noting: it might look tempting to change the termination condition from a predetermined number of digits to "stop when the dividend is zero", but that would be wrong whenever the result has trailing zeros.
The benchmarks were performed with a target base of 94 and hard-coded hash values. The numbers posted here were recorded on an AMD Ryzen 1700X processor. For the exact input parameters of the benchmarks, please check the benchmark source code (beware, the parameters labelled "worst case" in the source code are the worst case for version 0.1 - the worst case for version 0.2 is labelled "full divide").
For a "typical" password length of 12 characters, all hash lengths show a clear improvement over version 0.1:
However, for worst case scenarios, in which all digits of the result are required (20 for 16 bytes, 25 for 20 bytes, and 40 for 32 bytes), the results are not that great:
It is worth noting that for even longer passwords, the relative performance of version 0.2 compared to 0.1 improves again. Because of this, and because of the very significant improvement for typical password lengths, I still consider version 0.2 a huge improvement over 0.1. Also, most users will likely use the default hash algorithm, which produces 16-byte hashes and gains performance even in the worst case scenario.
The actual gains or losses in performance depend on the user's hardware though. I don't have the numbers any more, but I also profiled the code on a Raspberry Pi, and on that slower hardware version 0.2 easily outperformed version 0.1 for all possible input parameters. I won't reproduce this right now though, because compilation on the Raspberry Pi takes several hours.
The choice of u32 as the base of the number system in which the arithmetic is implemented was mostly based on gut feeling. The division algorithm from TAOCP requires an error-correction step which becomes less likely to be needed the larger the base is. In addition, 32 is the greatest common divisor of 160 and 256, so it's equally suited for both hash sizes. It would be worth investigating how switching to u64 or u8 affects performance.
There is likely still optimization potential in the arithmetic functions. It is for instance not clear why the normalization step in the division function takes as long as it does. Possibly those functions can still be reformulated to reduce their CPU time costs.