From a user's perspective, not much has changed between versions 0.1 and 0.2. The API is unchanged, and unless there is a bug, the output is the same too. This of course raises the question of what actually changed between the two versions.
The short answer is "performance".
Of course that's not the only change. Apart from the performance work, the code quality and the test coverage improved greatly as well. The performance changes also led to the introduction of feature flags that can be enabled to improve performance at the cost of increased binary size.
This blog is about the biggest change with respect to performance, namely the rewrite of the base conversion code. To understand why it was rewritten, you need to know what version 0.1 did in order to convert between different bases.
But first things first. The base conversion is the part of PasswordMaker Pro (and therefore also of passwordmaker-rs) that maps the hash value generated from the user's input onto the list of characters that the user allows in the generated password. PasswordMaker Pro chose to do this in a mathematically correct way, meaning that the most significant digit of the converted number determines the first character of the generated password (or password part, if the mapped hash is shorter than the desired password length). This has two consequences: the conversion has to run to completion before even the first password character is known, and all digits of the converted number have to be stored, including those that are never used.
As a reminder, the usual method to convert a number from base A to base B, using arithmetic in base A, is to repeatedly divide it by base B, noting the remainders, until the quotient is zero. Reading the remainders in reverse order yields the desired number. As an example, let's convert the number 12345 to hexadecimal.
$$\begin{array}{ccccc}\frac{12345}{16}& =& 771& +& \frac{9}{16}\\ \frac{771}{16}& =& 48& +& \frac{3}{16}\\ \frac{48}{16}& =& 3& +& \frac{0}{16}\\ \frac{3}{16}& =& 0& +& \frac{3}{16}\end{array}$$The remainders, read from bottom to top, are 0x3039. However, if we had only cared about the first two digits of that number, we would still have had to complete the whole conversion. In the case of cryptographic hashes the numbers are much larger, of course; the smallest hashes supported by PasswordMaker Pro are 16 bytes long.
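To make the problem concrete, here is a minimal sketch of that classic remainder-based conversion in Rust. The function name is made up for illustration, and `u128` stands in for the larger hash types; the point is that the digits come out least significant first, so the whole conversion has to finish (and every digit has to be stored) before the first password character is known:

```rust
// Classic base conversion by repeated division. Digits are produced
// least significant first, so they must all be computed and stored,
// then reversed, before the most significant digit is available.
fn to_digits_lsd_first(mut n: u128, base: u128) -> Vec<u128> {
    let mut digits = Vec::new();
    loop {
        digits.push(n % base); // note the remainder
        n /= base;             // continue with the quotient
        if n == 0 {
            break;
        }
    }
    digits.reverse(); // most significant digit first
    digits
}

fn main() {
    // 12345 in hexadecimal is 0x3039.
    assert_eq!(to_digits_lsd_first(12345, 16), vec![3, 0, 3, 9]);
    println!("{:?}", to_digits_lsd_first(12345, 16));
}
```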
By default, PasswordMaker Pro uses 94 characters for password generation. With a 16-byte hash this yields 20 characters per password-part. Typical user passwords are probably about half that length. For larger hashes an even lower fraction of the generated digits is used: 20 bytes yield 25 characters, and 32 bytes yield 40 per password-part. In addition, the converted number's digits need to be stored, and in the worst case this can mean storing 256 digits. Since the digits of the base conversion are used as indices into a user-supplied array, they can, depending on user input, go up to <usize>::MAX. In other words, the 256-digit case would take 2 kibibytes of memory, which is a significant portion of the L1 cache even on modern CPUs.
While this obviously is not a big issue for typical uses of passwordmaker-rs, it still annoyed me that the library does work that is then thrown away. So I decided to try to optimize this code. This left me with the most important question though: how? Knowing how it should not work makes the goals for an improvement rather obvious: the digits should be generated starting from the most significant one, the conversion should stop as soon as enough digits have been generated, and digits that are never used should not be computed or stored at all.
The idea for the alternative algorithm comes from the inverse operation of the base conversion presented above. To convert from base A to base B using base B arithmetic, one starts at the most significant digit, multiplies it by the base, adds the next digit, multiplies the result by the base, adds the next digit, and so on. For the example shown above, this reads:
$$(((\left(3\right)\cdot 16)+0)\cdot 16+3)\cdot 16+9=12345$$
The same formula can be rewritten by expanding the multiplications ($b$ denotes the base, ${d}_{i}$ denotes the i'th digit, where ${d}_{0}$ is the least significant digit.):
$$\sum _{i=0}^{N}{b}^{i}\cdot {d}_{i}$$For the example this would read:
$$3\cdot {16}^{3}+0\cdot {16}^{2}+3\cdot {16}^{1}+9\cdot {16}^{0}=12345$$Based on this formula, it's straightforward to formulate the desired algorithm:
1. Find the highest power of the target base that fits into the input.
2. Divide the input by that power; the quotient is the next (initially the most significant) digit.
3. Continue with the remainder as the new input.
4. Divide the power by the base.
5. Repeat from step 2 until enough digits have been generated, or the power reaches zero.
Let's go through our example again, and convert 12345 to hexadecimal using this algorithm. The highest power of 16 that fits into 12345 is ${16}^{3}=4096$:
$$\begin{array}{ccccc}\frac{12345}{4096}& =& 3& +& \frac{57}{4096}\\ \frac{57}{256}& =& 0& +& \frac{57}{256}\\ \frac{57}{16}& =& 3& +& \frac{9}{16}\\ \frac{9}{1}& =& 9& +& \frac{0}{1}\end{array}$$The quotients, read from top to bottom, are the digits 3, 0, 3 and 9.
As expected, we have reached the same result as above, 0x3039, but this time starting at the most significant digit.
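The steps above can be sketched in Rust as follows. This is an illustration only, not the actual passwordmaker-rs code; the function name is made up, `u128` stands in for the hash types, and the `max_digits` parameter shows the key benefit, namely that the conversion can stop as soon as enough digits have been produced:

```rust
// Most-significant-digit-first conversion: find the highest power of
// `base` that fits into `n`, then emit one digit per power, stopping
// early once `max_digits` digits have been produced.
fn to_digits_msd_first(mut n: u128, base: u128, max_digits: usize) -> Vec<u128> {
    // Linear search for the highest power of `base` that fits into `n`.
    // `power <= n / base` guarantees `power * base <= n`, so no overflow.
    let mut power: u128 = 1;
    while power <= n / base {
        power *= base;
    }
    let mut digits = Vec::new();
    while power > 0 && digits.len() < max_digits {
        digits.push(n / power); // quotient is the next digit
        n %= power;             // remainder is the new input
        power /= base;          // next lower power
    }
    digits
}

fn main() {
    // Full conversion of 12345 to hexadecimal: 0x3039.
    assert_eq!(to_digits_msd_first(12345, 16, 4), vec![3, 0, 3, 9]);
    // Only the first two digits needed? Then only two are computed.
    assert_eq!(to_digits_msd_first(12345, 16, 2), vec![3, 0]);
    println!("ok");
}
```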
For the hashes that fit into u128, the implementation is straightforward, using the arithmetic defined for this data type. It would be tempting to use the num_bigint crate for the bigger hashes, but that crate uses heap allocations, because it cannot make assumptions about the size of the data stored in the BigInt type. The size of the hashes used in passwordmaker-rs is known at compile time, however, so it is quite tempting to use a stack-allocated Sized type instead.
To make this possible, arithmetic for numbers of 20 and 32 bytes has been implemented, using a positional notation with a base of ${2}^{32}$. Multiplication uses the schoolbook method, and division has been implemented following Donald E. Knuth, The Art of Computer Programming, Vol. 2, Section 4.3, Algorithm D. This is the same algorithm that BigInt uses.
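To give an idea of what such limb-based arithmetic looks like, here are two of its simplest building blocks, sketched from scratch rather than taken from passwordmaker-rs: multiplying a fixed-size number by a single limb (one column of the schoolbook method), and dividing it by a single limb. The numbers are stored as little-endian base-${2}^{32}$ limbs:

```rust
// Multiply a little-endian base-2^32 number in place by a single limb.
// Returns the final carry; a non-zero carry means the result overflowed.
fn mul_small(limbs: &mut [u32], factor: u32) -> u32 {
    let mut carry: u64 = 0;
    for limb in limbs.iter_mut() {
        let product = (*limb as u64) * (factor as u64) + carry;
        *limb = product as u32; // low 32 bits stay in this position
        carry = product >> 32;  // high 32 bits carry into the next limb
    }
    carry as u32
}

// Divide a little-endian base-2^32 number in place by a single limb,
// most significant limb first. Returns the remainder.
fn div_rem_small(limbs: &mut [u32], divisor: u32) -> u32 {
    let mut rem: u64 = 0;
    for limb in limbs.iter_mut().rev() {
        let cur = (rem << 32) | (*limb as u64);
        *limb = (cur / divisor as u64) as u32;
        rem = cur % divisor as u64;
    }
    rem as u32
}

fn main() {
    let mut n = [0xFFFF_FFFFu32, 0]; // the value 2^32 - 1
    assert_eq!(mul_small(&mut n, 2), 0);
    assert_eq!(n, [0xFFFF_FFFE, 1]); // 2 * (2^32 - 1)
    let mut m = [12345u32];
    assert_eq!(div_rem_small(&mut m, 16), 9); // 12345 = 771 * 16 + 9
    assert_eq!(m, [771]);
    println!("ok");
}
```

Dividing by a full multi-limb divisor, as the base conversion needs, is considerably more involved; that is where Algorithm D with its normalization and error-correction steps comes in.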
While the above works, it is not particularly fast. Having to find the highest power of the base that fits into the input takes time, and dividing the divisor by the base between each iteration is not optimal either.
The search for the highest power of the base that fits into the input can be sped up by increasing the power quadratically instead of linearly, and only switching back to linear search for the last few steps.
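A sketch of that idea, again with a made-up function name and `u128` as a stand-in: grow the candidate power by squaring it while that still fits, then finish with single multiplications by the base.

```rust
// Find the highest power of `base` that fits into `n`.
// Quadratic growth (squaring) first, then a few linear steps.
fn highest_fitting_power(n: u128, base: u128) -> u128 {
    if n < base {
        return 1; // base^0 is the only power that fits
    }
    let mut power = base;
    // Quadratic phase: power, power^2, power^4, ... while it fits.
    while let Some(squared) = power.checked_mul(power) {
        if squared > n {
            break;
        }
        power = squared;
    }
    // Linear phase: single multiplications by the base for the rest.
    while let Some(next) = power.checked_mul(base) {
        if next > n {
            break;
        }
        power = next;
    }
    power
}

fn main() {
    assert_eq!(highest_fitting_power(12345, 16), 4096); // 16^3
    assert_eq!(highest_fitting_power(15, 16), 1);       // 16^0
    println!("ok");
}
```

For the exponents that actually occur here (at most 31 for base 16 in a u128), the linear tail after the squaring phase is only a handful of steps.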
The search can be skipped altogether, though. The chance that the highest power of the base that fits into the input's data type is also the largest one that fits the input value is rather high, and even when it is not, any leading zeros can simply be skipped. This means that instead of searching at runtime, one can precompute the highest fitting powers for the bases users are likely to need, trading a bit of binary size for quite a significant gain in performance. Of course it's not feasible to precompute this for all possible values of usize, but it is for the values users are expected to use. The runtime search remains as a fallback, in case a base that has not been precomputed is required. In passwordmaker-rs, the number of precomputed values can be tweaked with feature flags.
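A toy version of this lookup-with-fallback scheme might look like the following. The function names, the choice of bases (16 and the default 94), and the use of `u128` are all my own illustration; the real crate covers larger hash types and drives the table size via feature flags:

```rust
// Runtime fallback: find the highest power of `base` that fits into
// u128, together with its exponent.
fn runtime_search(base: u128) -> (u128, u32) {
    let mut power: u128 = 1;
    let mut exponent = 0;
    while let Some(next) = power.checked_mul(base) {
        power = next;
        exponent += 1;
    }
    (power, exponent)
}

// Precomputed values for expected bases, runtime search for the rest.
// In a real build the table would be consts generated by a build script
// or const fn, and its extent would be controlled by feature flags.
fn highest_power_in_type(base: u128) -> (u128, u32) {
    match base {
        16 => (16u128.pow(31), 31), // highest power of 16 in a u128
        94 => (94u128.pow(19), 19), // highest power of 94 in a u128
        _ => runtime_search(base),
    }
}

fn main() {
    // The precomputed entries agree with the runtime search.
    assert_eq!(highest_power_in_type(16), runtime_search(16));
    assert_eq!(highest_power_in_type(94), runtime_search(94));
    println!("ok");
}
```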
Under the (quite justified) assumption that multiplication is faster than division, the algorithm for base conversion can be modified:
1. Load the precomputed divisor: the highest power of the base that fits into the data type.
2. Divide the input by the divisor; the quotient is the most significant digit.
3. Replace the input by the remainder of that division.
4. Determine the number of remaining digits from the divisor.
5. Divide the divisor by the base; it stays fixed from now on.
6. Repeat until the predetermined number of digits has been generated:
7. Divide the input by the divisor; the quotient is the next digit.
8. Replace the input by the remainder, multiplied by the base.
The first division of the divisor by the base (step 5) is required to avoid overflow.
Even though this modification means that the number of digits in the division (step 7) does not decrease over time, the overall performance improved a lot on the tested hardware.
Let's go through the example of converting 12345 to hexadecimal one last time, with this modified algorithm. For simplicity, let's assume our data type can store decimal values up to 5 digits, so the largest representable value is 99999. The precomputed divisor is the highest power of 16 that fits into the data type, ${16}^{4}=65536$, which corresponds to 5 hexadecimal digits. The first digit comes from dividing by the full divisor; after that, the divisor is divided by the base once (step 5) and stays fixed at 4096, while each remainder is multiplied by 16:
$$\begin{array}{ccccc}\frac{12345}{65536}& =& 0& +& \frac{12345}{65536}\\ \frac{12345}{4096}& =& 3& +& \frac{57}{4096}\\ \frac{57\cdot 16}{4096}& =& 0& +& \frac{912}{4096}\\ \frac{912\cdot 16}{4096}& =& 3& +& \frac{2304}{4096}\\ \frac{2304\cdot 16}{4096}& =& 9& +& \frac{0}{4096}\end{array}$$
Again, we have reached our desired value of 0x03039. The leading zero can easily be skipped, for instance with the std::iter::Iterator::skip_while method.
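Putting the pieces together, a sketch of the multiplication-based variant could look like this. As before, this is an illustration under my own naming, with `u128` standing in for the fixed-size hash arithmetic and a runtime search standing in for the precomputed divisor:

```rust
// Multiplication-based conversion: the divisor is divided by the base
// exactly once up front (to avoid overflow), then stays fixed, and the
// remainder is multiplied by the base after every digit.
fn to_digits_fixed_divisor(mut n: u128, base: u128) -> Vec<u128> {
    // Highest power of `base` that fits into u128, and the resulting
    // digit count (stand-in for a precomputed value).
    let mut power: u128 = 1;
    let mut num_digits = 1;
    while let Some(next) = power.checked_mul(base) {
        power = next;
        num_digits += 1;
    }
    // First digit, using the full power.
    let mut digits = vec![n / power];
    n %= power;
    // Divide the divisor by the base once. From now on the remainder is
    // always < divisor, so remainder * base < power can never overflow.
    let divisor = power / base;
    for _ in 1..num_digits {
        digits.push(n / divisor);
        n = (n % divisor) * base;
    }
    digits
}

fn main() {
    let digits = to_digits_fixed_divisor(12345, 16);
    // Leading zeros are skipped afterwards, e.g. with skip_while.
    let trimmed: Vec<u128> = digits.into_iter().skip_while(|&d| d == 0).collect();
    assert_eq!(trimmed, vec![3, 0, 3, 9]);
    println!("ok");
}
```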
One thing worth noting: it might look tempting to change the termination condition from a predetermined number of digits to "stop when the dividend is zero", but that would be wrong whenever the result has trailing zeros.
The benchmarks were performed with a target base of 94 and hard-coded hash values. The numbers posted here were recorded on an AMD Ryzen 1700X processor. For the exact input parameters of the benchmarks, please check the benchmark source code (beware, the parameters labelled "worst case" in the source code are the worst case for version 0.1 - the worst case for version 0.2 is labelled "full divide").
For a "typical" password length of 12 characters, all hash lengths show a clear improvement over version 0.1:
However, for worst case scenarios, in which all digits of the result are required (20 for 16 bytes, 25 for 20 bytes, and 40 for 32 bytes), the results are not that great:
It is worth noting that for even longer passwords, the relative performance of version 0.2 compared to 0.1 improves again. Because of this, and because of the very significant improvement for typical password lengths, I still consider version 0.2 a huge improvement over 0.1. Also, most users will likely use the default hash algorithm, which produces 16-byte hashes and gains performance even in the worst case scenario.
The actual gains or losses in performance depend on the user's hardware though. I don't have the numbers any more, but I also profiled the code on a Raspberry Pi, and on that slower hardware version 0.2 easily outperformed version 0.1 for all possible input parameters. I won't reproduce this right now though, because compilation on the Raspberry Pi takes several hours.
The choice of u32 as the base of the number system in which the arithmetic is implemented was mostly based on gut feeling. The division algorithm from TAOCP requires an error-correction step which becomes less likely to be needed the larger the base is. In addition, 32 is the greatest common divisor of 160 and 256, so it's equally suited for both hash sizes. It would be worth investigating how switching to u64 or u8 affects performance.
There is likely still optimization potential in the arithmetic functions. It is for instance not clear why the normalization step in the division function takes as long as it does. Possibly those functions can still be reformulated to reduce their CPU time costs.