Faster base64 in curl

This adventure started with an issue where a user pointed out that the libcurl function for base64 encoding actually would allocate a few bytes too many at times.

That turned out to be true and we fixed it fairly quickly.

As I glanced at that base64 encoder function, still open in my editor window, it struck me that it really was not written in an optimal way.

Base64 encoding

This “encoding” converts 8-bit data into 6-bit data, where each 6-bit combination has a dedicated ASCII character. It uses these 64 different characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/.

Three 8-bit bytes make up 24 bits of data, which can be represented by four 6-bit symbols. Like this: the example byte sequence 0x12, 0x34 and 0x56 creates the 24-bit value 0x123456. Shown in binary it looks like: 00010010 00110100 01010110

That 24-bit number is split into 6-bit chunks: 000100, 100011, 010001 and 010110. Written in decimal, they are 4, 35, 17 and 22. Pick the corresponding symbols from those indexes in the base64 table shown above and they make the base64 encoded sequence: EjRW. And so on.

That’s base64 encoding.
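
To make that concrete, here is a minimal standalone sketch (my illustration, not curl’s code) that encodes those exact three bytes:

#include <stdio.h>

int main(void)
{
  /* the base64 alphabet from above */
  static const char table64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  const unsigned char in[3] = { 0x12, 0x34, 0x56 };

  /* combine the three bytes into one 24-bit value */
  unsigned int x = (in[0] << 16) | (in[1] << 8) | in[2];

  /* pick out the four 6-bit chunks, most significant first */
  printf("%c%c%c%c\n",
         table64[(x >> 18) & 0x3f],
         table64[(x >> 12) & 0x3f],
         table64[(x >> 6) & 0x3f],
         table64[x & 0x3f]);
  return 0; /* prints EjRW */
}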

A realization

The base64 encoder function source code I looked at was introduced in curl in the late 1990s and existed in the first commit we have saved. It has remained mostly intact since: over twenty-two years old.

This is how the code used to look. Fairly readable, but with a lot of conditions and, perhaps most importantly, with calls to msnprintf() to output data. msnprintf() is our internal snprintf implementation.

The old base64 encoder

while(insize > 0) {
  for(i = inputparts = 0; i < 3; i++) {
    if(insize > 0) {
      inputparts++;
      ibuf[i] = (unsigned char) *indata;
      indata++;
      insize--;
    }
    else
      ibuf[i] = 0;
  }

  obuf[0] = ((ibuf[0] & 0xFC) >> 2);
  obuf[1] = (((ibuf[0] & 0x03) << 4) |
            ((ibuf[1] & 0xF0) >> 4));
  obuf[2] = (((ibuf[1] & 0x0F) << 2) |
            ((ibuf[2] & 0xC0) >> 6));
  obuf[3] = (ibuf[2] & 0x3F);

  switch(inputparts) {
  case 1: /* only one byte read */
    i = msnprintf(output, 5, "%c%c%s%s",
                  table64[obuf[0]],
                  table64[obuf[1]],
                  padstr,
                  padstr);
    break;

  case 2: /* two bytes read */
    i = msnprintf(output, 5, "%c%c%c%s",
                  table64[obuf[0]],
                  table64[obuf[1]],
                  table64[obuf[2]],
                  padstr);
    break;

  default:
    i = msnprintf(output, 5, "%c%c%c%c",
                  table64[obuf[0]],
                  table64[obuf[1]],
                  table64[obuf[2]],
                  table64[obuf[3]]);
    break;
  }
  output += i;

}

(The padstr variable in there is for handling the mode where it does not output any final = padding characters.)

Improving

I started out by writing a test program. I created a huge string and made the test program base64-encode a part of that string, starting with a one-byte string and then increasing the length one byte at a time, iterating over all sizes up to the final one, which happened to be 106128 bytes. Maybe not the most realistic test in the world, but at least it does a lot of base64 encoding. The base64 encode algorithm is also content agnostic, so the exact content does not matter; its size is the main thing.
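
The rough shape of that test program looks like this sketch, where encode() is a placeholder name for the internal libcurl function under test (a skeleton for illustration, not the actual test code):

#include <stdlib.h>
#include <string.h>

#define MAXSIZE 106128

/* placeholder for the encoder function being benchmarked */
extern void encode(char *dest, const char *src, size_t len);

int main(void)
{
  char *src = malloc(MAXSIZE);
  /* base64 needs four output bytes for every three input bytes,
     plus room for a null terminator */
  char *dest = malloc(((MAXSIZE + 2) / 3) * 4 + 1);
  size_t len;
  memset(src, 'x', MAXSIZE);
  /* encode every length from one byte up to the full size */
  for(len = 1; len <= MAXSIZE; len++)
    encode(dest, src, len);
  free(src);
  free(dest);
  return 0;
}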

My first casual attempt only replaced the msnprintf() calls with direct assignments into the target buffer, and it made me doubt my numbers and my test program: it ran 14 times faster.

Motivated by the enormous performance gain seen with that minor change, I continued. I removed the use of the obuf array, and it occurred to me that I should deal with the encoding in two phases: one main phase for all complete three-byte triplets, and a separate final one for the padded last chunk, as that avoids conditions in the main loop.

The final result that I ended up merging showed an almost 29 times improvement: with the old code the test program took six minutes to complete; the new one finished in twelve seconds.

The new encoder

  while(insize >= 3) {
    *output++ = table64[ in[0] >> 2 ];
    *output++ = table64[ ((in[0] & 0x03) << 4) |
                         (in[1] >> 4) ];
    *output++ = table64[ ((in[1] & 0x0F) << 2) | 
                         ((in[2] & 0xC0) >> 6) ];
    *output++ = table64[ in[2] & 0x3F ];
    insize -= 3;
    in += 3;
  }
  if(insize) {
    /* this is only one or two bytes now */
    *output++ = table64[ in[0] >> 2 ];
    if(insize == 1) {
      *output++ = table64[(in[0] & 0x03) << 4];
      if(*padstr) {
        *output++ = *padstr;
        *output++ = *padstr;
      }
    }
    else {
      /* insize == 2 */
      *output++ = table64[((in[0] & 0x03) << 4) |
                          ((in[1] & 0xF0) >> 4)];
      *output++ = table64[(in[1] & 0x0F) << 2];
      if(*padstr)
        *output++ = *padstr;
    }
  }

I think the new version is still highly readable, and it actually is significantly smaller than the previous version!

Base64 decoding

Energized by that fascinating improvement I managed to do to the encoder function, I turned my eyes to the base64 decoder function. This function is slightly newer in curl than the encoder, but still traces back to 2001 and it too was never improved much after its initial merge.

I started again by writing another test program. This one creates a 2948-byte base64 encoded string, which the test program then iterates over and decodes pieces of: from one byte up to the full size, and it then loops and repeats that procedure a thousand times.

When decoding base64, the core of the code needs to find out what binary number each particular input octet represents. ‘A’ is zero, ‘B’ is one, etc. Then it needs to take four such consecutive (effectively 6-bit) letters at a time and output three 8-bit bytes. Over and over. In a final round it might need to deal with =-padding, used to make the input an even multiple of four bytes.

The old decoder

This is what the old decodeQuantum function looked like. It decodes a 4-letter sequence into three output bytes.

  for(i = 0, s = src; i < 4; i++, s++) {
    if(*s == '=') {
      x <<= 6;
      padding++;
    }
    else {
      const char *p = strchr(base64, *s);
      if(p)
        x = (x << 6) + curlx_uztoul(p - base64);
      else
        return 0;
    }
  }

  if(padding < 1)
    dest[2] = curlx_ultouc(x & 0xFFUL);

  x >>= 8;
  if(padding < 2)
    dest[1] = curlx_ultouc(x & 0xFFUL);

  x >>= 8;
  dest[0] = curlx_ultouc(x & 0xFFUL);

The new decoder

What immediately sticks out in the old code is the use of strchr() to find each letter’s offset in the base64 table as a means to figure out its 6-bit value. The larger the value, the longer the function needs to search through the string to find it. And it needs to do such a search for every single byte of the input string.

What if we could replace the search by a lookup table instead?

My updated code features a lookup table indexed by the input byte, and, in similar fashion to how the encoder logic works, it now separates the final padding step into an extra block at the end, to avoid extra conditions in the main loop.
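
Building such a table is straightforward; here is a sketch of one way to do it (curl’s actual table may well be constructed differently), where every byte that is not a valid base64 symbol maps to 0xff:

#include <string.h>

static const char base64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* maps each possible input byte to its 6-bit value, or to 0xff for
   bytes that are not valid base64 symbols */
static unsigned char lookup[256];

static void init_lookup(void)
{
  int i;
  memset(lookup, 0xff, sizeof(lookup));
  for(i = 0; i < 64; i++)
    lookup[(unsigned char)base64[i]] = (unsigned char)i;
}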

With the simplified quad decoder, I put the whole thing in the same loop. Unfortunately I could not think of a way to avoid the check for invalid characters in the loop.

It turns out this new base64 decoder is about 4.5 times faster than the previous code!

  for(i = 0; i < fullQuantums; i++) {
    unsigned char val;
    unsigned int x = 0;
    int j;

    for(j = 0; j < 4; j++) {
      val = lookup[(unsigned char)*src++];
      if(val == 0xff) /* bad symbol */
        goto bad;
      x = (x << 6) | val;
    }
    pos[2] = x & 0xff;
    pos[1] = (x >> 8) & 0xff;
    pos[0] = (x >> 16) & 0xff;
    pos += 3;
  }
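
The loop above only handles the complete quantums. To illustrate the separate final step mentioned earlier, the trailing, possibly =-padded quantum could be handled along these lines (a sketch assuming a numPadding count derived from the input beforehand; not necessarily curl’s exact code):

  /* decode the final quantum, which ends in one or two '=' padding
     characters; numPadding (1 or 2) is assumed to have been derived
     from the input beforehand */
  if(numPadding) {
    unsigned int x = 0;
    int j;
    for(j = 0; j < 4 - numPadding; j++) {
      unsigned char val = lookup[(unsigned char)*src++];
      if(val == 0xff)
        goto bad;
      x = (x << 6) | val;
    }
    if(numPadding == 1) {
      /* three symbols read: two output bytes */
      pos[0] = (x >> 10) & 0xff;
      pos[1] = (x >> 2) & 0xff;
    }
    else {
      /* numPadding == 2: two symbols read, one output byte */
      pos[0] = (x >> 4) & 0xff;
    }
  }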

Final words

I wrote my test applications to measure the impact of my changes. We could argue whether they show real world use cases or not, but I think they still prove that my changes were good and made the functions faster. The exact numbers are not that important.

The exact performance numbers I mention in this blog post are based on results that I saw on my old Intel-based development machine as the best result of at least three consecutive runs. Other systems and architectures will for sure show different numbers.

Base64 encoding and decoding are not significant functions in curl, and they are not even very frequently called. These improvements are not likely to even be noticeable to curl or libcurl users. I still consider these optimizations worthwhile, because why not do things as fast as you can if you are going to do them anyway, as long as there’s no particular penalty involved.

The changes I did for the base64 handling were done entirely without changing the behavior or function prototypes in order to compartmentalize them and keep them constrained.

Credits

Image by DrZoltan from Pixabay

Discussed

On Hacker News, Reddit and lobste.rs

8 thoughts on “Faster base64 in curl”

  1. To avoid the error check in the loop, how about this:

    unsigned char err = 0;
    for(i = 0; i < fullQuantums; i++) {
      unsigned char val;
      unsigned int x = 0;
      int j;
      for(j = 0; j < 4; j++) {
        val = lookup[(unsigned char)*src++];
        err |= val;
        x = (x << 6) | val;
      }
      pos[2] = x & 0xff;
      pos[1] = (x >> 8) & 0xff;
      pos[0] = (x >> 16) & 0xff;
      pos += 3;
    }
    if(err == 0xff)
      goto bad;

  2. Previous post seemed to render badly.
    The gist is, maintain an initially zero unsigned char err, which you bitwise-or each looked up val into.

    For good vals, err will never exceed 0x3f. As soon as a bad val is read, it will become and stay 0xff.

    Then at the end of the loop goto bad if err equals 0xff.

    1. @Alasdair: Thanks. I actually tried exactly that approach, but I found the performance difference to be so small that I didn’t go with that version.

      In fact, it usually slows it down to add a bitwise operator there instead of the existing condition!

  3. *output++

    This is new C code that does string manipulation with raw pointers. Is this fuzzed?

    1. @mappu memory is memory. The loop is obviously bounded and contains *no* control structures. The output buffer is sufficiently large. Using something other than “raw pointers” could not make it safer than it already is.

  4. Great optimizations!!
    Really like the lookup table and post cycle processing.
    It is C and readability is a matter of perspective, but it can be improved by replacing repeated bit masks with function-like macro definitions: that gives meaning to the mask and creates a single point of maintainability (e.g. #define GET_HIGH_6BITS(in) ((in[0]) >> 2)).
    Please remove the goto bad; it could be replaced with a break and a post-loop check for i < fullQuantums: this will not decrease the performance, just increase the readability.

    1. @GeorgeRadev: I’ve dropped this and moved on now. If someone wants to further improve this area, feel free to do so and submit a PR and we can take it from there.
