Base64 explained: 6-bit grouping, ~33% size inflation, padding, and base64url vs standard
What Base64 is — and what it is not
Base64 is a binary-to-text encoding scheme. It converts arbitrary binary data into a string of printable ASCII characters, making that data safe to embed in contexts that were designed for human-readable text. Base64 is not compression — it makes data larger. It is not encryption — anyone who has the encoded string can recover the original bytes trivially. The name comes from the fact that it uses an alphabet of exactly 64 characters, chosen because 2⁶ = 64 allows each character to represent exactly 6 bits of input.
Base64 was not invented as a stand-alone standard; it grew out of the need to send binary email attachments over SMTP infrastructure that only reliably transmitted 7-bit ASCII. RFC 2045 (1996), which defines MIME, standardised the encoding for email. A decade later, RFC 4648 (2006) consolidated base64 and its variants — including base32 and base64url — into a single reference document. RFC 4648 is still the authoritative specification used today for any new protocol that needs a binary-to-text encoding.
The 6-bit grouping: how three input bytes become four characters
The core of base64 is a simple bit-repackaging operation. Binary data arrives in groups of 8-bit bytes; base64 regroups those bits into 6-bit chunks, each of which maps to one character in the 64-character alphabet. Because 6 and 8 share a least common multiple of 24, the natural processing unit is 3 bytes = 24 bits → 4 characters (4 × 6 bits). Take the ASCII string Man as a concrete example: the three bytes are 0x4D, 0x61, 0x6E, giving the bit sequence 01001101 01100001 01101110. Regroup into four 6-bit values: 010011 (19 → T), 010110 (22 → W), 000101 (5 → F), 101110 (46 → u). The encoded result is TWFu.
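The regrouping for a single 3-byte block can be sketched in a few lines of Python. This is an illustrative sketch, not a full encoder: the name encode_group is hypothetical, and it only handles complete 3-byte groups (no padding).

```python
import base64
import string

# Standard alphabet from RFC 4648: A-Z, a-z, 0-9, then '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes into 4 base64 characters (hypothetical helper)."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the three bytes into one 24-bit integer, first byte most significant.
    packed = int.from_bytes(three_bytes, "big")
    # Slice the 24 bits into four 6-bit values, most significant first.
    indices = [(packed >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)


print(encode_group(b"Man"))  # TWFu
# Cross-check against the standard library encoder.
print(base64.b64encode(b"Man").decode("ascii"))  # TWFu
```

The shifts 18, 12, 6, 0 pick out the same four 6-bit values listed above (19, 22, 5, 46).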
This 3-to-4 ratio is the direct source of the size inflation. Every 3 bytes of input produce exactly 4 characters of output, regardless of the byte values. For n bytes of input, the padded output is ceil(n / 3) × 4 characters and the unpadded output is ceil(4n / 3) characters, so the output is about 33% larger than the input (exactly 4/3 times larger when n is a multiple of 3). This expansion is fixed and predictable, unlike compression algorithms whose output size varies with the input content. A 1 KiB file (1,024 bytes) encodes to 1,368 characters, about 1.34 KiB; a 1 MB image encodes to about 1.33 MB.
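The length arithmetic is easy to check against a real encoder. In the sketch below, padded_length and unpadded_length are hypothetical helper names; ceil(n / 3) × 4 gives the length with = padding included, and ceil(4n / 3) gives the length once padding is stripped.

```python
import base64


def padded_length(n: int) -> int:
    """Characters produced for n input bytes, '=' padding included."""
    return 4 * ((n + 2) // 3)  # integer form of ceil(n / 3) * 4


def unpadded_length(n: int) -> int:
    """Characters produced for n input bytes with padding stripped."""
    return (4 * n + 2) // 3  # integer form of ceil(4n / 3)


for n in (1, 2, 3, 1024, 1_000_000):
    encoded = base64.b64encode(bytes(n))
    # The formulas should agree with the standard library for every n.
    assert len(encoded) == padded_length(n)
    assert len(encoded.rstrip(b"=")) == unpadded_length(n)
    print(n, padded_length(n), unpadded_length(n))
```

For 1,024 bytes this gives 1,368 characters, a ratio of roughly 1.336.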
The alphabet: A–Z, a–z, 0–9, +, /
The standard base64 alphabet maps 6-bit values 0–63 to printable ASCII characters in this order: values 0–25 map to uppercase A–Z, values 26–51 map to lowercase a–z, values 52–61 map to digits 0–9, value 62 maps to +, and value 63 maps to /. This ordering places letters before digits, whereas ASCII places digits first. MIME inherited the alphabet from the earlier Privacy-Enhanced Mail (PEM) encoding, and RFC 2045 notes that its characters were chosen because they are represented identically in all versions of ISO 646, including US-ASCII, and in all versions of EBCDIC. The choice of + and / for the last two values has nonetheless caused decades of interoperability headaches, as both characters carry special meaning in URLs.
A common misconception is that base64 is character-set agnostic. It is not: the alphabet is fixed ASCII. A decoder that encounters a character outside the 65-character set (64 data characters plus = for padding) must either reject the input or skip it, depending on the specification context. RFC 4648 requires decoders to reject input containing non-alphabet characters unless the specification that references it explicitly says otherwise; MIME (RFC 2045), for instance, tells decoders to ignore them, which is what allows line breaks inside encoded email bodies. The equals sign = is a sentinel, not a data character: it marks padding, as discussed in the next section, and should never appear except at the very end of an encoded string.
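Python's standard library exposes both behaviours through the validate flag of base64.b64decode, which makes the difference easy to see. The wrapper names decode_lenient and decode_strict below are hypothetical; the input contains a line break of the kind a MIME encoder would insert.

```python
import base64
import binascii


def decode_lenient(text: str) -> bytes:
    """Skip characters outside the alphabet (Python's default behaviour)."""
    return base64.b64decode(text)


def decode_strict(text: str) -> bytes:
    """Reject any character outside the alphabet."""
    return base64.b64decode(text, validate=True)


wrapped = "TW\nFu"  # "TWFu" with a line break in the middle

print(decode_lenient(wrapped))  # b'Man'

try:
    decode_strict(wrapped)
except binascii.Error as exc:
    print("rejected:", exc)
```

Which mode is appropriate depends on the protocol: strict decoding suits tokens and signatures, where stray characters usually indicate tampering or truncation.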
Padding: why the equals signs exist and when they can be dropped
Because the encoding processes 3 bytes at a time, inputs whose length is not divisible by 3 leave a partial group at the end. Padding solves this: it forces the encoded output to always be a multiple of 4 characters, making it possible to calculate the original byte count from the output length and the number of padding characters. If the input has 1 leftover byte (8 bits), 2 padding characters == are appended: the 2 encoded characters carry 12 bits, of which only 8 are data and the remaining 4 are zero. If the input has 2 leftover bytes (16 bits), 1 padding character = is appended: the 3 encoded characters carry 18 bits, of which only 16 are data and the remaining 2 are zero.
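The three cases can be seen by encoding M, Ma, and Man, and the original byte count can be recovered from the padded length together with the number of trailing = characters. The function name decoded_length is a hypothetical helper for illustration.

```python
import base64


def decoded_length(encoded: str) -> int:
    """Byte count of the original data, computed without decoding."""
    if len(encoded) % 4 != 0:
        raise ValueError("padded base64 length must be a multiple of 4")
    padding = len(encoded) - len(encoded.rstrip("="))
    if padding > 2:
        raise ValueError("at most two '=' characters are valid")
    # Each 4-character group holds 3 bytes; each '=' stands for one missing byte.
    return len(encoded) // 4 * 3 - padding


for raw in (b"M", b"Ma", b"Man"):
    encoded = base64.b64encode(raw).decode("ascii")
    print(raw, encoded, decoded_length(encoded))
# b'M' TQ== 1
# b'Ma' TWE= 2
# b'Man' TWFu 3
```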
Padding is optional in contexts where the total length is known by other means. RFC 7515, the JSON Web Signature specification, defines its base64url encoding with all trailing = characters omitted. The padding is redundant there: the dots between the three parts of a JWT already delimit the segments, and the number of missing padding characters follows from each segment's length modulo 4. Base64url-encoded values in JWTs therefore never end in =. Stripping padding and restoring it are both straightforward: to restore padding, append = characters until the string length is a multiple of 4 (a length of 1 modulo 4 is never valid). Library behaviour varies: some decoders accept unpadded input, while others, such as Python's base64.b64decode and Go's base64.StdEncoding, reject it unless the padding is restored or an unpadded variant is selected.
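Stripping and restoring padding can be sketched as follows. Python's standard decoder is one that insists on correct padding, so it serves as a useful check; strip_padding and restore_padding are hypothetical helper names.

```python
import base64
import binascii


def strip_padding(encoded: str) -> str:
    """Remove trailing '=' characters."""
    return encoded.rstrip("=")


def restore_padding(unpadded: str) -> str:
    """Append '=' until the length is a multiple of 4."""
    if len(unpadded) % 4 == 1:
        # One leftover character carries only 6 bits, less than a full byte.
        raise ValueError("length of 1 modulo 4 is never valid base64")
    return unpadded + "=" * (-len(unpadded) % 4)


unpadded = strip_padding("TQ==")  # "TQ"

try:
    base64.b64decode(unpadded)
except binascii.Error as exc:
    print("unpadded input rejected:", exc)

print(base64.b64decode(restore_padding(unpadded)))  # b'M'
```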
Base64url: the URL-safe variant used in JWTs and OAuth tokens
Standard base64's + and / characters are problematic in URL contexts. A + in a query string is interpreted as a space under the application/x-www-form-urlencoded encoding used by HTML forms. A / is a path-segment separator. To use a standard base64 string in a URL, each + and / must be percent-encoded as %2B and %2F respectively, making the string longer and harder to read. Base64url, defined in RFC 4648 Section 5, replaces + with - and / with _. These two characters are safe in URLs without percent-encoding. Padding = is also typically omitted in base64url contexts to avoid %3D in URLs.
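The cost of percent-encoding is visible with a two-byte input chosen so that the standard encoding contains both + and /. This is a small sketch using urllib.parse.quote with safe="" so that every reserved character is escaped; the variable names are illustrative.

```python
import base64
from urllib.parse import quote

raw = b"\xfb\xff"  # chosen so the encoding uses values 62 and 63

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

# Escape everything that is not unreserved in a URL.
escaped_standard = quote(standard, safe="")
escaped_url_safe = quote(url_safe, safe="")

print(standard, "->", escaped_standard)  # +/8= -> %2B%2F8%3D
print(url_safe, "->", escaped_url_safe)  # -_8 -> -_8
```

The standard form grows from 4 to 10 characters once escaped, while the unpadded base64url form passes through unchanged.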
Base64url is used wherever a compact, URL-safe binary representation is needed without a separate encoding step: the header and payload sections of a JWT are base64url-encoded JSON; OAuth 2.0 authorization codes and tokens are often base64url-encoded random bytes; PKCE (RFC 7636) derives its S256 code challenge by base64url-encoding the SHA-256 hash of the code verifier. The size overhead is the same as for standard base64, about 4/3, because only the alphabet differs, not the bit-grouping algorithm. Decoding base64url is the same operation as decoding standard base64 after substituting - → + and _ → / and, where the decoder requires it, restoring any omitted padding.
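The substitution can be written directly on top of a standard base64 codec. The sketch below assumes unpadded base64url as used in JWTs; b64url_encode and b64url_decode are hypothetical names, and Python also ships base64.urlsafe_b64encode and base64.urlsafe_b64decode, which perform the same alphabet swap.

```python
import base64

_TO_URL_SAFE = str.maketrans("+/", "-_")
_TO_STANDARD = str.maketrans("-_", "+/")


def b64url_encode(data: bytes) -> str:
    """Standard base64, then swap the two characters and drop padding."""
    standard = base64.b64encode(data).decode("ascii")
    return standard.translate(_TO_URL_SAFE).rstrip("=")


def b64url_decode(text: str) -> bytes:
    """Swap the characters back, restore padding, decode as standard base64."""
    standard = text.translate(_TO_STANDARD)
    standard += "=" * (-len(standard) % 4)
    return base64.b64decode(standard, validate=True)


print(b64url_encode(b"\xfb\xff\xfe"))  # -__-
# A typical JWT header segment decodes to plain JSON.
print(b64url_decode("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"))
# b'{"alg":"HS256","typ":"JWT"}'
```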
Practical size impact and choosing between base64 and binary
The ~33% overhead is the main practical cost of base64. For Data URIs — embedding images, fonts, or SVGs directly in HTML or CSS with data:image/png;base64,... — the tradeoff is eliminating one HTTP round-trip at the cost of a larger document. This is beneficial for small resources (typically under 4–8 KB): the saved round-trip outweighs the size increase, especially on high-latency connections. For larger resources, the size increase degrades performance because the larger HTML document takes longer to parse, blocks rendering, and cannot be cached separately from the page itself.
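Building a Data URI is a one-line operation, which makes it easy to measure the overhead for a given resource before deciding whether to inline it. The helper name to_data_uri is hypothetical, and the sample bytes stand in for a real image or font.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a base64 Data URI with the given MIME type."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


sample = bytes(3000)  # placeholder for a small resource of 3,000 bytes
uri = to_data_uri(sample, "image/png")

# The URI is longer than the raw resource by the 4/3 factor plus the prefix.
print(len(sample), len(uri))  # 3000 4022
```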
HTTP Basic Authentication encodes credentials as base64(username:password) in the Authorization: Basic ... header. This is not a security measure — the username and password are trivially recoverable by decoding. Basic Auth requires HTTPS to be safe; the base64 encoding exists only because the HTTP header specification requires printable ASCII. Similarly, base64-encoded data in a JSON response or a JWT payload is fully readable by any party who intercepts the data. If a payload contains sensitive information — a private key, a social security number, a medical record — base64 encoding is not a substitute for encryption. Use authenticated encryption (AES-GCM, for example) or JSON Web Encryption (JWE, RFC 7516) for confidential data.
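How little protection the encoding offers is clear from a round trip. The sketch below builds a Basic header and then recovers the credentials from it with no key or secret; the function names are hypothetical, the credentials are the well-known example pair from the Basic Auth specification, and UTF-8 is assumed for the character encoding.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an Authorization header for HTTP Basic Auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def decode_basic_auth(header_value: str) -> tuple[str, str]:
    """Recover (username, password) from a Basic header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token, validate=True).decode("utf-8")
    # Only the first colon separates the fields; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)                     # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
print(decode_basic_auth(header))  # ('Aladdin', 'open sesame')
```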