Puid

Simple, fast, flexible and efficient generation of probably unique identifiers (puid, aka random strings) of intuitively specified entropy using pre-defined or custom characters.

iex> defmodule(RandId, do: use(Puid, chars: :alpha, total: 1.0e5, risk: 1.0e12))
iex> RandId.generate()
"YAwrpLRqXGlny"

Hex VersionLicense: MIT

Table of Contents

Overview

Puid provides a means to create modules for generating random IDs. Specifically, Puid allows full control over all three key characteristics of generating random strings: entropy source, ID characters and ID randomness.

A general overview provides information relevant to the use of Puid for random IDs.

Usage

Puid is used to create individual modules for random ID generation. Creating a random ID generator module is a simple as:

iex> defmodule(SessionId, do: use(Puid))
iex> SessionId.generate()
"8nGA2UaIfaawX-Og61go5A"

The code above use default parameters, so Puid creates a module suitable for generating session IDs (ID entropy for the default module is 132 bits). Options allow easy and complete control of the important facets of ID generation.

Entropy Source

Puid uses :crypto.strong_rand_bytes/1 as the default entropy source. The rand_bytes option can be used to specify any function of the form (non_neg_integer) -> binary as the source:

iex > defmodule(PrngId, do: use(Puid, rand_bytes: &:rand.bytes/1))
iex> PrngId.generate()
"bIkrSeU6Yr8_1WHGvO0H3M"

Characters

By default, Puid use the RFC 4648 file system & URL safe characters. The chars option can by used to specify any of the 31 pre-defined character sets or custom characters, including Unicode:

iex> defmodule(HexId, do: use(Puid, chars: :hex))
iex> HexId.generate()
"13fb81e35cb89e5daa5649802ad4bbbd"
iex> defmodule(Base58Id, do: use(Puid, chars: :base58))
iex> Base58Id.generate()
"vRxen9A4vejoX4U66iaHna"
iex> defmodule(DingoskyId, do: use(Puid, chars: "dingosky"))
iex> DingoskyId.generate()
"yiidgidnygkgydkodggysonydodndsnkgksgonisnko"
iex> defmodule(DingoskyUnicodeId, do: use(Puid, chars: "dîñgø$kyDÎÑGØßK¥", total: 2.5e6, risk: 1.0e15))
iex> DingoskyUnicodeId.generate()
"øßK$ggKñø$dyGîñdyØøØÎîk"

Captured Entropy

The default Puid module generates IDs that have 132-bit entropy. Puid provides a simple, intuitive way to specify ID randomness by declaring a total number of possible IDs with a specified risk of a repeat in that many IDs:

To generate up to 10 million random IDs with 1 in a trillion chance of repeat:

iex> defmodule(MyPuid, do: use(Puid, total: 10.0e6, risk: 1.0e15))
iex> MyPuid.generate()
"T0bFZadxBYVKs5lA"

The bits option can be used to directly specify an amount of ID randomness:

iex> defmodule(Token, do: use(Puid, bits: 256, chars: :hex_upper))
iex> Token.generate()
"6E908C2A1AA7BF101E7041338D43B87266AFA73734F423B6C3C3A17599F40F2A"

Sampler

The sampler option controls how source entropy bits are transformed into character indices for non-power-of-2 character sets:

iex> defmodule(BitShiftId, do: use(Puid, chars: :alphanum_lower, sampler: :bit_shift))
iex> defmodule(IntervalId, do: use(Puid, chars: :alphanum_lower, sampler: :interval))
iex> IntervalId.info().ete > BitShiftId.info().ete
true

info().ete is sampler-aware and reflects the selected strategy.

General Note

The mathematical approximations used by Puid always favor conservative estimation:

Installation

Add puid to mix.exs dependencies:

def deps,
do: [
{:puid, "~> 2.1"}
]

Update dependencies

mix deps.get

Module API

Puid modules have the following functions:

The total/1, risk/1 functions provide approximations to the risk of a repeat in some total number of generated puids. The mathematical approximations used purposely overestimaterisk and underestimatetotal.

The encode/1, decode/1 functions convert String.t()puids to and from bitstringbits to facilitate binary data storage, e.g. as an Ecto type.

The info/0 function returns a Puid.Info structure consisting of:

Example

iex> defmodule(SafeId, do: use(Puid))
iex> SafeId.generate()
"CSWEPL3AiethdYFlCbSaVC"
iex> SafeId.total(1_000_000)
104350568690606000
iex> SafeId.risk(1.0e12)
9007199254740992
iex> SafeId.decode("CSWEPL3AiethdYFlCbSaVC")
<<9, 37, 132, 60, 189, 192, 137, 235, 97, 117, 129, 101, 9, 180, 154, 84, 32>>
iex> SafeId.encode(<<9, 37, 132, 60, 189, 192, 137, 235, 97, 117, 129, 101, 9, 180, 154, 84, 2::size(4)>>)
"CSWEPL3AiethdYFlCbSaVC"
iex> SafeId.info()
%Puid.Info{
characters: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_",
char_set: :safe64,
entropy_bits: 132.0,
entropy_bits_per_char: 6.0,
ere: 0.75,
ete: 1.0,
length: 22,
rand_bytes: &:crypto.strong_rand_bytes/1
}

Characters

Puid Predefined Charsets

NameCountEREETECharacters
:alpha525.70.84ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
:alpha_lower264.70.81abcdefghijklmnopqrstuvwxyz
:alpha_upper264.70.81ABCDEFGHIJKLMNOPQRSTUVWXYZ
:alphanum625.950.97ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
:alphanum_lower365.170.65abcdefghijklmnopqrstuvwxyz0123456789
:alphanum_upper365.170.65ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
:base16164.01.00123456789ABCDEF
:base32325.01.0ABCDEFGHIJKLMNOPQRSTUVWXYZ234567
:base32_hex325.01.00123456789abcdefghijklmnopqrstuv
:base32_hex_upper325.01.00123456789ABCDEFGHIJKLMNOPQRSTUV
:base36365.170.650123456789abcdefghijklmnopqrstuvwxyz
:base36_upper365.170.650123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
:base45455.490.780123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:
:base58585.860.91123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
:base62625.950.97ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
:base85856.410.77!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
:bech32325.01.0023456789acdefghjklmnpqrstuvwxyz
:boolean21.01.0TF
:crockford32325.01.00123456789ABCDEFGHJKMNPQRSTVWXYZ
:decimal103.320.620123456789
:dna42.01.0ACGT
:geohash325.01.00123456789bcdefghjkmnpqrstuvwxyz
:hex164.01.00123456789abcdef
:hex_upper164.01.00123456789ABCDEF
:safe_ascii906.490.8!#$%&()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}~
:safe32325.01.02346789bdfghjmnpqrtBDFGHJLMNPQRT
:safe64646.01.0ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
:symbol284.810.89!#$%&()*+,-./:;<=>?@[]^_{|}~
:url_safe666.040.63ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~
:word_safe32325.01.023456789CFGHJMPQRVWXcfghjmpqrvwx
:z_base32325.01.0ybndrfg8ejkmcpqxot1uwisza345h769

Note: The Metrics section explains ERE and ETE. ETE values in the table above use the default sampler (:bit_shift).

Description of non-obvious character sets
NameDescription
:base16https://datatracker.ietf.org/doc/html/rfc4648#section-8
:base32https://datatracker.ietf.org/doc/html/rfc4648#section-6
:base32_hexLowercase of :base32_hex_upper
:base32_hex_upperhttps://datatracker.ietf.org/doc/html/rfc4648#section-7
:base36Used by many URL shorteners
:base58Bitcoin base58 alphabet (excludes 0, O, I, l)
:base85Used in Adobe PostScript and PDF
:bech32Bitcoin SegWit address encoding
:dnaDNA nucleotide bases (Adenine, Cytosine, Guanine, Thymine)
:ascii85Same as :safe_ascii
:ascii90Same as :base85
:crockford32https://www.crockford.com/base32.html
:geohashUsed for encoding geographic coordinates
:safe_asciiPrintable ascii that does not require escape in String
:safe32Alpha and numbers picked to reduce chance of English words
:safe64https://datatracker.ietf.org/doc/html/rfc4648#section-5
:url_safehttps://datatracker.ietf.org/doc/html/rfc3986#section-2.3
:word_safe32Alpha and numbers picked to reduce chance of English words
:z_base32Zooko's Base32

Custom

Any String of up to 256 unique characters can be used for puid generation, with custom characters optimized in the same manner as the pre-defined character sets. The characters must be unique. This isn't strictly a technical requirement, PUID could handle duplicate characters, but the resulting randomness of the IDs is maximal when the characters are unique, so PUID enforces that restriction.

Metrics

Entropy Representation Efficiency

Entropy Representation Efficiency (ERE) is a measure of how efficient a string ID represents the entropy of the ID itself. When referring to the entropy of an ID, we mean the Shannon Entropy of the character sequence, and that is maximal when all the permissible characters are equally likely to occur. In most random ID generators, this is the case, and the ERE is solely dependent on the count of characters in the charset, where each character represents log2(count) of entropy (a computer specific calc of general Shannon entropy). For example, for a hex charset there are 16 hex characters, so each character "carries" log2(16) = 4 bits of entropy in the string ID. We say the bits per character is 4 and a random ID of 12 hex characters has 48 bits of entropy.

ERE is measured as a ratio of the bits of entropy for the ID divided by the number of bits require to represent the string (8 bits per ID character). If each character is equally probably (the most common case), ERE is (bits-per-char * id_len) / (8 bits * id_len), which simplifies to bits-per-character/8. The BPC displayed in the Puid Characters table is equivalent to the ERE for that charset.

There is, however, a particular random ID exception where each character is not equally probable, namely, the often used v4 format of UUIDs. In that format, there are hyphens that carry no entropy (entropy is uncertainty, and there is no uncertainly as to where those hyphens will be), one hex digit that is actually constrained to 1 of only 4 hex values and another that is fixed. This formatting results in a ID of 36 characters with a total entropy of 122 bits. The ERE of a v4 UUID is, therefore, 122 / (8 * 36) = 0.4236.

Entropy Transform Efficiency

Entropy Transform Efficiency (ETE) is a measure of how efficiently source entropy is transformed into random ID entropy. For charsets with a character count that is a power of 2, all of the source entropy bits can be utilized during random ID generation. Each generated ID character requires exactly log2(count) bits, so the incoming source entropy can easily be carved into appropriate indices for character selection. Since ETE represents the ratio of output entropy bits to input entropy source, when all of the bits are utilized ETE is 1.0.

Even for charsets with power of 2 character count, ETE is only the theoretical maximum of 1.0if the input entropy source is used as described above. Unfortunately, that is not the case with many random ID generation schemes. Some schemes use the entire output of a call to source entropy to create a single index used to select a character. Such schemes have very poor ETE.

For charsets with a character count that is not a power of 2, some bits will inevitably be discarded since the smallest number of bits required to select a character, ceil(log2(count)), will potentially result in an index beyond the character count. A first-cut, naïve approach to this reality is to simply throw away all the bits when the index is too large.

However, more sophisticated schemes can improve on the naïve approach. For non-power-of-2 character sets, Puid supports two sampler strategies:

As an example of :bit_shift, using the :alphanum_lower charset (36 chars), ceil(log2(36)) = 6 bits are required to create an index. If those bits start with 11xxxx, the index is out of bounds regardless of xxxx, so Puid discards only the first two bits and reuses the trailing four bits in the next draw. This improves ETE versus naïve full-slice rejection (naïve: 0.485, :bit_shift: 0.646, a 33% improvement).

For the same :alphanum_lower example, :interval improves ETE further to about 0.937 (compared with 0.646 for :bit_shift and 0.485 for naïve full-slice rejection), at some additional computational cost. The bench/alphanum_lower_bit_shift_ete.exs and bench/alphanum_lower_interval_ete.exs scripts provide parallel detailed walk-throughs, and info().ete reports the selected sampler’s efficiency.

Comparisons

Speed

bench/compare_libs.exs provides some comparison of Puid and other random ID libraries.

Notes

Example

MIX_ENV=test mix run bench/compare_libs.exs
Name ips average deviation median 99th %
Puid hex (CSPRNG) 47.74 20.95 ms ±1.13% 20.90 ms 21.59 ms
SecureRandom urlsafe_base64 40.63 24.61 ms ±2.90% 24.28 ms 27.22 ms
Puid safe64 (PRNG) 32.80 30.48 ms ±5.00% 30.47 ms 33.02 ms
Puid safe64 (CSPRNG) 29.75 33.61 ms ±3.30% 33.21 ms 38.22 ms
UUID v4 (string) 27.39 36.51 ms ±6.40% 35.26 ms 41.94 ms
Puid alphanum (CSPRNG) 12.68 78.85 ms ±1.44% 78.72 ms 82.99 ms
EntropyString safe64 5.69 175.63 ms ±0.77% 175.78 ms 178.20 ms
Randomizer alphanum 22 4.63 216.09 ms ±14.59% 206.54 ms 304.81 ms
Common Solution alphanum 1.25 797.20 ms ±1.14% 796.77 ms 806.53 ms
Nanoid (CSPRNG) 0.79 1266.43 ms ±0.99% 1266.43 ms 1275.32 ms
Comparison:
Puid hex (CSPRNG) 47.74
SecureRandom urlsafe_base64 40.63 - 1.17x slower +3.66 ms
Puid safe64 (PRNG) 32.80 - 1.46x slower +9.54 ms
Puid safe64 (CSPRNG) 29.75 - 1.60x slower +12.66 ms
UUID v4 (string) 27.39 - 1.74x slower +15.57 ms
Puid alphanum (CSPRNG) 12.68 - 3.76x slower +57.90 ms
EntropyString safe64 5.69 - 8.38x slower +154.68 ms
Randomizer alphanum 22 4.63 - 10.31x slower +195.14 ms
Common Solution alphanum 1.25 - 38.05x slower +776.25 ms
Nanoid (CSPRNG) 0.79 - 60.45x slower +1245.48 ms

Puid charsets ID length and ERE

The bench/puid_ere_len.exs script outputs a markdown table comparing the number of actual entropy bits and resulting puid lengths for each Puid predefined charset. The default target bits are 64, 86, 128, 256.

mix run bench/puid_ere_len.exs [bits...]
charsetbitslenbitslenbitslen
6496128
alpha68.411296.9117131.1123
alpha_lower65.811498.7121131.6128
alpha_upper65.811498.7121131.6128
alphanum65.511101.2217130.9922
alphanum_lower67.211398.2319129.2525
alphanum_upper67.211398.2319129.2525
base1664.01696.024128.032
base3265.013100.020130.026
base32_hex65.013100.020130.026
base32_hex_upper65.013100.020130.026
base5864.441199.5917128.8822
crockford3265.013100.020130.026
decimal66.442096.3429129.5639
hex64.01696.024128.032
hex_upper64.01696.024128.032
safe_ascii64.921097.3815129.8420
safe3265.013100.020130.026
safe6466.01196.016132.022
symbol67.31496.1520129.827
word_safe3265.013100.020130.026