Tuesday, April 16, 2019

Uniformly sampling large binary integers with an upper bound


Problem

We wish to uniformly randomly sample binary strings of length D that when interpreted as integers lie between 0 and some upper bound U,\space0 \leq U \leq 2^D-1.

Equivalently we wish to uniformly randomly sample leaves of a binary tree of depth Dwhere the sampled leaves lie to the left of (or on) an upper bound leaf U.

We seek to minimize the number of individual 0,1 samples necessary to produce a uniformly distributed sample string S such that S \leq U. We denote the expected number of individual 0,1 samples E_{cost}.

E_{cost}
 depends on DU, and the particular sampling procedure we use. However it is clear, over all choices of DU, and sampling procedure, that E_{cost} \geq D and achieves this lower bound for U=11...1. (When U=11...1 any leaf is valid and it takes D samples to select a leaf).

Naive rejection has worst case exponential E_{cost}

The simplest procedure is to sample S by concatenating D independent 0,1 samples and rejecting S whenever S\gt U.

We will write E_{cost} for this procedure as


E_{cost}=p_{acc}D + p_{rej}(E_{waste}+E_{cost})\tag{1}


where p_{acc} is the probability that we accept Sp_{rej}(=1-p_{acc}) is the probability that we reject S, and E_{waste} is the expected number of 0,1 samples used ("wasted") per reject Sthat we reject.

This equation can be rewritten as


E_{cost}=D + \frac{p_{rej}}{p_{acc}}E_{waste}\tag{2}.


It is easy to see that if we fix U=000...0 then p_{rej}=\frac{2^D-1}{2^D}p_{acc}=\frac{1}{2^D}, therefore \frac{p_{rej}}{p_{acc}}=2^D-1. Note that under naive rejection sampling E_{waste}=D. This shows that E_{cost}has worst case (over choice of U) exponentially large magnitude using naive rejection sampling. Indeed, 0\leq E_{cost} \leq D(2^D-1).

Prefix matching and early rejection

We will modify naive rejection sampling in two ways and then show that under this modified procedure E_{cost} \leq D+2.

The modifcations are:
  1. Prefix matching. We will perform rejection sampling restricted to the smallest subtree that contains all of the in-range leaves (instead of sampling from the entire tree). We will refer to this smallest containing subtree as the SCS. Note that there are Ddifferent SCSs and that each U is contained under one of them.
  2. Early rejection. We abort a tree descent as soon as the last 0,1 sample places us in a subtree with no legal leaves rather than continuing for the full length D.
We will call naive rejection sampling with these two modifications modified sampling.

Under modified sampling, p_{rej}\over p_{acc} achieves maximum at U=100...0

Under prefix matching, the SCS has the property that all of its left child's leaves are in-range and at least one of its right child's leaves is in-range (if none of the right child's leaves are in-range then we can decrease the size of the SCS by descending to the left child).

This implies that p_{acc} \geq {1\over2}. Therefore, p_{acc}\gt p_{rej} and therefore {p_{rej}\over p_{acc}}\leq1.
Generally,

\frac{p_{rej}}{p_{acc}} = \frac{2^{D-k}-j}{2^{D-k}+j}\tag{Eqn. 3}


where k is the depth of the SCS and 1 \geq j>2^{D-k-1} is the number of in-range leaves in that subtree depending on U.

(3) is maximized when k=0 and j=1. This occurs for U=100...0.

Under modified sampling, E_{waste} achieves maximum at U=100...0

We now consider E_{waste} as a function of U. Every U specifies a unique path in the binary tree. At every depth d in a path where the left child was chosen (U_d=0) the right child represents a subtree with 2^{D-(d+1)} leaves that can be early-rejected using (d+1) samples.

This gives, for an SCS of height D:


E_{waste}(U)=\sum_{\{d|U_d=0\}}^{D} (d+1)(\frac{2^{D-(d+1)}}{2^D})=\sum_{\{d|U_d=0\}}^{D} \frac{(d+1)}{2^{(d+1)}}.\tag{4}


For smaller SCS there are two necessary observations:
  1. For all U that correspond to the same SCS each additional U_d=0 contributes an additional positive term in (4) and thus increases E_{waste}(U). Therefore the U that maximizes E_{waste}(U) for a given SCS is the U with all possible elements 0.
  2. Across SCS of different heights E_{waste}(U) is maximized by the SCS that is the full binary tree for D. This is because the max U for smaller SCS simply leave out terms from (4) above.
From these observations we can conclude that for a E_{waste}(U) is maximized over all U at U=100...0 and the value of E_{waste}(U) is (4) with all terms in the sum included.

We can then use the fact that


\lim_{D \to \infty} \sum_{d=0}^{D} \frac{(d+1)}{2^{(d+1)}}=2


to conclude that E_{waste} \leq 2 and this is achieved at U=100...0.

Worst case is therefore D+2

We have shown that the U that maximises both {p_{rej}\over p_{acc}} and E_{waste} is U=100...0. Therefore the maximum of their product is also achieved at U=100...0 and in particular is at most:


(\frac{2^{D}-1}{2^{D}+1})(2)


which directly implies that E_{cost} \leq D+2.

Extension to n-ary strings

Similar reasoning seems to show that E_{cost} \leq D+n.