random computation: Uniformly sampling large binary integers with an upper bound

Problem

We wish to uniformly randomly sample binary strings of length

D

that when interpreted as integers lie between 0 and some upper bound

U,\space0 \leq U \leq 2^D-1

Equivalently we wish to uniformly randomly sample leaves of a binary tree of depth

D

where the sampled leaves lie to the left of (or on) an upper bound leaf

U

We seek to minimize the number of individual

0,1

samples necessary to produce a uniformly distributed sample string

S

such that

S \leq U

. We denote the expected number of individual

0,1

samples

E_{cost}.

depends on

D

U

, and the particular sampling procedure we use. However it is clear, over all choices of

D

U

, and sampling procedure, that

E_{cost} \geq D

and achieves this lower bound for

U=11...1

. (When

U=11...1

any leaf is valid and it takes

D

samples to select a leaf).

Naive rejection has worst case exponential $E_{cost}$

The simplest procedure is to sample

S

by concatenating

D

independent

0,1

samples and rejecting

S

whenever

S\gt U

We will write

E_{cost}

for this procedure as

E_{cost}=p_{acc}D + p_{rej}(E_{waste}+E_{cost})\tag{1}

where

p_{acc}

is the probability that we accept

S

p_{rej}(=1-p_{acc})

is the probability that we reject

S

, and

E_{waste}

is the expected number of

0,1

samples used ("wasted") per reject

S

that we reject.

This equation can be rewritten as

E_{cost}=D + \frac{p_{rej}}{p_{acc}}E_{waste}\tag{2}.

It is easy to see that if we fix

U=000...0

then

p_{rej}=\frac{2^D-1}{2^D}

p_{acc}=\frac{1}{2^D}

, therefore

\frac{p_{rej}}{p_{acc}}=2^D-1

. Note that under naive rejection sampling

E_{waste}=D

. This shows that

E_{cost}

has worst case (over choice of

U

) exponentially large magnitude using naive rejection sampling. Indeed,

0\leq E_{cost} \leq D(2^D-1

Prefix matching and early rejection

We will modify naive rejection sampling in two ways and then show that under this modified procedure

E_{cost} \leq D+2

The modifcations are:

Prefix matching. We will perform rejection sampling restricted to the smallest subtree that contains all of the in-range leaves (instead of sampling from the entire tree). We will refer to this smallest containing subtree as the $SCS$ . Note that there are $D$ different $SCS$ s and that each $U$ is contained under one of them.
Early rejection. We abort a tree descent as soon as the last $0,1$ sample places us in a subtree with no legal leaves rather than continuing for the full length $D$ .

We will call naive rejection sampling with these two modifications modified sampling.

Under modified sampling, $p_{rej}\over p_{acc}$ achieves maximum at $U=100...0$

Under prefix matching, the

SCS

has the property that all of its left child's leaves are in-range and at least one of its right child's leaves is in-range (if none of the right child's leaves are in-range then we can decrease the size of the

SCS

by descending to the left child).

This implies that

p_{acc} \geq {1\over2}.

Therefore,

p_{acc}\gt p_{rej}

and therefore

{p_{rej}\over p_{acc}}\leq1.

Generally,

\frac{p_{rej}}{p_{acc}} = \frac{2^{D-k}-j}{2^{D-k}+j}\tag{Eqn. 3}

where

k

is the depth of the

SCS

and

1 \geq j>2^{D-k-1}

is the number of in-range leaves in that subtree depending on

U

(3)

is maximized when

k=0

and

j=1

. This occurs for

U=100...0

Under modified sampling, $E_{waste}$ achieves maximum at $U=100...0$

We now consider

E_{waste}

as a function of

U

. Every

U

specifies a unique path in the binary tree. At every depth

d

in a path where the left child was chosen (

U_d=0

) the right child represents a subtree with

2^{D-(d+1)}

leaves that can be early-rejected using

(d+1)

samples.

This gives, for an

SCS

of height

D

E_{waste}(U)=\sum_{\{d|U_d=0\}}^{D} (d+1)(\frac{2^{D-(d+1)}}{2^D})=\sum_{\{d|U_d=0\}}^{D} \frac{(d+1)}{2^{(d+1)}}.\tag{4}

For smaller

SCS

there are two necessary observations:

For all $U$ that correspond to the same $SCS$ each additional $U_d=0$ contributes an additional positive term in $(4)$ and thus increases $E_{waste}(U)$ . Therefore the $U$ that maximizes $E_{waste}(U)$ for a given $SCS$ is the $U$ with all possible elements $0$ .
Across $SCS$ of different heights $E_{waste}(U)$ is maximized by the $SCS$ that is the full binary tree for $D$ . This is because the max $U$ for smaller $SCS$ simply leave out terms from $(4)$ above.

From these observations we can conclude that for a

E_{waste}(U)

is maximized over all

U

U=100...0

and the value of

E_{waste}(U)

(4)

with all terms in the sum included.

We can then use the fact that

\lim_{D \to \infty} \sum_{d=0}^{D} \frac{(d+1)}{2^{(d+1)}}=2

to conclude that

E_{waste} \leq 2

and this is achieved at

U=100...0

Worst case is therefore $D+2$

We have shown that the

U

that maximises both

{p_{rej}\over p_{acc}}

and

E_{waste}

U=100...0

. Therefore the maximum of their product is also achieved at

U=100...0

and in particular is at most:

(\frac{2^{D}-1}{2^{D}+1})(2)

which directly implies that

E_{cost} \leq D+2

Extension to $n$ -ary strings

Similar reasoning seems to show that

E_{cost} \leq D+n

random computation

Tuesday, April 16, 2019

Uniformly sampling large binary integers with an upper bound

Problem

Naive rejection has worst case exponential $E_{cost}$

Prefix matching and early rejection

Under modified sampling, $p_{rej}\over p_{acc}$ achieves maximum at $U=100...0$

Under modified sampling, $E_{waste}$ achieves maximum at $U=100...0$

Worst case is therefore $D+2$

Extension to $n$ -ary strings

No comments:

Post a Comment

Tuesday, April 16, 2019

Uniformly sampling large binary integers with an upper bound

Problem

Naive rejection has worst case exponential E_{cost}Ecost​

Prefix matching and early rejection

Under modified sampling, p_{rej}\over p_{acc}pacc​prej​​ achieves maximum at U=100...0U=100...0

Under modified sampling, E_{waste}Ewaste​ achieves maximum at U=100...0U=100...0

Worst case is therefore D+2D+2

Extension to nn-ary strings

No comments:

Post a Comment

Naive rejection has worst case exponential $E_{cost}$

Under modified sampling, $p_{rej}\over p_{acc}$ achieves maximum at $U=100...0$

Under modified sampling, $E_{waste}$ achieves maximum at $U=100...0$

Worst case is therefore $D+2$

Extension to $n$ -ary strings