__Intro__

__Results__

The standard architecture outputs a softmax over the 10 class labels, interpreted as the probability of each class given the input. On a noise input I suppose we would want all 10 of the probabilities to be small; specifically, close to 1/10, since they must sum to 1.

The following histogram is produced by feeding 500 random noise inputs into an MNIST-trained architecture and recording, for each image, the magnitude of the maximally activated class probability. We can see, for instance, that ~20 of the noise images activated a class probability above 90%.

It does not appear that neural nets handle noise inputs well.
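The experiment can be sketched as follows. Since the trained MNIST network is not reproduced here, a randomly initialized linear-softmax model stands in for it; the stand-in model, its weights, and the flattened 784-pixel input shape are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained MNIST classifier: a random linear map + softmax.
W = rng.normal(size=(784, 10))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 500 random noise "images", each flattened to 784 pixels in [0, 1).
noise = rng.random((500, 784))
probs = softmax(noise @ W)

# For each noise input, record the maximally activated class probability.
max_probs = probs.max(axis=1)

# Histogram of the max probabilities: how confidently does the net treat noise?
counts, edges = np.histogram(max_probs, bins=10, range=(0.0, 1.0))
```

Note that the maximum of 10 probabilities summing to 1 is always at least 1/10, so the histogram's support starts at 0.1.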

Heap's algorithm (named after its inventor and not the data structure) generates all $N!$ permutations of $N$ elements, producing each successive permutation from the previous one by a single swap.

The following is the pattern of swaps generated by Heap's algorithm for N=6, where each $(1\ 5)$ entry is the net effect of a recursive call on the first five elements and the $(i\ 6)$ entries are the top-level swaps.

```
(1 5)
(1 6)
(1 5)
(2 6)
(1 5)
(3 6)
(1 5)
(4 6)
(1 5)
(5 6)
(1 5)
```

- Consider the first 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1)(5 6)).
- Consider the final 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1 6)(5)).
- Consider the middle swaps (all swaps not yet considered, from (2 6) to (4 6)). For these, the right column swaps will never interact with the left column swaps because they never have an element in common. The right column swaps will form a cycle and the left column will fix the 1 and 5. (1)(2 3 4 6)(5).
- In total we have:

```
(1)(5 6) (1)(2 3 4 6)(5) (1 6)(5)
```

Direct calculation reveals this reduces to a cycle:

```
(1 6 5 2 3 4)
```
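This reduction is easy to check mechanically by composing the three stages left to right (a small verification sketch; the helper below is mine):

```python
def cycles_to_map(cycles):
    """Build a mapping on {1..6} from a list of cycles (fixed points implicit)."""
    m = {i: i for i in range(1, 7)}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[:1]):
            m[a] = b
    return m

# The three stages, applied left to right.
f1 = cycles_to_map([(5, 6)])        # (1)(5 6)
f2 = cycles_to_map([(2, 3, 4, 6)])  # (1)(2 3 4 6)(5)
f3 = cycles_to_map([(1, 6)])        # (1 6)(5)

composite = {x: f3[f2[f1[x]]] for x in range(1, 7)}
expected = cycles_to_map([(1, 6, 5, 2, 3, 4)])
assert composite == expected  # the product is the single cycle (1 6 5 2 3 4)
```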

- The general pattern for (even) N can be seen as:

```
(1)(N-1 N) (1)(2 3 4 5 ... N-2 N)(N-1) (1 N)(N-1)
```

which always reduces to

```
(1 N N-1 2 3 4 5 ... N-2)
```

in particular, a full cycle.

The following is the pattern of swaps generated by Heap's algorithm for N=5, where each $(1\ 4\ 3\ 2)$ entry is the net effect of a recursive call on the first four elements and the $(1\ 5)$ entries are the top-level swaps:

```
(1 4 3 2)
(1 5)
(1 4 3 2)
(1 5)
(1 4 3 2)
(1 5)
(1 4 3 2)
(1 5)
(1 4 3 2)
```

Direct calculation reveals this reduces to

```
(1 5)(2)(3)(4)
```

This is the general case for all (odd) N:

```
(1 N)(2)(3)...(N-1)
```

In other words only the first and last elements swap and the rest of the elements are fixed.
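Both parities can be checked with a direct implementation of Heap's algorithm that records the arrangement left behind after a full run (a sketch; the positions below are 0-based where the text's cycle notation is 1-based):

```python
def heaps(a, n, out):
    """Heap's algorithm: emit every permutation of a[:n] via single swaps."""
    if n == 1:
        out.append(tuple(a))
        return
    for i in range(n - 1):
        heaps(a, n - 1, out)
        # For even n swap positions (i+1, n); for odd n always swap (1, n) (1-based).
        j = i if n % 2 == 0 else 0
        a[j], a[n - 1] = a[n - 1], a[j]
    heaps(a, n - 1, out)

# Even case, N = 6: net effect is the full cycle (1 6 5 2 3 4).
a = list(range(1, 7))
perms = []
heaps(a, 6, perms)
assert len(set(perms)) == 720   # all 6! permutations were generated
assert a == [4, 5, 2, 3, 6, 1]  # element 1 ends at position 6, 6 at 5, 5 at 2, ...

# Odd case, N = 5: net effect is the single transposition (1 5).
b = list(range(1, 6))
heaps(b, 5, [])
assert b == [5, 2, 3, 4, 1]
```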

From the point of view of an arbitrary index location K, Heap's algorithm alternates a prefix action with a sequence of pair swaps that depends on the prefix action.

If the prefix action cycles the prefix as in case 2 above then the pair swaps are the (1 K), (1 K), ...,(1 K) sequence. This results in the prefix that includes K doing the head-tail swap.

If the prefix action is the head-tail swap then the pair swaps are (1 K), (2 K), (3 K),...,(K-1 K) as in case 1 above. This results in the prefix that includes K doing a full cycle.

To realize this view in an implementation we would have to keep track of an alternating state for each index location K, recording which prefix action comes next.

We wish to uniformly randomly sample binary strings of length $D$ that, when interpreted as integers, lie between $0$ and some upper bound $U$, $0 \leq U \leq 2^D-1$.

Equivalently we wish to uniformly randomly sample leaves of a binary tree of depth $D$ where the sampled leaves lie to the left of (or on) an upper bound leaf $U$ .

We seek to minimize the number of individual $0,1$ samples necessary to produce a uniformly distributed sample string $S$ such that $S \leq U$ . We denote the expected number of individual $0,1$ samples $E_{cost}.$

The simplest procedure is to sample $S$ by concatenating $D$ independent $0,1$ samples and rejecting $S$ whenever $S\gt U$ .

We will write $E_{cost}$ for this procedure as

$$E_{cost} = p_{acc}\,D + p_{rej}\,(E_{waste} + E_{cost}) \qquad (1)$$

where $p_{acc}$ is the probability that we accept $S$, $p_{rej}\,(=1-p_{acc})$ is the probability that we reject $S$, and $E_{waste}$ is the expected number of $0,1$ samples used ("wasted") per $S$ that we reject.

This equation can be rewritten as

$$E_{cost} = D + \frac{p_{rej}}{p_{acc}}\,E_{waste} \qquad (2)$$

It is easy to see that if we fix $U=000...0$ then $p_{rej}=\frac{2^D-1}{2^D}$ and $p_{acc}=\frac{1}{2^D}$, therefore $\frac{p_{rej}}{p_{acc}}=2^D-1$. Note that under naive rejection sampling $E_{waste}=D$. This shows that $E_{cost}$ has worst case (over choice of $U$) exponentially large magnitude using naive rejection sampling. Indeed, $D \leq E_{cost} \leq D + D(2^D-1) = D \cdot 2^D$.
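As a sanity check on the worst case, the naive cost can be computed exactly; the closed form $E_{cost}=D/p_{acc}$ follows by solving the recursion for $E_{cost}$. The numbers below are a small arithmetic sketch, not a simulation.

```python
import math

D = 10
# U = 000...0: exactly one accepted string out of 2^D.
p_acc = 1 / 2**D
p_rej = 1 - p_acc
E_waste = D  # naive rejection always spends all D samples before rejecting

# Solving E = p_acc*D + p_rej*(E_waste + E) gives E = D + (p_rej/p_acc)*E_waste.
E_cost = D + (p_rej / p_acc) * E_waste
assert math.isclose(E_cost, D * 2**D)  # exponentially large: 10 * 1024 = 10240
```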

We will modify naive rejection sampling in two ways and then show that under this modified procedure $E_{cost} \leq D+2$ .

The modifications are:

- *Prefix matching*. We will perform rejection sampling restricted to the smallest subtree that contains all of the in-range leaves (instead of sampling from the entire tree). We will refer to this smallest containing subtree as the $SCS$. Note that there are $D$ different $SCS$s and that each $U$ is contained under one of them.
- *Early rejection*. We abort a tree descent as soon as the last $0,1$ sample places us in a subtree with no legal leaves, rather than continuing for the full length $D$.

We will call naive rejection sampling with these two modifications *modified sampling*.
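A direct implementation of modified sampling might look like the following sketch; the bit-list representation, the function name, and the cost bookkeeping are my own choices, not from the original.

```python
import random

def modified_sample(U_bits, rng):
    """Sample a uniform bit string S <= U using prefix matching + early rejection.

    Returns (bits, cost), where cost counts the individual 0,1 samples used.
    """
    D = len(U_bits)
    # Prefix matching: the SCS root sits below U's leading zeros, so those
    # bits are forced to 0 and cost no samples.
    k = next((i for i, b in enumerate(U_bits) if b == 1), D)
    cost = 0
    while True:
        bits = U_bits[:k]     # forced prefix (fresh copy each attempt)
        tight = True          # sampled bits still equal U's bits so far
        accepted = True
        for d in range(k, D):
            b = rng.getrandbits(1)
            cost += 1
            if tight and b > U_bits[d]:
                accepted = False  # early rejection: this subtree has no legal leaves
                break
            if tight and b < U_bits[d]:
                tight = False     # strictly below U: remaining bits unconstrained
            bits.append(b)
        if accepted:
            return bits, cost

rng = random.Random(0)
U = [1, 0, 1, 0, 1, 1, 1, 1]  # U = 175, D = 8
samples = [modified_sample(U, rng) for _ in range(2000)]
values = [int("".join(map(str, s)), 2) for s, _ in samples]
assert 0 <= min(values) and max(values) <= 175  # every sample is in range
```

Every accepted descent produces a full-length string, and a descent aborts the moment it leaves the legal region, which is exactly the two modifications above.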

Under prefix matching, the $SCS$ has the property that all of its left child's leaves are in-range and at least one of its right child's leaves is in-range (if none of the right child's leaves are in-range then we can decrease the size of the $SCS$ by descending to the left child).

This implies that $p_{acc} \geq {1\over2}.$ Therefore, $p_{acc}\gt p_{rej}$ and therefore ${p_{rej}\over p_{acc}}\leq1.$

Generally,

$$\frac{p_{rej}}{p_{acc}} = \frac{2^{D-k-1}-j}{2^{D-k-1}+j} \qquad (3)$$

where $k$ is the depth of the $SCS$ and $1 \leq j \leq 2^{D-k-1}$ is the number of in-range leaves in the right child's subtree, depending on $U$.

We now consider $E_{waste}$ as a function of $U$. Every $U$ specifies a unique path in the binary tree. At every depth $d$ in the path where the left child was chosen ($U_d=0$), the right child roots a subtree with $2^{D-(d+1)}$ leaves that can be early-rejected using $(d+1)$ samples (depths are indexed from $d=0$ at the root).

This gives, for an $SCS$ of height $D$ :

For smaller $SCS$ there are two necessary observations:

- For all $U$ that correspond to the same $SCS$, each additional $U_d=0$ contributes an additional positive term in $(4)$ and thus increases $E_{waste}(U)$. Therefore the $U$ that maximizes $E_{waste}(U)$ for a given $SCS$ is the $U$ with all possible elements $0$.
- Across $SCS$ of different heights, $E_{waste}(U)$ is maximized by the $SCS$ that is the full binary tree for $D$. This is because the maximal $U$ for smaller $SCS$ simply leave out terms from $(4)$ above.

From these observations we can conclude that $E_{waste}(U)$ is maximized over all $U$ at $U=100...0$, and the value of $E_{waste}(U)$ there is $(4)$ with all terms in the sum included.

We can then use the fact that

$$\sum_{d=0}^{\infty} \frac{d+1}{2^{\,d+1}} = 2$$

to conclude that $E_{waste} \leq 2$ for every $U$, with the maximum over $U$ occurring at $U=100...0$.

We have shown that the $U$ that maximizes both ${p_{rej}\over p_{acc}}$ and $E_{waste}$ is $U=100...0$. Therefore the maximum of their product is also achieved at $U=100...0$ and in particular is at most

$$\frac{p_{rej}}{p_{acc}}\,E_{waste} \leq 1 \cdot 2 = 2,$$

which directly implies that $E_{cost} \leq D+2$.

Similar reasoning seems to show that $E_{cost} \leq D+n$ .
