## Friday, August 21, 2020

### Weight visualization of 0-layer autoencoder on MNIST

Border pixels are always 0 in MNIST and thus do not receive training signal; other pixels learn the identity function for their corresponding picture in the input.

## Thursday, June 18, 2020

### What do neural nets do on noise inputs?

Intro
In this post we train a standard convolutional architecture on MNIST and then feed random noise inputs to the trained neural net.

The standard architecture ouputs a softmax over the 10 class labels interpreted as the probability of each class given then input. On a noise input I suppose we would want all of the 10 probabilities to be small; specifically close to 1/10 since they must sum to 1.

Results
The following histogram is produced by feeding 500 random noise inputs into an MNIST-trained architecture and counting the magnitude of the maximally activated class probability for each image. We can see for instance that ~20 of the noise images activated a class probability at over 90%.

It does not appear that neural nets handle noise inputs well.

# A brief explanation of Heap's algorithm

Heap's algorithm (named after its inventor and not the data structure) generates all $N!$ permutations while only swapping two elements between successive permutations and requiring only $O(N)$ memory overhead and something like $O(1)$ amortized computation per step. It's not clear that this should even be possible. Unfortunately understanding how it works doesn't seem to lend any deep insights into the fundamental nature of computation. Also unfortunately the explanation presented here requires calculating permutation compositions represented in cycle notation and seeing that it works for all N requires doing these calculations.

## CASE 1

The following is the pattern of swaps generated by Heap's algorithm for N=6.

(1 5) (1 6) (1 5) (2 6) (1 5) (3 6) (1 5) (4 6) (1 5) (5 6) (1 5)
1. Consider the first 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1)(5 6)).
2. Consider the final 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1 6)(5)).
3. Consider the middle swaps (all swaps not yet considered, from (2 6) to (4 6)). For these, the right column swaps will never interact with the left column swaps because they never have an element in common. The right column swaps will form a cycle and the left column will fix the 1 and 5. (1)(2 3 4 6)(5).
4. In total we have:
(1)(5 6) (1)(2 3 4 6)(5) (1 6)(5)

Direct calculation reveals this reduces to a cycle:

(1 6 5 2 3 4)
1. The general pattern for (even) N can be seen as:
(1)(N-1 N) (1)(2 3 4 5 ... N-2 N)(N-1) (1 N)(N-1)

which always reduces to

(1 N N-1 2 3 4 5 ... N-2)

in particular, a full cycle.

## CASE 2

The following is the pattern of swaps generated by Heap's algorithm for N=5:

(1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2)

Direct calculation reveals this reduces to

(1 5)(2)(3)(4)

This is the general case for all (odd) N:

(1 N)(2)(3)...(N-1)

In other words only the first and last elements swap and the rest of the elements are fixed.

## All together

From the point of view of an arbitrary index location K, Heap's algorithm alternates a prefix action with a specified sequence of pair swaps depending on the prefix action.

If the prefix action cycles the prefix as in case 2 above then the pair swaps are the (1 K), (1 K), ...,(1 K) sequence. This results in the prefix that includes K doing the head-tail swap.

If the prefix action is the head-tail swap then the pair swaps are (1 K), (2 K), (3 K),...,(K-1 K) as in case 1 above. This results in the prefix that includes K doing a full cycle.

We have to keep track of an $O(N)$ array and scan into it each step to know what index to swap.

## Problem

We wish to uniformly randomly sample binary strings of length $D$ that when interpreted as integers lie between 0 and some upper bound $U,\space0 \leq U \leq 2^D-1$.

Equivalently we wish to uniformly randomly sample leaves of a binary tree of depth $D$where the sampled leaves lie to the left of (or on) an upper bound leaf $U$.

We seek to minimize the number of individual $0,1$ samples necessary to produce a uniformly distributed sample string $S$ such that $S \leq U$. We denote the expected number of individual $0,1$ samples $E_{cost}.$

$E_{cost}$
depends on $D$$U$, and the particular sampling procedure we use. However it is clear, over all choices of $D$$U$, and sampling procedure, that $E_{cost} \geq D$ and achieves this lower bound for $U=11...1$. (When $U=11...1$ any leaf is valid and it takes $D$ samples to select a leaf).

## Naive rejection has worst case exponential $E_{cost}$

The simplest procedure is to sample $S$ by concatenating $D$ independent $0,1$ samples and rejecting $S$ whenever $S\gt U$.

We will write $E_{cost}$ for this procedure as

$E_{cost}=p_{acc}D + p_{rej}(E_{waste}+E_{cost})\tag{1}$

where $p_{acc}$ is the probability that we accept $S$$p_{rej}(=1-p_{acc})$ is the probability that we reject $S$, and $E_{waste}$ is the expected number of $0,1$ samples used ("wasted") per reject $S$that we reject.

This equation can be rewritten as

$E_{cost}=D + \frac{p_{rej}}{p_{acc}}E_{waste}\tag{2}.$

It is easy to see that if we fix $U=000...0$ then $p_{rej}=\frac{2^D-1}{2^D}$$p_{acc}=\frac{1}{2^D}$, therefore $\frac{p_{rej}}{p_{acc}}=2^D-1$. Note that under naive rejection sampling $E_{waste}=D$. This shows that $E_{cost}$has worst case (over choice of $U$) exponentially large magnitude using naive rejection sampling. Indeed, $0\leq E_{cost} \leq D(2^D-1$).

## Prefix matching and early rejection

We will modify naive rejection sampling in two ways and then show that under this modified procedure $E_{cost} \leq D+2$.

The modifcations are:
1. Prefix matching. We will perform rejection sampling restricted to the smallest subtree that contains all of the in-range leaves (instead of sampling from the entire tree). We will refer to this smallest containing subtree as the $SCS$. Note that there are $D$different $SCS$s and that each $U$ is contained under one of them.
2. Early rejection. We abort a tree descent as soon as the last $0,1$ sample places us in a subtree with no legal leaves rather than continuing for the full length $D$.
We will call naive rejection sampling with these two modifications modified sampling.

## Under modified sampling, $p_{rej}\over p_{acc}$ achieves maximum at $U=100...0$

Under prefix matching, the $SCS$ has the property that all of its left child's leaves are in-range and at least one of its right child's leaves is in-range (if none of the right child's leaves are in-range then we can decrease the size of the $SCS$ by descending to the left child).

This implies that $p_{acc} \geq {1\over2}.$ Therefore, $p_{acc}\gt p_{rej}$ and therefore ${p_{rej}\over p_{acc}}\leq1.$
Generally,

$\frac{p_{rej}}{p_{acc}} = \frac{2^{D-k}-j}{2^{D-k}+j}\tag{Eqn. 3}$

where $k$ is the depth of the $SCS$ and $1 \geq j>2^{D-k-1}$ is the number of in-range leaves in that subtree depending on $U$.

$(3)$ is maximized when $k=0$ and $j=1$. This occurs for $U=100...0$.

## Under modified sampling, $E_{waste}$ achieves maximum at $U=100...0$

We now consider $E_{waste}$ as a function of $U$. Every $U$ specifies a unique path in the binary tree. At every depth $d$ in a path where the left child was chosen ($U_d=0$) the right child represents a subtree with $2^{D-(d+1)}$ leaves that can be early-rejected using $(d+1)$ samples.

This gives, for an $SCS$ of height $D$:

$E_{waste}(U)=\sum_{\{d|U_d=0\}}^{D} (d+1)(\frac{2^{D-(d+1)}}{2^D})=\sum_{\{d|U_d=0\}}^{D} \frac{(d+1)}{2^{(d+1)}}.\tag{4}$

For smaller $SCS$ there are two necessary observations:
1. For all $U$ that correspond to the same $SCS$ each additional $U_d=0$ contributes an additional positive term in $(4)$ and thus increases $E_{waste}(U)$. Therefore the $U$ that maximizes $E_{waste}(U)$ for a given $SCS$ is the $U$ with all possible elements $0$.
2. Across $SCS$ of different heights $E_{waste}(U)$ is maximized by the $SCS$ that is the full binary tree for $D$. This is because the max $U$ for smaller $SCS$ simply leave out terms from $(4)$ above.
From these observations we can conclude that for a $E_{waste}(U)$ is maximized over all $U$ at $U=100...0$ and the value of $E_{waste}(U)$ is $(4)$ with all terms in the sum included.

We can then use the fact that

$\lim_{D \to \infty} \sum_{d=0}^{D} \frac{(d+1)}{2^{(d+1)}}=2$

to conclude that $E_{waste} \leq 2$ and this is achieved at $U=100...0$.

## Worst case is therefore $D+2$

We have shown that the $U$ that maximises both ${p_{rej}\over p_{acc}}$ and $E_{waste}$ is $U=100...0$. Therefore the maximum of their product is also achieved at $U=100...0$ and in particular is at most:

$(\frac{2^{D}-1}{2^{D}+1})(2)$

which directly implies that $E_{cost} \leq D+2$.

## Extension to $n$-ary strings

Similar reasoning seems to show that $E_{cost} \leq D+n$.