Friday, August 21, 2020

Weight visualization of 0-layer autoencoder on MNIST


Border pixels are always 0 in MNIST and thus receive no training signal; every other pixel learns the identity function for the corresponding pixel in the input.
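A tiny NumPy sketch makes this concrete (a toy 4-pixel stand-in for MNIST, not the actual experiment): a 0-layer autoencoder is just a weight matrix trained to reproduce its input, and any pixel that is always 0 contributes nothing to the gradient of its row of the weights.

```python
import numpy as np

# Toy stand-in for MNIST: 4-pixel "images" where pixel 0 is a border
# pixel that is always 0.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
X[:, 0] = 0.0

# 0-layer autoencoder: reconstruction = X @ W, trained with squared error.
W = rng.standard_normal((4, 4))
W0 = W.copy()
lr = 1.0
for _ in range(500):
    grad = X.T @ (X @ W - X) / len(X)  # gradient of mean squared error
    W -= lr * grad

# Row 0 of W multiplies the always-zero pixel, so it receives zero
# gradient and never moves from its random initialization.
assert np.allclose(W[0], W0[0])
# The remaining weights approach the identity map on the active pixels.
print(np.round(W[1:, 1:], 3))
```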

Thursday, June 18, 2020

What do neural nets do on noise inputs?

In this post we train a standard convolutional architecture on MNIST and then feed random noise inputs to the trained neural net.

The standard architecture outputs a softmax over the 10 class labels, interpreted as the probability of each class given the input. On a noise input I suppose we would want all 10 probabilities to be small; specifically, close to 1/10, since they must sum to 1.

The following histogram is produced by feeding 500 random noise inputs into an MNIST-trained architecture and recording the maximally activated class probability for each image. We can see, for instance, that ~20 of the noise images activated a class probability of over 90%.
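The measurement itself is easy to sketch in NumPy (with a random linear map standing in for the trained network, since the post doesn't specify a framework; swap in any trained model that maps a batch of images to 10 logits):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for a trained MNIST classifier: any function from a batch of
# 28x28 images to 10 logits. Here, an untrained random linear map.
rng = np.random.default_rng(0)
W = rng.standard_normal((28 * 28, 10))

def model(images):  # images: (n, 28, 28) -> logits: (n, 10)
    return images.reshape(len(images), -1) @ W

# Feed 500 noise images and record the maximally activated class
# probability for each; these values are what the histogram counts.
noise = rng.random((500, 28, 28))
max_probs = softmax(model(noise)).max(axis=1)
```

Note that the maximum of 10 probabilities summing to 1 is always at least 1/10, so the interesting question is how far above 1/10 the histogram mass sits.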

It does not appear that neural nets handle noise inputs well.

Wednesday, October 9, 2019

A brief explanation of Heap's algorithm


Heap's algorithm (named after its inventor and not the data structure) generates all N! permutations while only swapping two elements between successive permutations, requiring only O(N) memory overhead and something like O(1) amortized computation per step. It's not clear that this should even be possible. Unfortunately, understanding how it works doesn't seem to lend any deep insights into the fundamental nature of computation. Also unfortunately, the explanation presented here requires calculating permutation compositions represented in cycle notation, and seeing that it works for all N requires doing these calculations.


The following is the pattern of swaps generated by Heap's algorithm for N=6.

(1 5) (1 6) (1 5) (2 6) (1 5) (3 6) (1 5) (4 6) (1 5) (5 6) (1 5)
  1. Consider the first 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1)(5 6)).
  2. Consider the final 3 swaps. These only involve elements 1, 5, 6. (Equivalent to (1 6)(5)).
  3. Consider the middle swaps (all swaps not yet considered, from (2 6) to (4 6)). For these, the right column swaps will never interact with the left column swaps because they never have an element in common. The right column swaps will form a cycle and the left column will fix the 1 and 5. (1)(2 3 4 6)(5).
  4. In total we have:
(1)(5 6) (1)(2 3 4 6)(5) (1 6)(5)

Direct calculation reveals this reduces to a cycle:

(1 6 5 2 3 4)
The general pattern for (even) N can be seen as:
(1)(N-1 N) (1)(2 3 4 5 ... N-2 N)(N-1) (1 N)(N-1)

which always reduces to

(1 N N-1 2 3 4 5 ... N-2)

in particular, a full cycle.
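Reductions like this can be checked mechanically. A short sketch (composing the transpositions left to right, as in the calculations above; the function name is mine):

```python
def image_after_swaps(x, swaps):
    """Apply a sequence of transpositions to element x, left to right."""
    for a, b in swaps:
        if x == a:
            x = b
        elif x == b:
            x = a
    return x

# The swap pattern generated by Heap's algorithm for N=6.
swaps = [(1, 5), (1, 6), (1, 5), (2, 6), (1, 5), (3, 6),
         (1, 5), (4, 6), (1, 5), (5, 6), (1, 5)]
mapping = {x: image_after_swaps(x, swaps) for x in range(1, 7)}
# Full cycle (1 6 5 2 3 4): 1->6, 6->5, 5->2, 2->3, 3->4, 4->1
assert mapping == {1: 6, 6: 5, 5: 2, 2: 3, 3: 4, 4: 1}
```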


The following is the pattern of swaps generated by Heap's algorithm for N=5:

(1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2) (1 5) (1 4 3 2)

Direct calculation reveals this reduces to

(1 5)(2)(3)(4)

This is the general case for all (odd) N:

(1 N)(2)(3)...(N-1)

In other words only the first and last elements swap and the rest of the elements are fixed.
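The odd case can be checked the same way; here the prefix action is the 4-cycle (1 4 3 2) rather than a single transposition, so we compose permutations given as dicts (again left to right):

```python
def image_after_perms(x, perms):
    """Apply a sequence of permutations (element -> image dicts), left to right."""
    for p in perms:
        x = p.get(x, x)
    return x

C = {1: 4, 4: 3, 3: 2, 2: 1}       # the prefix 4-cycle (1 4 3 2)
T = {1: 5, 5: 1}                   # the swap (1 5)
seq = [C, T, C, T, C, T, C, T, C]  # Heap's pattern for N=5
mapping = {x: image_after_perms(x, seq) for x in range(1, 6)}
# (1 5)(2)(3)(4): only the first and last elements swap
assert mapping == {1: 5, 5: 1, 2: 2, 3: 3, 4: 4}
```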

All together

From the point of view of an arbitrary index location K, Heap's algorithm alternates between a prefix action and a single pair swap, where the sequence of pair swaps depends on the type of prefix action.

If the prefix action cycles the prefix, as in the N=5 example above, then the pair swaps are the (1 K), (1 K), ..., (1 K) sequence. This results in the prefix that includes K performing the head-tail swap.

If the prefix action is the head-tail swap then the pair swaps are (1 K), (2 K), (3 K), ..., (K-1 K), as in the N=6 example above. This results in the prefix that includes K performing a full cycle.

We have to keep track of an O(N) array of counters and scan into it each step to know which indices to swap.
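This bookkeeping is exactly what the standard iterative form of Heap's algorithm does (a sketch in Python; `c` is the O(N) counter array):

```python
def heap_permutations(a):
    """Yield all permutations of a; successive outputs differ by one swap."""
    a = list(a)
    n = len(a)
    c = [0] * n  # the O(N) counter array: c[i] tracks progress at level i
    yield tuple(a)
    i = 1
    while i < n:
        if c[i] < i:
            if i % 2 == 0:
                # prefix of odd length K = i+1: the (1 K), (1 K), ... swaps
                a[0], a[i] = a[i], a[0]
            else:
                # prefix of even length K = i+1: the (1 K), (2 K), ... swaps
                a[c[i]], a[i] = a[i], a[c[i]]
            yield tuple(a)
            c[i] += 1
            i = 1
        else:
            c[i] = 0
            i += 1
```

For example, `list(heap_permutations([1, 2, 3]))` produces all 6 permutations, each obtained from the previous by swapping exactly two elements.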

Tuesday, April 16, 2019

Uniformly sampling large binary integers with an upper bound


We wish to uniformly randomly sample binary strings of length D that, when interpreted as integers, lie between 0 and some upper bound U, where 0 \leq U \leq 2^D-1.

Equivalently, we wish to uniformly randomly sample leaves of a binary tree of depth D where the sampled leaves lie to the left of (or on) an upper bound leaf U.

We seek to minimize the number of individual 0,1 samples necessary to produce a uniformly distributed sample string S such that S \leq U. We denote the expected number of individual 0,1 samples E_{cost}.

E_{cost} depends on D, U, and the particular sampling procedure we use. However it is clear, over all choices of D, U, and sampling procedure, that E_{cost} \geq D, and this lower bound is achieved for U=11...1. (When U=11...1 any leaf is valid and it takes D samples to select a leaf.)

Naive rejection has worst case exponential E_{cost}

The simplest procedure is to sample S by concatenating D independent 0,1 samples and rejecting S whenever S\gt U.

We will write E_{cost} for this procedure as

E_{cost}=p_{acc}D + p_{rej}(E_{waste}+E_{cost})\tag{1}

where p_{acc} is the probability that we accept S, p_{rej} (=1-p_{acc}) is the probability that we reject S, and E_{waste} is the expected number of 0,1 samples used ("wasted") per S that we reject.

This equation can be rewritten as

E_{cost}=D + \frac{p_{rej}}{p_{acc}}E_{waste}\tag{2}.

It is easy to see that if we fix U=000...0 then p_{rej}=\frac{2^D-1}{2^D} and p_{acc}=\frac{1}{2^D}, therefore \frac{p_{rej}}{p_{acc}}=2^D-1. Note that under naive rejection sampling E_{waste}=D. This shows that E_{cost} has worst case (over choice of U) exponentially large magnitude using naive rejection sampling. Indeed, by (2), E_{cost}=D+(2^D-1)D=D\,2^D for this U.
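The worst case is easy to observe empirically. A quick simulation of naive rejection at U=000...0 for D=4, where the expected cost is D·2^D = 64 bits per accepted sample:

```python
import random

rng = random.Random(0)
D = 4
U = 0  # worst case: only S = 0 is accepted
trials = 5000
total_bits = 0
for _ in range(trials):
    while True:
        S = 0
        for _ in range(D):                 # concatenate D fair bits
            S = (S << 1) | rng.randint(0, 1)
            total_bits += 1
        if S <= U:                         # naive rejection: retry when S > U
            break
print(total_bits / trials)                 # close to D * 2**D = 64
```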

Prefix matching and early rejection

We will modify naive rejection sampling in two ways and then show that under this modified procedure E_{cost} \leq D+2.

The modifications are:
  1. Prefix matching. We will perform rejection sampling restricted to the smallest subtree that contains all of the in-range leaves (instead of sampling from the entire tree). We will refer to this smallest containing subtree as the SCS. Note that there are D different SCSs and that each U is contained under one of them.
  2. Early rejection. We abort a tree descent as soon as the last 0,1 sample places us in a subtree with no legal leaves rather than continuing for the full length D.
We will call naive rejection sampling with these two modifications modified sampling.
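A minimal sketch of modified sampling (representation choices are mine: U is a list of bits, most significant first):

```python
import random

def sample_le(U_bits, rng=random):
    """Uniformly sample a bit list S with S <= U (most significant bit first)."""
    D = len(U_bits)
    k = U_bits.index(1) if 1 in U_bits else D  # number of leading zeros of U
    S = [0] * D          # prefix matching: the first k bits are forced to 0
    if k == D:
        return S         # U = 00...0: the single valid leaf
    while True:
        tight = True     # does S's prefix so far equal U's prefix?
        for d in range(k, D):
            b = rng.randint(0, 1)
            S[d] = b
            if tight and b > U_bits[d]:
                break    # early rejection: this subtree has no valid leaves
            tight = tight and (b == U_bits[d])
        else:
            return S     # completed the descent without rejecting
```

The first sampled bit sits at the SCS root, where U's bit is 1, so that step can never reject; every accepted descent is uniform over the SCS's in-range leaves.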

Under modified sampling, p_{rej}\over p_{acc} achieves maximum at U=100...0

Under prefix matching, the SCS has the property that all of its left child's leaves are in-range and at least one of its right child's leaves is in-range (if none of the right child's leaves are in-range then we can decrease the size of the SCS by descending to the left child).

This implies that p_{acc} \geq {1\over2}. Therefore, p_{acc}\gt p_{rej} and therefore {p_{rej}\over p_{acc}}\leq1.

More precisely,

\frac{p_{rej}}{p_{acc}} = \frac{2^{D-k-1}-j}{2^{D-k-1}+j}\tag{3}

where k is the depth of the SCS and j, with 1 \leq j \leq 2^{D-k-1}, is the number of in-range leaves in the SCS's right child subtree (both depending on U).

(3) is maximized when k=0 and j=1. This occurs for U=100...0.

Under modified sampling, E_{waste} achieves maximum at U=100...0

We now consider E_{waste} as a function of U. Every U specifies a unique path in the binary tree. At every depth d in a path where the left child was chosen (U_d=0) the right child represents a subtree with 2^{D-(d+1)} leaves that can be early-rejected using (d+1) samples.

This gives, when the SCS is the entire depth-D tree:

E_{waste}(U)=\sum_{d:U_d=0} (d+1)\frac{2^{D-(d+1)}}{2^D}=\sum_{d:U_d=0} \frac{d+1}{2^{d+1}}.\tag{4}

For smaller SCSs there are two necessary observations:
  1. Among all U that correspond to the same SCS, each additional U_d=0 contributes an additional positive term in (4) and thus increases E_{waste}(U). Therefore the U that maximizes E_{waste}(U) for a given SCS is the U with every unconstrained bit equal to 0.
  2. Across SCSs of different depths, E_{waste}(U) is maximized by the SCS that is the full binary tree of depth D, because the maximizing U for any smaller SCS simply leaves out terms from (4).
From these observations we can conclude that E_{waste}(U) is maximized over all U at U=100...0, and its value there is (4) with every available term included.

We can then use the fact that

\lim_{D \to \infty} \sum_{d=0}^{D} \frac{(d+1)}{2^{(d+1)}}=2

to conclude that E_{waste} \leq 2, with this bound approached at U=100...0 as D grows.
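The limit is easy to confirm numerically (a quick check):

```python
# Partial sums of sum_{d>=0} (d+1)/2^(d+1) = 1/2 + 2/4 + 3/8 + ...
partial = sum((d + 1) / 2 ** (d + 1) for d in range(60))
print(partial)  # approaches 2; the tail beyond 60 terms is ~61/2^60
```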

Worst case is therefore D+2

We have shown that the U that maximizes both {p_{rej}\over p_{acc}} and E_{waste} is U=100...0. Therefore the maximum of their product is also achieved at U=100...0 and in particular is at most

{p_{rej}\over p_{acc}}E_{waste} \leq 1 \cdot 2 = 2,

which directly implies that E_{cost} \leq D+2.

Extension to n-ary strings

Similar reasoning seems to show that E_{cost} \leq D+n.