Self Normalizing Flows

In this work, we take a different path than most normalizing flow literature, and instead of trying to simplify or approximate the computation of Jacobian determinant, we approximate its gradient. To do this, we leverage the inverse function theorem, which states that for invertible functions, the inverse of the Jacobian matrix is given by the Jacobian of the inverse function, i.e. $\mathbf{J}_f^{-1} = \mathbf{J}_{f^{-1}}$. Thus, if we are able approximate the inverse of $f$, we have simultaneously achieved an approximation for the gradient required to train $f$. In other words, if $f^{-1} \approx g$, then $\mathbf{J}_{f}^{-1} \approx \mathbf{J}_{g}$, and we have avoided computation of both the inverse and the Jacobian determinant.⁴

Following this idea, we propose to define and parameterize both the forward and inverse functions $f$ and $g$ with parameters $\theta$ and $\gamma$ respectively. We then propose to constrain the parameterized inverse $g$ to be approximately equal to the true inverse $f^{-1}$ though a reconstruction loss. We can thus define our maximization objective as the mixture of the log-likelihoods induced by both models minus the reconstruction penalty constraint, i.e.:

$$\begin{equation} \begin{split} \mathcal{L}(\mathbf{x}) & = \frac{1}{2} \log p^{f}_{\mathbf{X}}(\mathbf{x}) + \frac{1}{2} \log p^{g}_{\mathbf{X}}(\mathbf{x}) - \lambda ||g \left( f (\mathbf{x}) \right) - \mathbf{x} ||^2_2 \end{split} \end{equation}$$

We see that when $f = g^{-1}$ exactly, this is equivalent to the traditional normalizing flow framework.

Examples

Self Normalizing Fully Connected Layer

As a specific case of the above model, we consider a single fully connected layer, as exemplified in the figure at the top of this post. Let $f(\mathbf{x}) = \mathbf{W} \mathbf{x} = \mathbf{z}$, and $g(\mathbf{z}) = \mathbf{R} \mathbf{z}$, such that $\mathbf{W}^{-1} \approx \mathbf{R}$. Taking the gradient of objective for this model, and approximating inverse matricies with their learned inverses, we get:

$$\begin{equation} \begin{split} \frac{\partial}{\partial \mathbf{W}} \mathcal{L}(\mathbf{x}) & = \frac{1}{2} \underbrace{\left(\frac{\partial}{\partial \mathbf{W}} \log p_{\mathbf{Z}}(\mathbf{W} \mathbf{x}) + \mathbf{W}^{-T}\right)}_{\textstyle \approx \delta_{\mathbf{z}} \mathbf{x}^T + \mathbf{R}^T} \ - \frac{\partial}{\partial \mathbf{W}} \lambda \mathcal{E} \end{split} \end{equation}$$ $$\begin{equation} \begin{split} \frac{\partial}{\partial \mathbf{R}} \mathcal{L}(\mathbf{x}) & = \frac{1}{2} \underbrace{\left(\frac{\partial}{\partial \mathbf{R}} \log p_{\mathbf{Z}}(\mathbf{R}^{-1} \mathbf{x}) - \mathbf{R}^{-T} \right)}_{\textstyle \approx - \delta_{\mathbf{x}} \mathbf{z}^T - \mathbf{W}^T}\ - \frac{\partial}{\partial \mathbf{R}} \lambda \mathcal{E} \\ \end{split} \end{equation}$$

where $\mathcal{E}$ denotes the reconstruction error, $\frac{\partial}{\partial \mathbf{W}} \lambda \mathcal{E} = 2\lambda \mathbf{R}^T(\hat{\mathbf{x}} - \mathbf{x}) \mathbf{x}^T$, $\frac{\partial}{\partial \mathbf{R}} \lambda \mathcal{E} = 2\lambda (\hat{\mathbf{x}} - \mathbf{x})\mathbf{z}^T$, $\hat{\mathbf{x}} = \mathbf{R}\mathbf{W}\mathbf{x}$, $\delta_{\mathbf{z}} = \frac{\partial \log p_{\mathbf{Z}}(\mathbf{z})}{\partial \mathbf{z}}$, and $\delta_{\mathbf{x}} = \frac{\partial \log p_{\mathbf{Z}}(\mathbf{z})}{\partial \mathbf{x}}$ are ordinarily computed by backpropagation. We observe that by using such a self normalizing layer, the gradient of the log-determinant of the Jacobian term is approximately given by the weights of the inverse transformation, sidestepping computation of the Jacobian determinant and all matrix inverses.

Self Normalizing Convolutional Layer

To construct a self normalizing convolutional layer, let $f(\mathbf{x}) = \mathbf{w} \star \mathbf{x} = \mathbf{z}$, and $g(\mathbf{z}) = \mathbf{r} \star \mathbf{z}$, such that $f^{-1} \approx g$, where $\star$ is the convolution operation. We first note that the inverse of a convolution operation is not necessarily another convolution. However, for sufficiently large $\lambda$, we observe that $f$ is simply restricted to the class of convolutions which is approximately invertible by a convolution. As we derive fully in the paper, the approximate self normalizing gradients with respect to a convolutional kernel $\mathbf{w}$, and corresponding inverse kernel $\mathbf{r}$, are given by: \[\begin{equation} \begin{split} \frac{\partial }{\partial \mathbf{w}} \log p^{f}_{\mathbf{X}}(\mathbf{x}) & \approx \delta_{\mathbf{z}} \star \mathbf{x} + \mathrm{flip}(\mathbf{r}) \odot \mathbf{m} \\ \end{split} \end{equation}\] \[\begin{equation} \begin{split} \frac{\partial }{\partial \mathbf{r}} \log p^{g}_{\mathbf{X}}(\mathbf{x}) & \approx -\delta_{\mathbf{x}} \star \mathbf{z} - \mathrm{flip}(\mathbf{w}) \odot \mathbf{m} \\ \end{split} \end{equation}\]

where $\mathrm{flip}(\mathbf{r})$ corresponds to the kernel which achieves the transpose convolution and is given by swapping the input and output channels, and mirroring the spatial dimensions. The constant $\mathbf{m}$ is given by the number of times each element of the kernel $\mathbf{w}$ is present in the matrix form of convolution. The gradients for the reconstruction loss are then added to these to maintain the desired constraint.

Results

We evaluate our model by incorporating the above self normalizing layers into a variety of flow architectures and training them to maximize the proposed mixture objective on datasets including MNIST, CIFAR-10, and ImageNet32. We refer to the paper to see the full results and experiment details, but provide some highlights here. In the figure below, we show that a basic flow model composed of 2 self normalizing fully connected layers (SNF FC 2L) achieves similar (or better) performance than its exact gradient counterpart (Exact FC 2L), while training significantly faster.

2L Inset

We can additionally see that samples from the model appear to match samples from the true data distribution. In the figure below, we show samples from $p_{\mathbf{Z}}$ passed through both the true inverse $\mathbf{x} = f^{-1}(\mathbf{z})$ (top) and the learned inverse $\mathbf{x} = g(\mathbf{z})$ (bottom). We see that the samples are virtually indistinguishable, thus implying that the model has learned to approximate its own inverse well, while simultaneously learning a good model of the data distribution (since the samples look like realistic images).

Discussion

We see that the above framework yields an efficient update rule for flow-based models which appears to perform similarly to the exact gradient while allowing for the training of flow architectures which were otherwise computationally infeasible. We believe this opens up a broad range of new research directions involving less constrained flows, and intend to investigate such directions in future work. Furthermore, we believe the biological motivation and plausibility of this model are also worth exploring further. Ideally, we believe the gradient approximation used here could be combined with backpropagaion-free training methods such as Target Propagation, allowing for a fully backprop-free unsupervised density model.

Example Implementation

Below we give an example implementation of the above convolutional layer in Pytorch (which is also a generalization of the fully connected layer). We make use of the pytorch autograd API to create a fast and modular implementation. We note that this implementation still requires the layer-wise reconstruction gradients to be computed and added to this separately. The full code for the paper is available at the repository here: github.com/akandykeller/SelfNormalizingFlows

class SelfNormConvFunc(autograd.Function):
    @staticmethod
    def forward(ctx, x, W, bw, R, stride, padding, dilation, groups):
        z = F.conv2d(x, W, bw, stride, padding, dilation, groups)

        ctx.save_for_backward(x, W, bw, R, z)
        ctx.stride = stride
        ctx.padding = padding
        ctx.dilation = dilation
        ctx.groups = groups
        return z

    @staticmethod
    def backward(ctx, output_grad):
        x, W, bw, R, output = ctx.saved_tensors

        stride = torch.Size(ctx.stride)
        padding = torch.Size(ctx.padding)
        dilation = torch.Size(ctx.dilation)
        groups = ctx.groups
        benchmark = False
        deterministic = False

        multiple = _compute_weight_multiple(W.shape, output, x, padding, stride, 
                            dilation, groups, benchmark, deterministic)

        # Grad_W LogP(x)
        delta_z_xt = conv2d_backward.backward_weight(W.shape, output_grad, x, 
                                                     padding, stride, dilation, 
                                                     groups, benchmark, deterministic)
        weight_grad_fwd = (delta_z_xt - flip_kernel(R) * multiple) / 2.0

        # Grad_R LogP(x)
        input_grad = conv2d_backward.backward_input(x.shape, output_grad, W,
                                                    padding, stride, dilation,
                                                    groups, benchmark,
                                                    deterministic)
        Wx = output - bw.view(1, -1, 1, 1) if bw is not None else output
        neg_delta_x_Wxt = conv2d_backward.backward_weight(W.shape, -1*input_grad, Wx,
                                                          padding, stride, dilation, 
                                                          groups, benchmark, 
                                                          deterministic)
        weight_grad_inv = (neg_delta_x_Wxt + flip_kernel(W) * multiple) / 2.0

        if bw is not None:
            # Sum over all except output channel
            bw_grad = output_grad.view(output_grad.shape[:-2] + (-1,)).sum(-1).sum(0) 
        else:
            bw_grad = None

        return input_grad, weight_grad_fwd, bw_grad, weight_grad_inv, None, None, None, None

The above code requires the use of pytorch’s autograd functions backward_weight and backward_input. The pytorch implementations can be found here, however, these are much slower than the C++ implementations. To use the C++ implementations, we built a simply torch C++ extension. Please see the full repository for reference.

Finally, the above code makes use of two helper functions to compute the multiple, and ‘flip’ the kernel, these functions can be implemented as:

@lru_cache(maxsize=128)
def _compute_weight_multiple(wshape, output, x, padding, stride, dilation, 
                                 groups, benchmark, deterministic):
    batch_multiple =  conv2d_backward.backward_weight(wshape, 
                                           torch.ones_like(output),
                                           torch.ones_like(x),
                                           padding, stride, dilation,
                                           groups, benchmark, deterministic)
    return batch_multiple / len(x)


def flip_kernel(W):
    return torch.flip(W, (2,3)).permute(1,0,2,3).clone()

Similarly, the goal of probabilistic generative models can be seen as designing models which are able to generate data which appears to come from the same distribution as real data. ↩︎
Or equivalently, learn the inverse of each layer. We note that we are far from the first to propose this idea. The works of Difference Target Propagation, and (Rippel and Adams 2013) both use similar learned inverses at each layer for different purposes, just to name a few. Our work however, is the first to our knowledge to propose using the learned inverse as a direct approximation to computaionally expensive gradients. Please see the full paper for a full overview of related work. ↩︎
The gradient is given exaclty by the inverse parameters only in the case of linear transformations (since in this case the Jacobian of the transformation is equal to the linear map itself). For non-linear self normalizing flows we direct the reader to the general framework of our paper, where the Jacobian of the inverse transformation is used in-place of the inverse of the Jacobian of the forward transformation. ↩︎
This assumes that $f^{-1} \approx g$ implies $\mathbf{J}_{f^{-1}} \approx \mathbf{J}_g$. We note that this is true for all linear functions $f$, but may not be true for certain high frequency functions. In such cases, we propose the penalize the jacobians to be approximately equal directly. ↩︎

Video Summary

Introduction

Background