# Author

Saurav Maheshkar - @MaheshkarSaurav

# Introduction

Prior to the introduction of Wide Residual Networks (WRNs) by Sergey Zagoruyko and Nikos Komodakis, deep residual networks were shown to yield only fractional gains in performance at the cost of **doubling** the number of layers. This led to the problem of diminishing feature reuse and made the models slow to train overall. WRNs showed that a wider residual network leads to better performance, improving the then-SOTA results on CIFAR, SVHN and COCO.

In this notebook we run through a simple demonstration of training a WideResnet on the `cifar10` dataset using the Trax framework. Trax is an end-to-end library for deep learning that focuses on **clear code and speed**. It is actively used and maintained by the *Google Brain team*.

# Issues with Traditional Residual Networks

Figure 1: *Various ResNet Blocks*

## Diminishing Feature Reuse

A **residual block with an identity mapping**, which is what allows us to train very deep networks, is also a **weakness**. As the gradient flows through the network, there is nothing to force it to go through the residual block weights, so a block can avoid learning anything useful during training. As a result, only a few blocks may learn valuable representations, or many blocks may share very little information and contribute little to the final goal. Earlier work tried to address this problem with a special case of dropout applied to residual blocks, in which an identity scalar weight is added to each residual block and dropout is applied to that weight.
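The identity-scalar dropout described above can be sketched roughly as follows. This is a toy NumPy illustration (not the paper's implementation): a plain linear map with ReLU stands in for the block's convolutions, and during training the whole residual branch is randomly dropped, leaving only the identity shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_residual_unit(x, W, survival_prob=0.8, train=True):
    """Toy sketch of dropout on the residual branch: with probability
    1 - survival_prob the branch is dropped and only the identity survives.
    At test time the branch is scaled by its expected survival probability."""
    branch = np.maximum(0.0, x @ W)  # stand-in residual function F(x, W)
    if train:
        keep = rng.random() < survival_prob
        return x + (branch if keep else 0.0)
    return x + survival_prob * branch

x = np.ones(4)
W = np.eye(4) * 0.1
y = stochastic_residual_unit(x, W)
print(y.shape)  # (4,)
```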

Widening the residual blocks increases the number of parameters, so the authors studied the effect of dropout to regularize training and prevent overfitting. They argued that dropout should be inserted between the convolutional layers rather than in the identity part of the block, and showed that this placement yields consistent gains, producing new SOTA results.
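To make the placement concrete, here is a minimal NumPy sketch of a widened residual block with dropout between its two weight layers, as the paper recommends. Dense matrices stand in for the convolutions, and the widening factor `k` is an illustrative choice; the identity path is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, train=True):
    """Inverted dropout: zero activations with probability `rate`,
    rescaling the survivors so the expected value is unchanged."""
    if not train or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def wide_block(x, W1, W2, drop_rate=0.3):
    """Residual block with dropout *between* the weight layers,
    not on the identity shortcut."""
    h = np.maximum(0.0, x @ W1)  # first "conv" (dense stand-in) + ReLU
    h = dropout(h, drop_rate)    # dropout sits between the two convolutions
    h = h @ W2                   # second "conv" projects back down
    return x + h                 # identity shortcut stays untouched

d, k = 16, 4                                   # base width, widening factor
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d * k)) * 0.01    # widen: d -> d*k channels
W2 = rng.standard_normal((d * k, d)) * 0.01    # project back: d*k -> d
y = wide_block(x, W1, W2)
print(y.shape)  # (16,)
```

The output keeps the shortcut's dimensionality, so widening only changes the internal channel count of the block.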

The paper *Wide Residual Networks* attempts to answer the question of how wide deep residual networks should be and addresses the problem of training them.

# Residual Networks

$\large x_{l+1} = x_l + \mathbb{F}(x_l, W_l) $

This is the representation of a Residual block with an identity mapping.

$x_l$ and $x_{l+1}$ represent the input and output of the $l$-th unit in the network

$\mathbb{F}$ is a residual function

$W_l$ are the parameters

Figure 1(a) and 1(c) represent the fundamental difference between the *basic* and the *basic-wide* blocks used.
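The residual formula above can be sketched in a few lines of NumPy. The residual function here is a toy stand-in (one linear map with ReLU); the point is that when the branch weights contribute nothing, the unit collapses to the identity mapping, which is exactly the feature-reuse loophole discussed earlier.

```python
import numpy as np

def F(x, W):
    """Toy residual function: a linear map with ReLU, standing in
    for the block's convolutional layers."""
    return np.maximum(0.0, x @ W)

def residual_unit(x, W):
    """x_{l+1} = x_l + F(x_l, W_l): identity shortcut plus residual branch."""
    return x + F(x, W)

x = np.ones(4)
W_zero = np.zeros((4, 4))
# With zero weights the residual branch adds nothing, so the
# unit reduces to the identity mapping.
print(np.allclose(residual_unit(x, W_zero), x))  # True
```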