Case Study 12: Higgs Boson Replica Neural Network
In this case study, we replicate the modeling performed by Baldi, Sadowski, and Whiteson in the research paper "Searching for Exotic Particles in High-Energy Physics with Deep Learning" (https://arxiv.org/pdf/1402.4735.pdf), applying deep neural networks (DNNs) to the Higgs boson dataset.
The model used in the research paper was a 5-layer multilayer perceptron (MLP) with tanh activation, a weight decay ($L2$ regularization) coefficient of $1 \times 10^{-5}$, and layers initialized with weights drawn from a random normal distribution. These hyperparameters are summarized in Table 1.
Table 1. Model Architecture Hyperparameters
Parameterized Object | Node Count | Activation | Weight Initialization | Weight Decay |
---|---|---|---|---|
Layer 1 | 300 | tanh | random normal ($\mu = 0$, $\sigma = 0.1$) | $L2$ regularization, $1 \times 10^{-5}$ |
Layer 2 | 300 | tanh | random normal ($\mu = 0$, $\sigma = 0.05$) | $L2$ regularization, $1 \times 10^{-5}$ |
Layer 3 | 300 | tanh | random normal ($\mu = 0$, $\sigma = 0.05$) | $L2$ regularization, $1 \times 10^{-5}$ |
Layer 4 | 300 | tanh | random normal ($\mu = 0$, $\sigma = 0.05$) | $L2$ regularization, $1 \times 10^{-5}$ |
Layer 5 | 300 | tanh | random normal ($\mu = 0$, $\sigma = 0.001$) | $L2$ regularization, $1 \times 10^{-5}$ |
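As a reference, below is a minimal Keras sketch of this architecture. The function name `build_replica_model` is our own, and the sigmoid output layer is not listed in Table 1 but is the standard choice for a binary signal/background label.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, initializers

def build_replica_model(n_features):
    """Five hidden tanh layers of 300 units each, per Table 1."""
    model = tf.keras.Sequential(name="higgs_replica")
    model.add(tf.keras.Input(shape=(n_features,)))

    # All hidden layers share the activation and weight decay; only the
    # standard deviation of the random normal initializer differs.
    stddevs = [0.1, 0.05, 0.05, 0.05, 0.001]
    for std in stddevs:
        model.add(layers.Dense(
            300,
            activation="tanh",
            kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=std),
            kernel_regularizer=regularizers.l2(1e-5)))

    # Sigmoid output for the binary signal/background label (not part of Table 1).
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```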
In addition to the model architecture, we also replicated the training process. The model was trained with stochastic gradient descent (SGD) with a batch size of 100. The learning rate was initialized at 0.05 and decreased by a factor of 1.0000002 on each batch, down to a minimum rate of $1 \times 10^{-5}$. The momentum was initialized to 0.9 and increased linearly to 0.99 over 200 epochs, remaining constant after the 200th epoch. For the stopping criterion, the research paper indicates that early stopping with a minimum change in error of 0.00001 over 10 epochs was used to decide when to stop training (resulting in training runs of 200-1000 epochs). However, the paper does not indicate which error metric was monitored for early stopping.
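A minimal sketch of this training schedule using Keras callbacks is shown below. The class `PaperSchedule` and its attribute names are our own; because the paper does not name the monitored metric, validation loss is assumed for early stopping, and the way optimizer attributes are updated may differ slightly between Keras versions.

```python
import tensorflow as tf

class PaperSchedule(tf.keras.callbacks.Callback):
    """Per-batch learning-rate decay and per-epoch momentum ramp (sketch)."""

    def __init__(self, lr_init=0.05, lr_min=1e-5, decay=1.0000002,
                 mom_start=0.9, mom_end=0.99, ramp_epochs=200):
        super().__init__()
        self.lr, self.lr_min, self.decay = lr_init, lr_min, decay
        self.mom_start, self.mom_end, self.ramp_epochs = mom_start, mom_end, ramp_epochs

    def on_train_batch_begin(self, batch, logs=None):
        # Divide the learning rate by 1.0000002 every batch, down to the minimum rate.
        self.lr = max(self.lr / self.decay, self.lr_min)
        self.model.optimizer.learning_rate = self.lr

    def on_epoch_begin(self, epoch, logs=None):
        # Increase momentum linearly from 0.9 to 0.99 over the first 200 epochs.
        frac = min(epoch / self.ramp_epochs, 1.0)
        self.model.optimizer.momentum = self.mom_start + frac * (self.mom_end - self.mom_start)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.05, momentum=0.9)

# The paper does not name the monitored metric; validation loss is assumed here.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-5, patience=10, restore_best_weights=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=100, epochs=1000, callbacks=[PaperSchedule(), early_stop])
```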
The goals of this case study are as follows:
- The first column of the dataset is the target label; the remaining columns are the features.
- Use regularization: apply weight decay ($L2$ regularization) to the layer weights, as in the paper.
- Implement the learning rate schedule: the learning rate starts high and is decreased over the course of training. It can be set and updated through the optimizer, or with a scheduler/callback (see the training sketch above).
- Use TensorBoard to watch the training happen in real time (a callback sketch follows this list).
- Reach an accuracy of at least 60% and an AUC of roughly 90-99% of the score reported in the paper (at least in the 0.80s), refining the model by matching the many small details of the paper's setup.
- The aim is to duplicate the paper's research, not to perform our own hyperparameter tuning.
- The input shape of the model equals the number of feature columns in the data.
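A minimal TensorBoard setup might look like this (the log directory name is our own choice):

```python
import datetime
import tensorflow as tf

# Write one log directory per run so runs can be compared in TensorBoard.
log_dir = "logs/higgs_replica/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Pass tensorboard_cb to model.fit(..., callbacks=[...]), then in a notebook:
#   %load_ext tensorboard
#   %tensorboard --logdir logs/higgs_replica
```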
The dataset used in this case study was produced using Monte Carlo simulations. It comprised signal processes that produced Higgs bosons and background processes that did not [Ref: HIGGS Data Set].
The dataset contained:
- 11 million instances with 28 features:
- The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator.
- The last 7 features are functions of the first 21 features (high-level features derived by physicists).
The target variable is a binary indicator, where 1 indicates a signal process that produced Higgs bosons and 0 indicates a background process. The reference research paper indicates that the last 500,000 instances of the dataset were used for model validation.
Import various modules
Various python modules and packages used in this notebook are imported in this section.
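A representative import block is sketched below; the exact set of packages used in the original notebook may differ.

```python
# Core data handling and plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Deep learning and evaluation.
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import roc_auc_score
```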
Load the original data set
There are various methods to import the large dataset:
- Import the dataset from Google Cloud Storage using BigQuery
- Import the dataset from Google Drive
- Import the dataset from local computer
- Import the dataset using TensorFlow utility
Any one of these methods is sufficient to import the large dataset.
Import the dataset from Google Cloud Storage using BigQuery
Loaded the HIGGS dataset from Google Cloud Storage using BigQuery and saved it as the pandas DataFrame 'df'.
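A sketch of this approach, assuming the CSV has already been loaded into a BigQuery table; the project, dataset, and table names below are placeholders.

```python
from google.cloud import bigquery

# Placeholder project/dataset/table names; replace with your own.
client = bigquery.Client(project="my-gcp-project")
query = "SELECT * FROM `my-gcp-project.higgs.higgs_table`"

# Run the query and materialize the result as a pandas DataFrame.
df = client.query(query).to_dataframe()
```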
Import the dataset from Google Drive
Loaded the HIGGS dataset from Google Drive and saved it as the pandas DataFrame 'df'.
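A sketch assuming the notebook runs in Google Colab and HIGGS.csv.gz has been copied to Drive; the path is a placeholder.

```python
from google.colab import drive
import pandas as pd

# Mount Google Drive into the Colab filesystem.
drive.mount('/content/drive')

# Placeholder path; replace with the location of HIGGS.csv.gz in your Drive.
# The file has no header row.
df = pd.read_csv('/content/drive/MyDrive/HIGGS.csv.gz',
                 compression='gzip', header=None)
```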
Import the dataset from local computer
Loaded the HIGGS dataset from the local computer and saved it as the pandas DataFrame 'df'.
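A sketch assuming HIGGS.csv.gz has been downloaded next to the notebook; the path is a placeholder.

```python
import pandas as pd

# Placeholder local path; the file has no header row.
df = pd.read_csv('HIGGS.csv.gz', compression='gzip', header=None)
```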
Import the dataset using TensorFlow utility
Loaded the HIGGS dataset directly from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz and saved it as the pandas DataFrame 'df'.
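A sketch of this approach using the Keras file utility to download and cache the compressed CSV:

```python
import pandas as pd
import tensorflow as tf

# Download (and cache) the compressed CSV with the Keras file utility.
path = tf.keras.utils.get_file(
    'HIGGS.csv.gz',
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz')

# The file has no header row.
df = pd.read_csv(path, compression='gzip', header=None)
```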
Attribute Information: The first column is the class label (1 for signal, 0 for background) followed by the 28 features (21 low-level features then 7 high-level features): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.
Make a copy of the original dataframe for later use
A copy of the loaded dataframe is saved as 'df_original'.
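For example:

```python
# Keep an untouched copy of the loaded data for later use.
df_original = df.copy()
```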
Check the shape and data types
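A quick check might look like the following; the expected shape is 11,000,000 rows by 29 columns (the label plus 28 features).

```python
# Inspect the number of rows/columns and the column data types.
print(df.shape)
print(df.dtypes)
```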
Check for nulls
The dataframe is scanned to check for any null values.
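A sketch of this check:

```python
# Count missing values in each column.
print(df.isnull().sum())
```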
Find Duplicate rows and columns in the data set
In this section, the dataset was checked for duplicate rows, using information from all the columns:
- The output showed that there are 278,698 duplicate rows.
- The code below marks duplicates as 'True' except for the first occurrence: df.duplicated(subset=None, keep='first')
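A short sketch of this check:

```python
# Mark every repeated row as True, except for its first occurrence,
# considering all columns (subset=None).
dupes = df.duplicated(subset=None, keep='first')

# Total number of duplicate rows in the dataset.
print(dupes.sum())
```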