TL;DR
- Come up with a model architecture with better metrics on BM3 while using fingerprints as input
- Explore more ways to baseline non-weighted models for sub-structure data
Introduction
1. Background
- Name of the experiment: Scaffold-based subfragment resource propagation
- Experiment start date: 01/04/2021
- Please look at initial observations and results for this project here: https://coda.io/d/Use-cases_dgX-TWPBu3W/210121-Promiscuity-model-using-compound-sub-structures_su3as#_luGKU
1.1 Experiment motivation
- The goal of this exercise is to reduce promiscuity in prediction, i.e. machine learning models tend to predict “active” for a lot of compounds. This happens a) when models don’t see enough inactive samples while training (relevant to kinase data) and b) when models fail to generalize across different chemical spaces
- An ideal model will have to generalize over a new chemical space too, i.e. not predict “active” for most compounds. Metrics on BM3 should look good, with an auc_roc > 0.8, since our single-target models have 0.8 auc_roc; here we want to achieve something better than that
- Goal overview: we want to achieve two things here: 1) come up with a model that produces a lower percentage of actives, i.e. the false positive rate should be low, and 2) the model should be able to perform well on new chemical space
1.2 Experiment approach
- To achieve our goals, we split this experiment into two blocks: a) develop a model that learns from all targets simultaneously and produces less promiscuity, b) use sub-structures as embeddings and do the same. Why is it important to do step a)? Currently, even with state-of-the-art models such as chemprop and deepchem, we are not able to get less promiscuous results / good metrics on our data
- Our hypothesis is that the sub-structures present in a compound decide how the “parent compound” interacts with targets
- We generate a sub-structure score for every sub-structure present in the dataset and enrich a fingerprint of choice with it (see the sketch after this list). At the beginning of this experiment we started with ECFP; currently we are working on both MACCS and ECFP fingerprints
- Before we can see whether sub-structures actually add any value to the learning, it is important to have a baseline model that uses just the fingerprints
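A minimal sketch of what this enrichment could look like, assuming a precomputed dict of substructure scores keyed by fingerprint bit index. All names here (`substructure_scores`, `enrich_fingerprint`) are hypothetical, not the actual pipeline code:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def enrich_fingerprint(smiles, substructure_scores, n_bits=2048):
    """Concatenate an ECFP bit vector with per-bit substructure scores."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    bits = np.array(fp, dtype=np.float32)
    scores = np.zeros(n_bits, dtype=np.float32)
    for idx in fp.GetOnBits():
        # 0.0 where no score is known for the substructure behind this bit
        scores[idx] = substructure_scores.get(idx, 0.0)
    return np.concatenate([bits, scores])
```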
1.2.1 Model Approach
- We wanted to develop multitask models that can simultaneously learn from correlated tasks. So far, with the state-of-the-art MTL model (chemprop) on our BM3 data, we had an average AUC of only 0.5. As we have no baseline model that uses fingerprints as input, our first priority is to build an MTL model that uses compound fingerprints as input
- Evaluate our substructure scores (obtained from the bipartite graph) using non-weighted methods such as simple aggregation (see the sketch after this list)
- Build models that can learn from substructure scores
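For the non-weighted baseline, one simple option is to pool the per-target scores of a compound's substructures without any learned weights. A sketch, where `sub_scores` is a hypothetical dict mapping substructure id to a per-target score vector:

```python
import numpy as np

def aggregate_scores(substructure_ids, sub_scores, how="mean"):
    """Non-weighted pooling of per-target substructure scores for one compound."""
    mat = np.stack([sub_scores[s] for s in substructure_ids if s in sub_scores])
    if how == "mean":
        return mat.mean(axis=0)  # average contribution of each substructure
    if how == "max":
        return mat.max(axis=0)   # strongest substructure signal per target
    raise ValueError(f"unknown aggregation: {how}")
```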
1.3 Results from previous week
- Trained a simple MTL MLP model on the data
Data background:
- Data was constructed from training_data_0925.parquet
- This is a dense dataset: compounds that have more than 200 interactions and targets that have more than 400 interactions were retained. The reason we need a dense dataset is that we will be predicting on all targets of interest, and if there are a lot of NaNs, chances are the model won't learn anything due to the sparseness of the data (see the filtering sketch below)
- Number of unique compounds: 123483; number of unique targets: 315
- Though this is a fairly dense dataset (a small number of NaNs), only about 2% of the interactions present are positive
- The starting data is very imbalanced. As you can see, the starting positive % is only 2%; our assumption when starting this experiment was that, since this is a multitask model, correlated tasks would learn simultaneously and that would help with the imbalance in the data
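A sketch of how the dense subset could be derived from training_data_0925.parquet, assuming a long-format table with one row per (compound, target) interaction; the column names below are assumptions:

```python
import pandas as pd

df = pd.read_parquet("training_data_0925.parquet")

# Keep compounds with > 200 interactions and targets with > 400 interactions.
compound_counts = df["compound_id"].value_counts()
target_counts = df["target_id"].value_counts()
dense = df[
    df["compound_id"].isin(compound_counts[compound_counts > 200].index)
    & df["target_id"].isin(target_counts[target_counts > 400].index)
]

# Pivot to a compound x target label matrix; pairs that were never measured
# become NaN and are masked out of the training loss (see below).
matrix = dense.pivot_table(
    index="compound_id", columns="target_id", values="label", aggfunc="max"
)
```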
Basic MLP
Observation
- Used the MLP structure on both ECFP and MACCS data
- We did variations of training: a) vanilla training, b) training with an added penalty on the loss
- While training we calculate the loss only for data points that have values, i.e. we mask out the NaNs present in the data (see the sketch after this list)
- This model had very bad metrics on BM3, with AUC = 0.5 for both ECFP and MACCS fingerprints as input
- Linear layer weights are initialized with kaiming_uniform
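A minimal sketch of the model and the masked loss described above; layer sizes are illustrative (the 167-dim input corresponds to MACCS):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMTLMLP(nn.Module):
    """Minimal multitask MLP: fingerprint in, one logit per target out."""
    def __init__(self, in_dim=167, n_tasks=315, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_tasks),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")

def masked_bce_loss(logits, labels):
    """BCE computed only where a label exists; NaN entries are masked out."""
    mask = ~torch.isnan(labels)
    return F.binary_cross_entropy_with_logits(logits[mask], labels[mask])
```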
Notes
- This model is way too simple to learn anything from the data
- For MACCS features, which start with a feature size of only 167, we predict 315 tasks at the end, which also causes the model to learn nothing
- Add results here
1.4 To do from previous week
- Build deep models that can take fingerprints as input, assuming that will help with improving metrics
- First pass on MACCS substructure score data: substructure scores are created using bipartite graphs. The goal here is to map training-data parent compounds to their respective substructure scores (see the sketch below)
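A sketch of that mapping, assuming the substructure score dictionary is keyed by MACCS bit index (a hypothetical structure; the real keys may differ):

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

def map_compound_to_scores(smiles, score_dict):
    """Look up bipartite-graph scores for each MACCS bit set in a compound."""
    mol = Chem.MolFromSmiles(smiles)
    fp = MACCSkeys.GenMACCSKeys(mol)  # 167-bit MACCS key fingerprint
    return {bit: score_dict[bit] for bit in fp.GetOnBits() if bit in score_dict}
```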
1.4.1 Model using Residual block and Spatial Gate
- Residual convolutional block: this block mimics the residual block concept, out = block(x) + x, where x is the input. Here we use a sequential conv_bn (Conv1d -> BatchNorm1d -> ReLU) as block()
- Spatial gate: uses a compressed version of the input data and adds it back to the original data; acts as an attention layer between residual convolution blocks
- Main block: acts like an autoencoder / U-Net, with encoder and decoder parts consisting of the above-mentioned blocks (see the sketch after this list)
- Model was trained on both MACCS and ECFP features
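A sketch of these blocks as described; the channel sizes and the exact gating compression are assumptions, since only the high-level design is recorded here:

```python
import torch.nn as nn

def conv_bn(channels, kernel_size=3, act=nn.ReLU):
    """Conv1d -> BatchNorm1d -> activation, as in the description above."""
    return nn.Sequential(
        nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(channels),
        act(),
    )

class ResidualConvBlock(nn.Module):
    """out = block(x) + x, with conv_bn as block()."""
    def __init__(self, channels):
        super().__init__()
        self.block = conv_bn(channels)

    def forward(self, x):
        return self.block(x) + x

class SpatialGate(nn.Module):
    """Compress the input and add it back: a light attention between
    residual blocks. The 1x1 bottleneck is one plausible reading of
    'compressed version of input data', not the exact implementation."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        return x + self.gate(x)
```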
1.4.2 Residual block with different basic block parameters
- Same architecture as above, but with a different activation layer, SiLU, as described in https://arxiv.org/abs/1710.05941v1
- SiLU works better than ReLU in dealing with the vanishing gradient problem
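With a conv_bn helper like the one sketched in 1.4.1, the activation swap is a one-line change:

```python
silu_block = conv_bn(channels=64, act=nn.SiLU)  # SiLU/Swish, per the cited paper
```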
1.5 Goals for this week
- Work on different model architectures to improve metrics on BM3 and reduce false positive predictions
- Generate sub-structure scores for the MACCS fingerprint from the substructure score dictionary