Blood-brain barrier (BBB) challenges in CNS drug development
The BBB is formed by tightly joined endothelial cells, astrocytes, and pericytes, creating a selective filter that maintains brain homeostasis. While it protects the brain from pathogens and toxins, this barrier presents a major obstacle to CNS drug delivery. Small molecules beyond a modest size or polarity threshold, and virtually all large biologics, cannot diffuse through the BBB without specialized transport mechanisms. As a result, many CNS drug candidates with otherwise good therapeutic profiles fail to reach their targets in the brain. Incorporating BBB permeability predictions into the drug discovery pipeline is therefore crucial: by triaging molecules that are unlikely to cross into the brain, researchers can prioritize better candidates and design drugs with properties favoring BBB penetration (e.g. smaller, more lipophilic molecules with molecular weight under ~400 Da, which slip through the barrier more easily).
Data science and AI offer powerful tools to tackle this challenge. Machine learning models can be trained on known BBB permeability data to recognize patterns in molecular structure that correlate with crossing or not crossing the barrier. An AI agent that predicts BBB permeability could drastically speed up CNS drug discovery by quickly evaluating large libraries of compounds and suggesting which ones merit further testing. Below, such an implementation is described end to end, from assembling data and engineering features to modeling and deployment, along with how it helps biotech researchers make informed decisions for CNS programs.
How the B3DB dataset combines 50 literature sources to predict drug brain penetration
To build a robust predictor, the Blood-Brain Barrier Database (B3DB) was leveraged, currently the benchmark dataset for BBB permeability research. B3DB compiles data from ~50 literature sources, combining multiple smaller BBB studies into one comprehensive resource. It contains 7807 compounds with experimentally determined BBB permeability labels: either BBB+ (permeable) or BBB− (impermeable). (A subset of ~1058 compounds also has numeric logBB values, the log of the brain-to-blood concentration ratio, but our project focused on the binary classification task.) This large, diverse dataset provided a solid foundation for training the AI agent, addressing limitations of earlier studies that used much smaller datasets.
Each entry in B3DB includes a chemical structure given as a SMILES string (SMILES is a line notation used in cheminformatics to represent a molecule's structure as a short, unambiguous string of ASCII characters), the compound name, and the BBB permeability class, among other fields. All available B3DB records were loaded, and basic data cleaning and preprocessing was performed: redundant columns were removed, and the categorical BBB permeability labels ("BBB+" or "BBB-") were mapped to a binary flag bbb_binary (1 for permeable, 0 for not permeable):
import pandas as pd

# full_df holds the raw B3DB records loaded beforehand, e.g. from the
# published TSV files via pd.read_csv(..., sep='\t'); the exact path
# depends on how B3DB was obtained.

# Drop unusable or empty columns, and rows missing SMILES or BBB labels
cols_to_drop = ['class', 'source', 'comments', 'CAS', 'Inchi']
full_df = full_df.drop(columns=cols_to_drop)
full_df = full_df.dropna(subset=['smiles', 'BBB+/BBB-'])

# Standardize column names and encode BBB labels as a 0/1 target
full_df = full_df.rename(columns={'BBB+/BBB-': 'bbb_label', 'logBB': 'logbb'})
full_df['bbb_binary'] = full_df['bbb_label'].map({'BBB+': 1, 'BBB-': 0})
After cleaning, the compiled dataset contained over 17,000 records (including some duplicate entries of compounds from different sources). The class balance was reasonably good for machine learning: roughly two-thirds of the entries were BBB-permeable and one-third were not. This ratio reflects the bias in published data (more BBB+ compounds reported), and it was addressed by using balanced training methods (described below).
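The class balance is easy to verify directly; a minimal check, assuming the bbb_binary column created above:

# Fraction of each class; should show roughly two-thirds BBB+ (1), one-third BBB- (0)
print(full_df['bbb_binary'].value_counts(normalize=True))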
From SMILES to features: 2,048-bit fingerprints + key molecular descriptors
A critical step was transforming each molecule's structure (given by its SMILES notation) into features that a machine learning model can understand. Two types of features were employed:
- Molecular descriptors: Simple physicochemical properties calculated from the SMILES. For example, each compound's molecular weight (MolWt) was computed using RDKit. Other descriptors like topological polar surface area or the octanol-water partition coefficient (cLogP, a measure of a molecule's lipophilicity) are also relevant, but to keep the feature set compact we started with just a few basic ones (see the sketch after this list). The BBB label from B3DB was added as the target variable.
- Chemical fingerprints: A more information-rich representation; fingerprints encode the presence of molecular substructures as binary vectors. We used the popular Morgan circular fingerprint (also known as ECFP) with radius 2 and 2048 bits. This essentially maps each molecule to a 2048-dimensional bit vector, where each bit indicates the presence or absence of a particular chemical fragment. Fingerprints are excellent for capturing structural patterns and are well-suited for similarity search and machine learning.
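As an illustration of the descriptor step, the sketch below computes molecular weight plus the two optional descriptors mentioned above using RDKit (the helper name smiles_to_descriptors and the column names are ours, not part of B3DB):

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def smiles_to_descriptors(smiles):
    # Parse the SMILES; unparseable structures yield None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        'molwt': Descriptors.MolWt(mol),    # molecular weight (Da)
        'tpsa': Descriptors.TPSA(mol),      # topological polar surface area
        'clogp': Descriptors.MolLogP(mol),  # Crippen logP (lipophilicity estimate)
    }

# Attach the descriptors as columns (invalid SMILES become NaN and are dropped later)
desc_df = full_df['smiles'].apply(smiles_to_descriptors).apply(pd.Series)
full_df = pd.concat([full_df, desc_df], axis=1)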
Using RDKit, these features were added to the working DataFrame. Below is a simplified example of how the fingerprint was generated for each SMILES:
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles, radius=2, nBits=2048):
    # Parse the SMILES and compute a Morgan (ECFP-like) bit-vector fingerprint
    if isinstance(smiles, str):
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits))
    return None

# Apply the fingerprint function to all molecules
full_df['fingerprint'] = full_df['smiles'].apply(smiles_to_fingerprint)
full_df = full_df[full_df['fingerprint'].notnull()]  # drop any compounds with invalid SMILES
Converting SMILES to Morgan fingerprint features (a way to represent molecules as numerical vectors). Each molecule is encoded as a 2048-bit vector (a NumPy array) representing its 2D substructures. Entries whose SMILES could not be parsed (if any) were removed.
In addition to fingerprints, a few intuitive descriptors (like molecular weight) were retained in the feature set. Prior studies have shown that molecular size, lipophilicity, charge, and polar surface area are key factors influencing BBB permeability. For instance, compounds that cross the BBB often have molecular weight under ~450 Da and moderate cLogP values. The dataset confirmed some of these trends, e.g. BBB+ compounds tended to have slightly lower molecular weight on average than BBB− compounds.
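Such trends are straightforward to check; a minimal sketch, assuming the molwt column computed earlier:

# Compare molecular weight between BBB- (0) and BBB+ (1) compounds
print(full_df.groupby('bbb_binary')['molwt'].agg(['mean', 'median', 'count']))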
From 91% accuracy with Random Forest to transformer-based predictions: Comparing traditional ML and MegaMolBART deep learning for BBB classification
With a processed dataset in hand, predictive models were trained to classify compounds as BBB-permeable or not. A two-pronged modeling strategy was set up:
1. Baseline machine learning models: First, classic supervised learning algorithms were applied using the engineered features (descriptors and fingerprints). In particular, a Random Forest (RF) classifier and an XGBoost gradient-boosted trees model were trained. These ensemble methods handle high-dimensional data well and are less prone to overfitting than a single decision tree. A Support Vector Machine (SVM) using the fingerprint bits as input was also tried, though tree-based models were faster to train on this dataset. To evaluate performance, we held out a test set and performed 5-fold cross-validation on the training set for robust estimates.
Training the Random Forest was straightforward using scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Split data into train/test (70/30 split stratified by bbb_binary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# class_weight='balanced' compensates for the ~2:1 BBB+/BBB- imbalance noted earlier
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
Training a Random Forest classifier on the feature matrix X and labels y. The model was evaluated with accuracy and ROC–AUC on a held-out test set.
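The 5-fold cross-validation mentioned above can be run the same way; a minimal sketch using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold CV on the training split for a more robust performance estimate
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print("5-fold CV ROC AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))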
The Random Forest achieved ~91% accuracy on the test set, with a ROC–AUC of ~0.93, indicating excellent discrimination between BBB+ and BBB− compounds. The precision-recall breakdown showed high recall (~95%) for the BBB+ class, which is useful since we prefer to catch as many permeable compounds as possible (even at the expense of a few false positives). The XGBoost model yielded similar performance, and both outperformed the SVM baseline. These results are quite strong, considering the inherent noise in experimental BBB measurements and the diversity of the dataset.
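The per-class precision-recall breakdown referenced here can be reproduced with scikit-learn's classification_report; a minimal sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, y_pred, target_names=['BBB-', 'BBB+']))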
2. Advanced deep learning model: In parallel, a transformer-based molecular language model, MegaMolBART (a transformer pre-trained on millions of chemical structures), was employed. The idea is to use the transformer's encoder to generate learned vector representations (embeddings) of each molecule's SMILES, and then classify those via a neural network or gradient boosting. Such an approach can capture complex structural patterns that fixed descriptors might miss. In recent research, Huang et al. combined MegaMolBART's encoder with an XGBoost classifier and achieved high accuracy in BBB prediction. The AI agent mirrors this approach: SMILES are fed into the MegaMolBART encoder to produce a dense feature vector, which an XGBoost model then classifies as BBB+ or BBB−.
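A minimal sketch of this pipeline is shown below. The encoder call is a placeholder (megamolbart_encode stands in for whatever interface your MegaMolBART deployment exposes, e.g. an NVIDIA BioNeMo inference client); only the XGBoost half uses a real, stable API:

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder: maps a list of SMILES strings to an (n_molecules, emb_dim)
# array of dense embeddings; the actual call depends on your MegaMolBART setup.
X_emb = megamolbart_encode(full_df['smiles'].tolist())
y = full_df['bbb_binary'].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_emb, y, test_size=0.3, stratify=y, random_state=42)

# Gradient-boosted trees trained on top of the learned embeddings
xgb_clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb_clf.fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, xgb_clf.predict_proba(X_te)[:, 1]))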
The transformer-based model's performance was on par with the Random Forest. In our tests, the MegaMolBART+XGBoost pipeline reached about 88% test accuracy, with an ROC–AUC around 0.88–0.90, in line with literature reports. This slight dip relative to the Random Forest's 0.93 AUC could be due to limited hyperparameter tuning or the need for more task-specific fine-tuning.