The alvaRunner project presented here contains the four BBB classification models described in the paper “Mauri, A., & Bertola, M. (2022). Alvascience: A New Software Suite for the QSAR Workflow Applied to the Blood–Brain Barrier Permeability. International Journal of Molecular Sciences, 23(21), 12882. https://doi.org/10.3390/ijms232112882”.

## alvaRunner project

This alvaRunner project contains four classification models:

- M1: a KNN model based on the MACCS 166 fingerprints
- M2: a KNN model based on the Extended Connectivity Fingerprints (ECFP)
- M3: a KNN model based on molecular descriptors (MD)
- MC: a Consensus model based on M1, M2 and M3 (the contained models)
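
As a rough illustration of how a fingerprint-based KNN classifier such as M1 or M2 could assign a class, the sketch below uses plain Python. It is not alvaRunner's implementation: representing fingerprints as sets of on-bit indices, the simple majority vote, the function names `tanimoto_distance` and `knn_predict`, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def tanimoto_distance(a, b):
    """Jaccard-Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union

def knn_predict(query, training, k=5):
    """Majority vote among the k training fingerprints closest to the query."""
    ranked = sorted(training, key=lambda item: tanimoto_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set: made-up bit indices, not real MACCS or ECFP bits.
toy_training = [
    ({1, 2, 3}, "BBB+"),
    ({1, 2, 4}, "BBB+"),
    ({2, 3, 5}, "BBB+"),
    ({7, 8, 9}, "BBB-"),
    ({7, 8, 10}, "BBB-"),
    ({8, 9, 11}, "BBB-"),
]

print(knn_predict({1, 2, 3, 4}, toy_training, k=3))
```

The M3 model would follow the same voting scheme but with a Euclidean distance over descriptor values instead of the Jaccard-Tanimoto distance.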

The M3 model makes use of the following descriptors:

- Mp: mean atomic polarizability (scaled on the carbon atom)
- nN: number of nitrogen atoms included in the molecule
- MPC07: molecular path count of order 7. A graph path is a walk without any repeated vertex; the path count of order 7 is the total number of paths of length 7 in the molecular graph
- NssssN+: number of quaternary ammonium cations included in the molecule
- SHED_DL: SHannon Entropy Descriptor (SHED) considering donor (D) and lipophilic (L) atoms
- SHED_AN: SHED considering acceptor (A) and negative (N) atoms
- F09[C-C]: frequency atom pair descriptor which counts all the atom pairs of carbon atoms at a topological distance equal to 9
- TPSA(Tot): topological polar surface area of a molecule which is defined as the sum of the surfaces of all polar atoms
- MLOGP2: square value of MLOGP where the MLOGP is the Moriguchi octanol-water partition coefficient model

The molecular descriptors included in the *M3* model provide different kinds of information useful for discriminating between BBB+ and BBB- molecules.

Specifically, *Mp* provides information on molecule composition: it discriminates molecules according to their atomic polarizabilities. Because the atomic polarizabilities of hydrogen, fluorine, oxygen and nitrogen are lower than those of chlorine, sulphur, bromine, phosphorus and iodine, *Mp* takes higher values for molecules with a lower rate of saturated bonds (i.e., when the number of hydrogen atoms decreases) and when the percentage of atoms belonging to the second group increases.

*nN* discriminates molecules based on the number of nitrogen atoms they contain. In the training set, 2997 of the 3525 molecules have at least one nitrogen atom, and the number of nitrogen atoms per molecule ranges from 0 to 20.

*MPC07* is the molecular path count of order 7 and provides information on both molecular size and complexity.
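
To make the path-count definition concrete, here is a minimal sketch that counts simple paths with a given number of edges in an undirected graph. It assumes the molecule is given as a hydrogen-depleted adjacency dictionary and that each path is counted once regardless of direction; the names `path_count` and `chain` are hypothetical, and this is not alvaDesc's own code.

```python
def path_count(adjacency, order):
    """Count simple paths with `order` edges (order >= 1) in an undirected graph."""
    def extend(node, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                total += extend(neighbour, visited | {neighbour}, remaining - 1)
        return total

    directed = sum(extend(start, {start}, order) for start in adjacency)
    return directed // 2  # each undirected path is found once from each end

def chain(n):
    """Adjacency dictionary of an unbranched chain of n atoms (toy example)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

print(path_count(chain(8), 7))  # unbranched 8-atom chain
print(path_count(chain(9), 7))  # unbranched 9-atom chain
```

An unbranched chain of 8 atoms contains exactly one path of length 7, and each extra atom or branch adds further paths, which is why the descriptor grows with both size and branching.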

Like *nN*, *NssssN+* provides information on nitrogen atoms: specifically, it is the number of quaternary ammonium cations in the molecule. The training set includes 61 molecules with at least one quaternary ammonium cation.

*SHED_DL* is calculated from the distribution of donor and lipophilic atoms in the molecule and quantifies the variability of this distribution. *SHED_DL* equals 1 for molecules where all donor-lipophilic atom pairs are at the same topological distance and approaches 20 for molecules where the topological distances of donor-lipophilic atom pairs are uniformly distributed.

Like *SHED_DL*, *SHED_AN* is calculated from the distribution of acceptor and negative atoms and provides information on how these atoms are distributed in the molecule. Molecules in which all acceptor-negative atom pairs are at the same topological distance have a value of 1, while for molecules whose acceptor-negative atom pairs lie at different topological distances, *SHED_AN* tends towards 20.
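
The range from 1 to 20 described above is what one would obtain if a SHED value were the exponential of the Shannon entropy of the pair-distance distribution over 20 distance bins. The sketch below works under that assumption; the function name `shed_value`, the input format (a list of topological distances, one per atom pair) and the value returned when no pairs exist are illustrative choices, not alvaDesc's documented behaviour.

```python
import math
from collections import Counter

def shed_value(pair_distances):
    """Exponential of the Shannon entropy of a list of atom-pair topological distances."""
    if not pair_distances:
        return 0.0  # assumption: no atom pairs of the requested types
    counts = Counter(pair_distances)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

print(shed_value([3, 3, 3]))            # all pairs at the same distance
print(shed_value(list(range(1, 21))))   # one pair at each distance from 1 to 20
```

Under this reading, the value can be interpreted as the effective number of distance bins occupied by the atom pairs.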

*F09[C-C]* counts all the atom pairs of carbon atoms at a topological distance equal to 9. *F09[C-C]* grows with the molecule size and complexity considering only the carbon atoms.
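
The counting behind an atom-pair frequency descriptor can be sketched with a breadth-first search over the molecular graph. The example assumes a hydrogen-depleted graph given as a list of element symbols plus a list of bonds; `topological_distances` and `atom_pair_frequency` are hypothetical helper names, not part of alvaDesc.

```python
from collections import deque

def topological_distances(adjacency, source):
    """Shortest path length (in bonds) from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def atom_pair_frequency(elements, bonds, elem_a, elem_b, distance):
    """Count unordered atom pairs (elem_a, elem_b) at the given topological distance."""
    adjacency = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    count = 0
    for i in range(len(elements)):
        dist = topological_distances(adjacency, i)
        for j in range(i + 1, len(elements)):
            if dist.get(j) == distance and {elements[i], elements[j]} == {elem_a, elem_b}:
                count += 1
    return count

# Toy molecule: unbranched chain of 10 carbon atoms (decane skeleton).
decane_atoms = ["C"] * 10
decane_bonds = [(i, i + 1) for i in range(9)]
print(atom_pair_frequency(decane_atoms, decane_bonds, "C", "C", 9))
```

Only the two terminal carbons of the 10-carbon chain are 9 bonds apart, so smaller molecules contribute nothing to this descriptor.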

*TPSA(Tot)* is an estimation of the polar surface area of a molecule. Polar surface area has been used in medicinal chemistry for the optimisation of a drug’s capability to permeate cells and it is considered an important descriptor to evaluate the blood-brain barrier penetration.

Finally, *MLOGP2* is the square value of the Moriguchi octanol-water partition coefficient model and it is a recurrent descriptor in QSAR models for the prediction of blood-brain barrier permeability.

Each model also has an Applicability Domain (AD). In particular, the AD of the consensus model (MC) is calculated as the conjunction of the contained models’ active ADs: a molecule is considered within the consensus model’s AD only if it is within the ADs of all the contained models.
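
The conjunction rule can be written as a one-line check. The sketch assumes each contained model with an active AD reports a boolean "inside the domain" flag for a molecule; the dictionary layout and the function name `in_consensus_domain` are illustrative and do not reflect alvaRunner's actual output format.

```python
def in_consensus_domain(ad_flags):
    """True only if the molecule is inside the AD of every contained model."""
    return all(ad_flags.values())

# Hypothetical per-model AD results for two molecules.
molecule_a = {"M1": True, "M2": True, "M3": True}
molecule_b = {"M1": True, "M2": False, "M3": True}

print(in_consensus_domain(molecule_a))
print(in_consensus_domain(molecule_b))
```

A single contained model flagging the molecule as outside its domain is therefore enough to place it outside the consensus AD.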

The data curation process is described in the paper. The resulting dataset was divided into training and test sets:

| Dataset | Total | BBB+ | BBB- |
|---|---|---|---|
| Training | 3525 | 2257 | 1268 |
| Test | 359 | 204 | 155 |

The scores of the models of the alvaRunner project are presented in the following table:

| Model | k | Distance | Training Sensitivity | Training Specificity | Training Accuracy | Training Balanced Accuracy | CV Accuracy | Test Sensitivity | Test Specificity | Test Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 5 | Jaccard Tanimoto | 0.879 | 0.662 | 0.801 | 0.771 | 0.798 | 0.755 | 0.910 | 0.822 |
| M2 | 5 | Jaccard Tanimoto | 0.902 | 0.619 | 0.800 | 0.761 | 0.796 | 0.775 | 0.890 | 0.825 |
| M3 | 5 | Euclidean | 0.898 | 0.637 | 0.804 | 0.767 | 0.799 | 0.775 | 0.871 | 0.816 |
| MC | – | – | 0.917 | 0.645 | 0.819 | 0.781 | 0.814 | 0.760 | 0.916 | 0.827 |
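
The scores in this table are linked by simple arithmetic, which the following sketch checks for M1. It assumes BBB+ is the positive class, so that sensitivity is the fraction of BBB+ molecules predicted correctly and specificity the fraction of BBB- molecules predicted correctly; the function names are illustrative. The class counts come from the dataset table above.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

def accuracy(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy recovered from per-class rates and class sizes."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)

# M1, training set: 2257 BBB+ and 1268 BBB- molecules.
print(f"{balanced_accuracy(0.879, 0.662):.4f}")
print(f"{accuracy(0.879, 0.662, 2257, 1268):.4f}")

# M1, test set: 204 BBB+ and 155 BBB- molecules.
print(f"{accuracy(0.755, 0.910, 204, 155):.4f}")
```

Because the inputs are rounded to three decimals, the recomputed values agree with the reported 0.771, 0.801 and 0.822 only to within rounding.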