In the paper “Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., & Consonni, V. (2013). Quantitative structure-activity relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53(4), 867–878. https://doi.org/10.1021/ci4000213”, the authors presented four QSAR classification models to predict the ready biodegradability of chemicals.

The information described in the paper was used to build the alvaRunner project presented here. The original dataset includes 1,055 molecules, from which 3 duplicates were removed, leaving 1,052 molecules. Each duplicate had the same *ready biodegradability* class and the same molecular structure as another molecule in the dataset, differing only in stereochemistry. The duplicates were identified with the alvaMolecule duplicates analysis, using the *‘Ignore stereochemistry’* option. The resulting dataset was used as the *training set*, whereas the *test set* used to evaluate the performance of the models comprises 670 molecules.

## alvaRunner project

This alvaRunner project contains four classification models:

- M1: a KNN model based on molecular descriptors (MD)
- M2: a PLS-DA model based on molecular descriptors (MD)
- M3: an SVM model based on molecular descriptors (MD)
- MC: a Consensus model based on M1, M2 and M3 (the contained models)
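
The text does not state how MC combines the three contained models. As a minimal sketch only, the snippet below assumes a simple majority vote; the function name `consensus_vote` and the class labels `"RB"` and `"NRB"` are hypothetical and are not taken from the alvaRunner project.

```python
from collections import Counter


def consensus_vote(predictions):
    """Return the majority class among the predictions, or None on a tie.

    Assumption: the consensus is a plain majority vote. The project may
    combine M1, M2 and M3 differently.
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no strict majority
    return counts[0][0]


# Hypothetical predictions of M1, M2 and M3 for a single molecule
vote = consensus_vote(["RB", "NRB", "RB"])
print(vote)
```

With three models and two classes a tie cannot occur; the `None` branch only matters if a model abstains or more classes are involved.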

The first model is a KNN model (*k=6*) and includes the following 12 molecular descriptors:

- nO: number of Oxygen atoms
- nHM: number of heavy atoms
- C%: percentage of C atoms
- SpMax_L: leading eigenvalue from Laplace matrix
- J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
- nCp: number of terminal primary C(sp3)
- nCb-: number of substituted benzene C(sp2)
- SdssC: Sum of dssC E-states
- NssssC: Number of atoms of type ssssC
- F01[N-N]: Frequency of N – N at topological distance 1
- F03[C-N]: Frequency of C – N at topological distance 3
- F04[C-N]: Frequency of C – N at topological distance 4
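
To illustrate how a KNN classifier with *k=6* assigns a class, here is a minimal sketch in plain Python. It is not the alvaRunner implementation: the Euclidean metric, the tie-breaking rule, the two-descriptor toy data and the labels `"RB"` / `"NRB"` are all assumptions made for the example, and in practice the 12 descriptors would normally be scaled first.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=6):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Assumption: Euclidean distance; the text does not name the metric.
    ranked = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in ranked[:k]]
    counts = Counter(nearest).most_common()
    # k=6 is even, so a 3-3 tie is possible. Assumption: fall back to the
    # class of the single closest neighbour.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return nearest[0]
    return counts[0][0]


# Toy training set with two hypothetical descriptors per molecule
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = ["NRB", "NRB", "NRB", "NRB", "RB", "RB", "RB", "RB"]

print(knn_predict(train_X, train_y, (5.5, 5.5)))
print(knn_predict(train_X, train_y, (0.5, 0.5)))
```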

The second model is a PLS-DA model and includes the following 22 molecular descriptors:

- Me: mean atomic Sanderson electronegativity (scaled on Carbon atom)
- Mi: mean first ionization potential (scaled on Carbon atom)
- nO: number of Oxygen atoms
- C%: percentage of C atoms
- nCIR: number of circuits
- LOC: lopping centric index
- SpMax_A: leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
- TI2_L: second Mohar index from Laplace matrix
- SpMax_L: leading eigenvalue from Laplace matrix
- SM6_L: spectral moment of order 6 from Laplace matrix
- HyWi_B(m): hyper-Wiener-like index (log function) from Burden matrix weighted by mass
- SpPosA_B(p): normalized spectral positive sum from Burden matrix weighted by polarizability
- nN-N: number of N hydrazines
- nArNO2: number of nitro groups (aromatic)
- nCRX3: number of CRX3
- N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
- SdO: Sum of dO E-states
- B01[C-Br]: Presence/absence of C – Br at topological distance 1
- B03[C-Cl]: Presence/absence of C – Cl at topological distance 3
- B04[C-Br]: Presence/absence of C – Br at topological distance 4
- F03[C-O]: Frequency of C – O at topological distance 3
- F04[C-N]: Frequency of C – N at topological distance 4

In the original paper the second model included 23 molecular descriptors. In this project the *Psi_i_1d* descriptor has been removed because it has missing values for some molecules.

The third model is an SVM model and includes the following 14 molecular descriptors:

- nN: number of Nitrogen atoms
- nX: number of halogen atoms
- Psi_i_A: intrinsic state pseudoconnectivity index – type S average
- SpMax_L: leading eigenvalue from Laplace matrix
- SpMax_B(m): leading eigenvalue from Burden matrix weighted by mass
- SM6_B(m): spectral moment of order 6 from Burden matrix weighted by mass
- nCrt: number of ring tertiary C(sp3)
- nCb-: number of substituted benzene C(sp2)
- nArCOOR: number of esters (aromatic)
- nN-N: number of N hydrazines
- nHDon: number of donor atoms for H-bonds (N and O)
- C-026: R–CX–R
- NssssC: Number of atoms of type ssssC
- F02[C-N]: Frequency of C – N at topological distance 2

In the following table we present the scores of the four models contained in the alvaRunner project:

| | Training | | | | Training CV | | | | Test | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model name | Acc | B.Acc | Sn | Sp | Acc_{cv} | B.Acc_{cv} | Sn_{cv} | Sp_{cv} | Acc | B.Acc | Sn | Sp |
| M1 (12 desc) | 0.873 | 0.865 | 0.842 | 0.888 | 0.867 | 0.857 | 0.828 | 0.887 | 0.866 | 0.840 | 0.780 | 0.900 |
| M2 (22 desc) | 0.854 | 0.824 | 0.734 | 0.914 | 0.849 | 0.823 | 0.743 | 0.903 | 0.861 | 0.821 | 0.728 | 0.914 |
| M3 (14 desc) | 0.880 | 0.883 | 0.890 | 0.875 | 0.861 | 0.861 | 0.862 | 0.861 | 0.875 | 0.849 | 0.791 | 0.908 |
| MC | 0.892 | 0.880 | 0.845 | 0.915 | 0.884 | 0.872 | 0.836 | 0.908 | 0.873 | 0.840 | 0.764 | 0.916 |

*Acc: Accuracy, B.Acc: Balanced Accuracy, Sn: Sensitivity, Sp: Specificity, CV: 5-fold cross-validation (Venetian blinds)*
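
For readers who want to reproduce these scores, the sketch below computes the four metrics from true and predicted classes and shows how Venetian-blinds folds are usually assigned (every fifth sample goes to the same fold). The function names, the toy labels and the choice of `"RB"` as the positive class are assumptions for illustration; the text does not say which class is treated as positive.

```python
def classification_scores(y_true, y_pred, positive):
    """Return Acc, B.Acc, Sn and Sp for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sn = tp / (tp + fn)  # sensitivity: recall of the positive class
    sp = tn / (tn + fp)  # specificity: recall of the negative class
    return {
        "Acc": (tp + tn) / len(pairs),
        "B.Acc": (sn + sp) / 2,  # balanced accuracy: mean of Sn and Sp
        "Sn": sn,
        "Sp": sp,
    }


def venetian_blinds_folds(n_samples, n_folds=5):
    """Assign sample i to fold i % n_folds (interleaved, not contiguous)."""
    return [i % n_folds for i in range(n_samples)]


# Hypothetical labels, with "RB" assumed to be the positive class
y_true = ["RB", "RB", "RB", "NRB", "NRB", "NRB", "NRB", "NRB"]
y_pred = ["RB", "RB", "NRB", "NRB", "NRB", "NRB", "NRB", "RB"]
scores = classification_scores(y_true, y_pred, positive="RB")
folds = venetian_blinds_folds(7)
print(scores)
print(folds)
```

The table is consistent with this definition of balanced accuracy: for M1 on the training set, (0.842 + 0.888) / 2 = 0.865.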

All four models included in the alvaRunner project are associated with a specific applicability domain (AD) to identify chemicals that fall within the chemical space of the model:

- for M1, a molecule is considered outside the AD if its distance from its closest neighbor in the training set is greater than two times the average distance of the training molecules
- for M2 and M3, the *Leverage* method is used (threshold factor = 3)
- for MC, a molecule is considered inside the AD only if it is inside the AD of M1, M2 and M3
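
The three applicability domain rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the alvaRunner implementation: "average distance of the training molecules" is read as the mean nearest-neighbor distance within the training set, the leverage is computed on mean-centered descriptors as h = xᵀ(XᵀX)⁻¹x with threshold 3·p/n (p descriptors, n training molecules), and all function names and toy data are hypothetical.

```python
import math


def knn_ad_inside(train_X, query, factor=2.0):
    """Distance rule (M1): nearest-neighbour distance <= factor * average."""
    # Assumption: the average is taken over each training molecule's
    # distance to its own nearest training neighbour.
    nn = [
        min(math.dist(a, b) for j, b in enumerate(train_X) if j != i)
        for i, a in enumerate(train_X)
    ]
    threshold = factor * sum(nn) / len(nn)
    return min(math.dist(query, row) for row in train_X) <= threshold


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def leverage_ad_inside(train_X, query, factor=3.0):
    """Leverage rule (M2, M3): h = x^T (X^T X)^-1 x <= factor * p / n."""
    n, p = len(train_X), len(train_X[0])
    # Assumption: descriptors are mean-centred on the training set.
    means = [sum(row[j] for row in train_X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in train_X]
    xc = [query[j] - means[j] for j in range(p)]
    XtX = [[sum(r[a] * r[b] for r in Xc) for b in range(p)] for a in range(p)]
    h = sum(xi * zi for xi, zi in zip(xc, solve(XtX, xc)))
    return h <= factor * p / n


def consensus_ad_inside(flags):
    """MC rule: inside only if inside the AD of every contained model."""
    return all(flags)


# Toy training set with two hypothetical descriptors per molecule
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
query = (3, 3)  # between the two clusters
m1_inside = knn_ad_inside(train_X, query)
m2_inside = leverage_ad_inside(train_X, query)
print(m1_inside, m2_inside, consensus_ad_inside([m1_inside, m2_inside]))
```

In the real project each model has its own descriptor set, so each check would run on a different matrix; the same toy matrix is reused here only to keep the example short. The query at (3, 3) shows why the two rules can disagree: it sits at the centroid of the training data (zero leverage) but far from any individual training molecule.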