Model: Ready Biodegradability

In the paper “Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., & Consonni, V. (2013). Quantitative structure-activity relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53(4), 867–878. https://doi.org/10.1021/ci4000213”, the authors presented four QSAR classification models to predict the ready biodegradability of chemicals.

The information described in the paper was used to build the alvaRunner project presented here. The original dataset includes 1,055 molecules, from which 3 duplicate molecules were removed. Each duplicate had the same ready biodegradability class and the same molecular structure as another molecule, differing only in stereochemistry; the duplicates were identified using the alvaMolecule duplicates analysis with the ‘Ignore stereochemistry’ option. The resulting dataset was used as the training set, whereas the test set used to evaluate the performance of the models comprises 670 molecules.

alvaRunner project

This alvaRunner project contains four classification models:

M1: a KNN model based on molecular descriptors (MD)
M2: a PLS-DA model based on molecular descriptors (MD)
M3: an SVM model based on molecular descriptors (MD)
MC: a Consensus model based on M1, M2 and M3 (the contained models)

The first model is a KNN model (k=6) and includes the following 12 molecular descriptors:

nO: number of Oxygen atoms
nHM: number of heavy atoms
C%: percentage of C atoms
SpMax_L: leading eigenvalue from Laplace matrix
J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
nCp: number of terminal primary C(sp3)
nCb-: number of substituted benzene C(sp2)
SdssC: Sum of dssC E-states
NssssC: Number of atoms of type ssssC
F01[N-N]: Frequency of N – N at topological distance 1
F03[C-N]: Frequency of C – N at topological distance 3
F04[C-N]: Frequency of C – N at topological distance 4
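
As a rough illustration of how a k-nearest-neighbours classifier with k = 6 assigns a class, the sketch below uses plain Python. The Euclidean distance, the majority vote, the tie-breaking rule, the class labels and the toy two-descriptor data are all assumptions made for illustration; the source does not state the distance metric or the descriptor scaling used by M1.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)


def knn_predict(train_X, train_y, query, k=6):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Sort training indices by distance to the query and keep the k closest.
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    # Assumption: ties are broken in favour of the label that sorts first.
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Toy data: two hypothetical, already scaled descriptor values per molecule.
# "RB" / "NRB" are illustrative labels (ready / not ready biodegradable).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (0.3, 0.3),
           (0.9, 0.8), (1.0, 1.0), (0.8, 0.9), (1.1, 0.9)]
train_y = ["RB", "RB", "RB", "RB", "NRB", "NRB", "NRB", "NRB"]

prediction = knn_predict(train_X, train_y, (0.15, 0.15))
print(prediction)
```

With k = 6 and only four molecules per class in the toy set, the six neighbours always include both classes, so the vote is decided by which class contributes more of the closest points.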

The second model is a PLS-DA model and includes the following 22 molecular descriptors:

Me: mean atomic Sanderson electronegativity (scaled on Carbon atom)
Mi: mean first ionization potential (scaled on Carbon atom)
nO: number of Oxygen atoms
C%: percentage of C atoms
nCIR: number of circuits
LOC: lopping centric index
SpMax_A: leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
TI2_L: second Mohar index from Laplace matrix
SpMax_L: leading eigenvalue from Laplace matrix
SM6_L: spectral moment of order 6 from Laplace matrix
HyWi_B(m): hyper-Wiener-like index (log function) from Burden matrix weighted by mass
SpPosA_B(p): normalized spectral positive sum from Burden matrix weighted by polarizability
nN-N: number of N hydrazines
nArNO2: number of nitro groups (aromatic)
nCRX3: number of CRX3
N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
SdO: Sum of dO E-states
B01[C-Br]: Presence/absence of C – Br at topological distance 1
B03[C-Cl]: Presence/absence of C – Cl at topological distance 3
B04[C-Br]: Presence/absence of C – Br at topological distance 4
F03[C-O]: Frequency of C – O at topological distance 3
F04[C-N]: Frequency of C – N at topological distance 4

In the original paper, the second model included 23 molecular descriptors. In this project, the Psi_i_1d descriptor has been removed because it has missing values for some molecules.
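
The kind of check that leads to dropping a descriptor can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used to build the project: the helper names and the descriptor values in the toy table are hypothetical.

```python
import math


def descriptors_with_missing(table):
    """Return the names of descriptor columns that contain at least one missing value."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [name for name, column in table.items() if any(is_missing(v) for v in column)]


def drop_descriptors(table, names):
    """Return a copy of the table without the listed descriptor columns."""
    return {name: column for name, column in table.items() if name not in names}


# Toy descriptor table: the values are invented for illustration only.
table = {
    "nO": [2, 0, 1],
    "C%": [33.3, 40.0, 28.6],
    "Psi_i_1d": [0.012, float("nan"), -0.004],
}

to_drop = descriptors_with_missing(table)
cleaned = drop_descriptors(table, to_drop)
print(to_drop, list(cleaned))
```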

The third model is an SVM model and includes the following 14 molecular descriptors:

nN: number of Nitrogen atoms
nX: number of halogen atoms
Psi_i_A: intrinsic state pseudoconnectivity index – type S average
SpMax_L: leading eigenvalue from Laplace matrix
SpMax_B(m): leading eigenvalue from Burden matrix weighted by mass
SM6_B(m): spectral moment of order 6 from Burden matrix weighted by mass
nCrt: number of ring tertiary C(sp3)
nCb-: number of substituted benzene C(sp2)
nArCOOR: number of esters (aromatic)
nN-N: number of N hydrazines
nHDon: number of donor atoms for H-bonds (N and O)
C-026: R–CX–R
NssssC: Number of atoms of type ssssC
F02[C-N]: Frequency of C – N at topological distance 2

In the following table we present the scores of the four models contained in the alvaRunner project:

	Training				Training CV				Test
Model name	Acc	B.Acc	Sn	Sp	Acc_cv	B.Acc_cv	Sn_cv	Sp_cv	Acc	B.Acc	Sn	Sp
M1 (12 desc)	0.873	0.865	0.842	0.888	0.867	0.857	0.828	0.887	0.866	0.840	0.780	0.900
M2 (22 desc)	0.854	0.824	0.734	0.914	0.849	0.823	0.743	0.903	0.861	0.821	0.728	0.914
M3 (14 desc)	0.880	0.883	0.890	0.875	0.861	0.861	0.862	0.861	0.875	0.849	0.791	0.908
MC	0.892	0.880	0.845	0.915	0.884	0.872	0.836	0.908	0.873	0.840	0.764	0.916

Acc: Accuracy, B.Acc: Balanced Accuracy, Sn: Sensitivity, Sp: Specificity, CV: 5-fold cross-validation (Venetian blinds)
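
For readers who want to relate these scores to one another, the sketch below computes them from confusion-matrix counts using the standard definitions. The counts are hypothetical. The second part checks that the training B.Acc values in the table above equal the mean of Sn and Sp within rounding.

```python
def classification_scores(tp, fn, tn, fp):
    """Return Acc, B.Acc, Sn and Sp computed from confusion-matrix counts."""
    sn = tp / (tp + fn)   # sensitivity: true positive rate
    sp = tn / (tn + fp)   # specificity: true negative rate
    return {
        "Acc": (tp + tn) / (tp + fn + tn + fp),
        "B.Acc": (sn + sp) / 2,
        "Sn": sn,
        "Sp": sp,
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = classification_scores(tp=80, fn=20, tn=180, fp=20)
print({name: round(value, 3) for name, value in scores.items()})

# Training-set (Sn, Sp, B.Acc) values copied from the table above.
reported = {
    "M1": (0.842, 0.888, 0.865),
    "M2": (0.734, 0.914, 0.824),
    "M3": (0.890, 0.875, 0.883),
    "MC": (0.845, 0.915, 0.880),
}
# The reported values are rounded to three decimals, so allow a small tolerance.
consistent = all(abs((sn + sp) / 2 - b_acc) < 0.0011 for sn, sp, b_acc in reported.values())
print(consistent)
```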

All four models included in the alvaRunner project are associated with a specific applicability domain (AD) to identify chemicals that fall within the chemical space of the model:

for M1, a molecule is considered outside the AD if its distance from the closest neighbor in the training set is greater than twice the average distance of the training molecules
for M2 and M3 models, the Leverage method is used (threshold factor = 3)
a molecule is considered inside the AD of the MC model if it is inside the AD of all three models (M1, M2 and M3)
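
The three rules above can be sketched in plain Python as follows. Several details are assumptions, since the source does not spell them out: the distance is Euclidean, the "average distance" is taken as the mean nearest-neighbour distance within the training set, the leverage is h = xᵀ(XᵀX)⁻¹x on already centred descriptors, and the leverage threshold is factor × p / n (p descriptors, n training molecules). In the project each model has its own descriptor set, whereas this toy example reuses one two-descriptor matrix for brevity.

```python
from math import dist


def inside_ad_m1(query, train_X, factor=2.0):
    """Distance rule: inside the AD if the nearest-neighbour distance <= factor * average."""
    # Assumption: "average distance" = mean nearest-neighbour distance among training points.
    nn_train = [min(dist(a, b) for j, b in enumerate(train_X) if j != i)
                for i, a in enumerate(train_X)]
    average = sum(nn_train) / len(nn_train)
    nearest = min(dist(query, b) for b in train_X)
    return nearest <= factor * average


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def leverage(query, train_X):
    """Leverage h = x^T (X^T X)^-1 x of a query with respect to the training matrix X."""
    p = len(query)
    XtX = [[sum(row[i] * row[j] for row in train_X) for j in range(p)] for i in range(p)]
    z = solve(XtX, list(query))
    return sum(x * zi for x, zi in zip(query, z))


def inside_ad_leverage(query, train_X, factor=3.0):
    """Leverage rule with threshold factor * p / n (assumed form of the threshold)."""
    p, n = len(query), len(train_X)
    return leverage(query, train_X) <= factor * p / n


def inside_ad_consensus(in_m1, in_m2, in_m3):
    """A molecule is inside the consensus AD only if it is inside all three ADs."""
    return in_m1 and in_m2 and in_m3


# Toy, already centred training descriptors (invented values).
train_X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0), (0.0, 0.5), (0.0, -0.5)]

for query in [(0.5, 0.5), (3.0, 3.0)]:
    in_m1 = inside_ad_m1(query, train_X)
    in_lev = inside_ad_leverage(query, train_X)
    print(query, in_m1, in_lev, inside_ad_consensus(in_m1, in_lev, in_lev))
```

In this toy set, the query close to the training points passes both checks, while the distant one fails both and therefore also falls outside the consensus AD.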
