Predicting the boiling point of PCDD/Fs by the QSPR method based on the molecular distance-edge vector index
Long Jiao (I,*); Xiaofei Wang (I); Shan Bing (I); Zhiwei Xue (II); Hua Li (III)
(I) College of Chemistry and Chemical Engineering, Xi'an Shiyou University, Xi'an 710065, China
Received 17/10/2014
*e-mail: mop@xsyu.edu.cn

The quantitative structure property relationship (QSPR) for the boiling point (Tb) of polychlorinated dibenzo-p-dioxins and polychlorinated dibenzofurans (PCDD/Fs) was investigated. The molecular distance-edge vector (MDEV) index was used as the structural descriptor. The quantitative relationship between the MDEV index and Tb was modeled by using multivariate linear regression (MLR) and artificial neural network (ANN), respectively. Leave-one-out cross validation and external validation were carried out to assess the prediction performance of the models developed. For the MLR method, the prediction root mean square relative errors (RMSRE) of leave-one-out cross validation and external validation were 1.77 and 1.23, respectively. For the ANN method, the prediction RMSRE of leave-one-out cross validation and external validation were 1.65 and 1.16, respectively. A quantitative relationship between the MDEV index and Tb of PCDD/Fs was demonstrated. Both MLR and ANN are practicable for modeling this relationship, and the resulting models were used to predict the Tb of each PCDD/F.

INTRODUCTION

Polychlorinated dibenzo-p-dioxins and polychlorinated dibenzofurans (PCDD/Fs) are two series of persistent organic pollutants which have been detected in almost all compartments of the global ecosystem. These chemicals have gained much attention due to their toxicity, environmental persistence, tendency to accumulate through the food chain, and the risk they pose to human health. PCDD/Fs are not produced intentionally and do not serve any useful purpose. They are formed as byproducts of many industrial and combustion processes. PCDD/Fs are semi-volatile compounds. After being released into the atmosphere, they are likely to transfer to other environmental compartments such as soil, water, sediments and their resident biota, where they can persist for years before degradation.1-5

The boiling point (Tb) is an important property for studying the volatility of PCDD/Fs, which is correlated with the fate, transport, and transformation of PCDD/Fs in the environment. Boiling point is also a significant factor in determining physico-chemical properties of PCDD/Fs, such as vapor pressure, octanol/water partitioning coefficient and aqueous solubility.6-9 A quantitative study of Tb is necessary to understand the environmental behavior of PCDD/Fs. Experimentally determining the Tb of PCDD/Fs remains difficult because of the complexity of the analytical methods, the high cost of experiments and the lack of standards.7 In addition, the measurement of the boiling point of PCDD/Fs is hazardous due to the high vapor pressures involved.6 Up to now, the Tb has not been experimentally determined for every PCDD/F congener. The quantitative structure property relationship (QSPR) method is safe, fast, convenient and cost-effective for predicting the properties of compounds. Therefore, it is worthwhile to develop an accurate and easy-to-use QSPR model for predicting the Tb of PCDD/Fs.

A topological index is a kind of structural descriptor which is often used in QSPR research. It can efficiently describe the structure of a molecule without detailed molecular orbital calculations. It is useful because, despite its mathematical simplicity, a topological index is able to differentiate molecules with different structures.10,11 The aim of this work is to develop a QSPR model for the Tb of PCDD/Fs based on a topological index. The molecular distance-edge vector (MDEV) index12-17 was used as the structural descriptor of PCDD/Fs. Multivariate linear regression (MLR) and linear artificial neural network (L-ANN) were employed to model the quantitative relationship between the Tb and the MDEV index of PCDD/Fs.
EXPERIMENTAL

Data set

The MDEV index was calculated according to the approach presented in the following section. The MDEV indices of the 52 PCDD/Fs whose Tb values are known are listed in Table 1. The observed Tb values of these PCDD/Fs were taken from the literature7,18 and are listed in Table 2.
Root mean square relative error (RMSRE) was used to indicate the prediction performance of the developed models. The RMSRE is defined as Equation 1:

$$\mathrm{RMSRE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} RE_i^{2}}, \qquad RE_i = \frac{T_{b,\mathrm{pred},i} - T_{b,\mathrm{obs},i}}{T_{b,\mathrm{obs},i}} \times 100\% \tag{1}$$

where REi is the relative error of the ith sample, n is the number of samples, and Tb,pred and Tb,obs are the predicted and observed Tb, respectively.

MDEV index

For calculating the MDEV index of a molecule, the whole molecule is regarded as a topological graph. Each non-hydrogen atom is considered as a point and each chemical bond is considered as an edge. The relative electronegativity of each chlorine atom and benzene ring is defined as 1. Correspondingly, the MDEV index is defined as Equation 2:

$$M_{kl} = \sum_{i \in k,\; j \in l} \frac{1}{d_{ik,jl}^{2}} \qquad (k, l = 1, 2;\ l \ge k) \tag{2}$$

In Equation 2, k and l denote the type of atom (k = 1 or l = 1 denotes the chlorine atom, and k = 2 or l = 2 denotes the benzene ring); i and j are the coding numbers of a chlorine atom or a benzene ring, where i belongs to the kth type and j belongs to the lth type. The term dik,jl represents the shortest relative distance between the ith and jth atom. For example, di1,j1 indicates the shortest relative distance between the ith and jth chlorine atom. The relative distance between two adjacent non-hydrogen atoms is defined as d = 1. According to Equation 2, there are three elements, M11, M12 and M22, in the MDEV index of a PCDD/F molecule. For instance, the MDEV index of 2,3,7-CDD is calculated as follows: The MDEV index of 2,4,6-CDF is calculated as follows:

Artificial neural network

ANN14-17,19-29 is a multivariate calibration approach capable of modeling various complex functions. Its basic processing unit is the neuron (node). An ANN comprises a number of neurons organized in different layers. A linear artificial neural network25-29 is a kind of neural network having no hidden layers, but an output layer with fully linear neurons (that is, linear neurons with a linear activation function). It is the simplest ANN and is usually used to develop linear models. It is often used as a benchmark against which to compare the prediction performance of other methods. Although a number of multivariable calibration problems cannot be solved, or solved well, by L-ANN, many others can. It is common to find that a problem which was perceived to be difficult and non-linear can actually be solved satisfactorily by using L-ANN. In L-ANN, the neurons of the input and output layers are fully connected, while the neurons within the same layer are not connected. Figure 1 shows the basic architecture of the L-ANN.
Figure 1. The architecture of linear artificial neural network
In Figure 1, xi (i = 1, 2, ..., n), yj (j = 1, 2, ..., m) and wij are the input variables, the output variables and the elements of the connection weight matrix W, respectively; bj is the bias vector, which corresponds to the thresholds. The symbol fact( ) denotes the activation function. Before the training procedure, the input and output variables are normalized. When the network is executed, it effectively multiplies the input variables by the weight matrix W, and then adds the bias vector bj. Hence, the post synaptic potential (PSP) function of the neuron is described by Equation 5:

$$PSP_j = \sum_{i=1}^{n} w_{ij}\, x_i + b_j \tag{5}$$

Generally, the activation function used in L-ANN is a linear function:

$$y_j = f_{act}(PSP_j) = PSP_j \tag{6}$$

Because there are no non-linear functions and no hidden neurons in the network, L-ANN is good at solving linear problems. Training a linear network means finding the optimal value of the weight matrix W to minimize the root mean squared error of the calibration set. To reach this goal, the known samples are divided into two subsets: a training set and a verification set. The network is trained by using the training set, and is tested after each epoch by using the verification set. The training is terminated once the root mean squared error of the verification set starts to deteriorate. Over-fitting and over-learning are avoided in this way. Although the verification set is used to find the best network setting, the training algorithms do not use the verification set to adjust the network weights. The standard pseudo-inverse linear optimization algorithm26 is usually used to train the network. This algorithm uses the singular value decomposition technique to calculate the pseudo-inverse of the matrix needed to set the weights in the linear output layer, so as to find the least mean squared solution. Essentially, it guarantees finding the optimal setting for the weight matrix in a linear layer. The main difference between MLR and L-ANN is the optimization algorithm. In MLR, the goal of the least squares algorithm is to find the minimal sum of squared residuals of the training set. In L-ANN, the goal of the training algorithm is to minimize the root mean squared error of the verification set.26 Thus, the prediction ability of L-ANN is usually better than that of MLR.

Leave-one-out cross-validation

Leave-one-out cross-validation15-17,30 is a commonly used algorithm for estimating the predictive performance and robustness of a multivariable calibration model. Usually, practical calibration experiments have to be based on a limited set of available samples. The idea behind the leave-one-out cross validation algorithm is to predict the property value of each sample in turn with a calibration model developed from the other samples. When applying the algorithm to a dataset of n samples, the calibration modeling is performed n times, each time using (n-1) samples for modeling and one sample for testing. Hence, the procedure can be divided into n segments. In each segment i (i = 1, ..., n), there are three steps: (1) taking sample i out as a temporary 'test set', which is not used to establish the calibration model, (2) developing a calibration model with the remaining (n-1) samples, (3) testing the established model with sample i, then computing and storing the prediction error of that sample. The advantage of leave-one-out cross validation over random sub-sampling is that each sample is used for validation exactly once. Although leave-one-out cross-validation is an effective and commonly used method, there is still a risk of overestimating the predictive performance and robustness of a model when using it. It is therefore common to use two or more validation methods to assess a calibration model, which lessens the risk of overestimation.
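The three steps of each leave-one-out segment can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function names are hypothetical, the data are synthetic stand-ins for the three MDEV elements and Tb, and the relative error is scaled to percent as an assumption.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit of y = X @ w + b; returns the stacked vector [w, b]."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply a fitted [w, b] vector to new descriptor rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def leave_one_out(X, y):
    """Predict each sample from a model fitted on the other n - 1 samples."""
    n = len(y)
    predictions = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # step 1: hold out sample i
        coef = fit_linear(X[keep], y[keep])      # step 2: calibrate on the rest
        predictions[i] = predict_linear(coef, X[i:i + 1])[0]  # step 3: test on i
    return predictions


def rmsre_percent(predicted, observed):
    """Root mean square relative error; percent scaling is an assumption."""
    rel_err = (predicted - observed) / observed * 100.0
    return float(np.sqrt(np.mean(rel_err ** 2)))


# Synthetic data only (not the values of Table 1 or Table 2).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(42, 3))
y_exact = X @ np.array([-60.0, 36.0, -2.0]) + 580.0
y_noisy = y_exact + rng.normal(0.0, 5.0, size=42)

loo_pred = leave_one_out(X, y_noisy)
print("LOO RMSRE on synthetic data:", rmsre_percent(loo_pred, y_noisy))
```

Because every model is refitted without the held-out sample, the stored error for sample i never benefits from that sample's own observed value.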
External validation

External validation17,25,30 is an algorithm which is often used to assess the predictive ability of a calibration model. When using this algorithm, the working dataset is split into two subsets: a calibration set, which is used to develop the calibration model, and a test set, which is used to assess the predictive ability of the developed model. The test set is designed to give an independent assessment of the predictive performance of the developed model. It is not used in developing the model at all, and hence is independent of the calibration set. Generally, the calibration set and the test set are randomly selected from the working dataset.

Software

All calculations were performed using subroutines developed in MATLAB (Ver. 7.0). The computation was performed on a personal computer equipped with an i5-2450M processor.
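As an illustration of the external validation procedure described above, the following Python sketch randomly splits 52 samples into a 42-sample calibration set and a 10-sample test set, fits a linear model on the former and scores it on the latter. The helper name `split_calibration_test`, the random seeds and the synthetic data are assumptions made for the example; they do not reproduce the split or the values used in this work.

```python
import numpy as np


def split_calibration_test(n_samples, n_test, seed=0):
    """Randomly partition sample indices into calibration and test subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    test_idx = np.sort(order[:n_test])
    cal_idx = np.sort(order[n_test:])
    return cal_idx, test_idx


# Synthetic descriptors and property values (placeholders, not Table 1 data).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(52, 3))
y = X @ np.array([-60.0, 36.0, -2.0]) + 580.0 + rng.normal(0.0, 5.0, size=52)

cal_idx, test_idx = split_calibration_test(52, 10)

# Calibrate on the calibration set only.
A_cal = np.column_stack([X[cal_idx], np.ones(len(cal_idx))])
coef, *_ = np.linalg.lstsq(A_cal, y[cal_idx], rcond=None)

# Predict the untouched test set and compute its relative error in percent.
A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
y_test_pred = A_test @ coef
rel_err = (y_test_pred - y[test_idx]) / y[test_idx] * 100.0
print("external RMSRE on synthetic data:", float(np.sqrt(np.mean(rel_err ** 2))))
```

The key property is that no index appears in both subsets, so the test-set error is independent of the calibration step.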
RESULTS AND DISCUSSION

The MDEV index of the PCDD/Fs was calculated. The results are listed in Table 1 and Table 3. The MDEV indices of different PCDD/F molecules differ clearly from one another, which demonstrates that the MDEV index can describe the structural differences among these compounds. It is therefore reasonable to use the MDEV index as the structural descriptor for developing a QSPR model of PCDD/Fs.
MLR model

Generally, a simple model should be preferred to a complex one unless the latter fits the data better. Thus, we first investigated whether MLR is feasible for modeling the quantitative relationship between the MDEV index and Tb of PCDD/Fs. The MDEV index was used as the independent variable and the Tb as the dependent variable to develop the regression model. To assess the predictive ability of the developed model, two validation methods, leave-one-out cross validation and external validation, were conducted. The 52 samples shown in Table 1 were randomly split into two groups: Group I, which comprises 42 samples, and Group II, which comprises 10 samples.

Leave-one-out cross validation was applied to Group I. The result is presented in Table 2. As shown in Table 2, the predicted Tb is consistent with the observed Tb. For the 42 compounds, the RMSRE of prediction is 1.77. Moreover, the predicted Tb values were plotted against the observed Tb values (Figure 2a), and the plot shows a linear relationship (y = 0.9904 x + 7.9881 with R = 0.9819) between the predicted and observed Tb.

Subsequently, external validation was carried out to further assess the predictive ability of the MLR model. In this procedure, the model was established by using all 42 compounds in Group I as the calibration set. The obtained regression equation is:

Tb = -60.50 M11 + 35.78 M12 - 2.14 M22 + 580.20

The R2, standard error of the estimate (S.E.) and F value of the regression model are 0.9672, 10.426 and 368.8, respectively. The F value exceeds the critical value [F > F0.01(3, 38)], indicating that the MDEV index is significantly related to Tb, so it is reasonable to develop a regression model between the MDEV index and Tb. The S.E. is much smaller than the sample mean of Tb, which shows that the obtained regression equation fits the data well. The Tb of the samples in Group II was then predicted by using the obtained regression equation. The prediction result is shown in Table 2. As shown in the table, the predicted Tb is in good agreement with the observed Tb. For the 10 compounds, the prediction RMSRE is 1.23. The plot of predicted Tb versus observed Tb is shown in Figure 2a, which shows a linear relationship (y = 0.9957 x + 0.9347 with R = 0.9864) between the predicted and observed Tb.
Figure 2. Observed Tb versus the predicted Tb of the: (a) MLR model; (b) L-ANN model
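For readers who wish to apply the Group I regression equation reported above, a minimal Python sketch follows. The coefficients are those stated in the text; the function name `predict_tb_mlr` and the descriptor values in the usage line are hypothetical and are not taken from Table 1.

```python
def predict_tb_mlr(m11, m12, m22):
    """Tb from the three MDEV elements, using the Group I regression equation."""
    return -60.50 * m11 + 35.78 * m12 - 2.14 * m22 + 580.20


# Hypothetical descriptor values, chosen only to show the call.
print(predict_tb_mlr(1.0, 1.0, 1.0))
```

The result is on the same scale as the Tb values in Table 2, since the equation was fitted to those values.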
The results of the leave-one-out cross validation and external validation demonstrate that the MDEV index of the investigated PCDD/Fs is quantitatively related to their Tb. In previous studies, the MDEV index has been used only as the structural descriptor for QSPR models of compounds sharing the same basic structure, such as the boiling point model of alcohols13 and the gas/particle partition coefficient model of PCBs,17 among others.14-16 The basic structure of polychlorinated dibenzo-p-dioxins is different from that of polychlorinated dibenzofurans. Thus, it is shown that the MDEV index can be used as the structural descriptor to establish QSPR models for compounds with different basic structures. In addition, the validation result demonstrates that MLR is practicable for modeling the quantitative relationship between the MDEV index and Tb of PCDD/Fs, and that a linear QSPR model based on the MDEV index is able to predict the Tb of PCDD/Fs. Thus, an MLR model was developed by using all 52 PCDD/Fs listed in Table 1. The obtained regression equation is:

Tb = -59.68 M11 + 35.49 M12 - 4.99 M22 + 583.46

The R2, S.E. and F value of the regression model are 0.9679, 10.81 and 477.1, respectively. The Tb of the other 53 PCDDs and 107 PCDFs was then predicted by using this regression equation. The result is shown in Table 3. The Tb values of these PCDD/Fs have not yet been experimentally determined. Thus, our prediction results can be used as estimates of the Tb of these compounds.

L-ANN model

L-ANN is another commonly used linear calibration method in QSPR studies. Thus, we investigated whether a better model can be established by using L-ANN. A 3-1 L-ANN (i.e. 3 input variables and 1 output variable in the network) was used to develop the calibration model. The MDEV index and Tb were used as the input and output variables, respectively. In each run of the ANN, ten samples were randomly selected and used as the verification set. Leave-one-out cross validation and external validation were carried out to assess the prediction performance of the developed model.

Group I was again used for the leave-one-out cross validation. The result is listed in Table 2. As shown in the table, the predicted Tb is in good agreement with the observed Tb. For the 42 compounds, the RMSRE of prediction is 1.65. The predicted Tb values were plotted against the observed Tb values (Figure 2b), and the plot shows a linear relationship (y = 0.9893 x + 9.0119 with R = 0.9847) between the predicted and observed Tb.

Then, all 52 samples were used for the external validation. An L-ANN model was developed by using the 42 samples of Group I as the calibration set. In the training procedure, the verification set comprised ten randomly selected samples. The Tb of the samples in Group II was predicted by using the obtained network. The result of external validation is also shown in Table 2. The predicted Tb is again in good agreement with the observed Tb. For the ten samples, the prediction RMSRE is 1.16. The plot of predicted Tb versus observed Tb (Figure 2b) shows a linear relationship (y = 0.9966 x + 0.9747 with R = 0.9875) between the predicted and observed Tb.

In both validations, the prediction accuracy of the L-ANN model is slightly higher than that of the MLR model (RMSRE of 1.65 versus 1.77 in leave-one-out cross validation, and 1.16 versus 1.23 in external validation). L-ANN is therefore slightly better than MLR at modeling the quantitative relationship between the MDEV index and Tb of PCDD/Fs, and it is a practicable and promising method for predicting the Tb of PCDD/Fs. Thus, a 3-1 L-ANN model was developed by using all 52 PCDD/Fs listed in Table 1. In the training procedure, 13 samples were randomly selected and used as the verification set. The Tb of the other 53 PCDDs and 107 PCDFs was then predicted by using this model. The result is also listed in Table 3. These predictions can likewise be used as estimates of the Tb of these compounds and, given the validation results, are expected to be slightly more accurate than those of the MLR model.
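The structure of a 3-1 linear network can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MATLAB subroutines used in this work: normalization and the verification-set step are omitted, the function names are hypothetical, the data are synthetic, and the weights are set in one step with NumPy's SVD-based pseudo-inverse.

```python
import numpy as np


def train_linear_network(X, y):
    """Weights and bias of a no-hidden-layer network via the pseudo-inverse."""
    A = np.column_stack([X, np.ones(len(X))])  # column of ones carries the bias
    solution = np.linalg.pinv(A) @ y           # pinv is computed through an SVD
    return solution[:-1], solution[-1]


def run_linear_network(weights, bias, X):
    """Forward pass: post synaptic potential followed by identity activation."""
    psp = X @ weights + bias
    return psp


# Synthetic, noise-free data with 3 inputs and 1 output (not Table 1 values).
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 5.0, size=(42, 3))
true_w = np.array([-60.0, 36.0, -2.0])
y = X @ true_w + 580.0

weights, bias = train_linear_network(X, y)
print(weights, bias)
print(run_linear_network(weights, bias, X[:2]))
```

On a single fixed training set with full-rank descriptors, this pseudo-inverse solution coincides with the ordinary least-squares solution, so any difference between the two approaches in practice comes from how the training and verification subsets are chosen.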
CONCLUSIONS

A QSPR model for predicting the boiling point of PCDD/Fs was investigated, with the MDEV index used as the structural descriptor. Both an MLR model and an L-ANN model were developed, and their predictive ability was assessed by leave-one-out cross validation and external validation. The validation results demonstrate that the MDEV index of PCDD/Fs is quantitatively related to their Tb, and that both MLR and L-ANN are practicable for modeling this relationship. Because the MDEV index can be calculated easily, it is convenient to develop a QSPR model for the Tb of PCDD/Fs based on it. The Tb of each PCDD/F congener was therefore predicted by using the developed models, and the predicted values can be used as estimates of the boiling point of PCDD/Fs.
ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (No. 21305108 and No. 21375105), the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2014JM2039) and the Innovative Research Team of Xi'an Shiyou University (No. 2013QNKYCXTD01).
REFERENCES

1. Kjeller, L.; Jones, K. C.; Ton, A. E. J.; Rappe, C.; Environ. Sci. Technol. 1996, 30, 1398. DOI: http://dx.doi.org/10.1021/es950708r
2. Moon, M. H.; Kim, H. J.; Kwon, S. Y.; Lee, S. J.; Chang, Y. S.; Lim, H.; Anal. Chem. 2004, 76, 3236. DOI: http://dx.doi.org/10.1021/ac049968u PMID: 15167807
3. Mclanchlan, M. S.; Sewart, A. P.; Bacon, J. R.; Jones, K. C.; Environ. Sci. Technol. 1996, 30, 2567. DOI: http://dx.doi.org/10.1021/es950932g
4. Atkinson, J. D.; Hung, P. C.; Zhang, Z.; Chang, M. B.; Yan, Z.; Rood, M. J.; Chemosphere 2015, 118, 136. DOI: http://dx.doi.org/10.1016/j.chemosphere.2014.07.055 PMID: 25150825
5. Do, L.; Liljelind, P.; Zhang, J.; Haglund, P.; J. Chromatogr. A 2013, 1311, 157. DOI: http://dx.doi.org/10.1016/j.chroma.2013.08.070 PMID: 24016718
6. Rordorf, B. F.; Chemosphere 1989, 18, 783. DOI: http://dx.doi.org/10.1016/0045-6535(89)90230-0
7. Admire, B.; Lian, B.; Yalkowsky, S. H.; Chemosphere 2014. DOI: http://dx.doi.org/10.1016/j.chemosphere.2014.06.053
8. Lian, B.; Yalkowsky, S. H.; J. Chem. Thermodyn. 2012, 54, 250. DOI: http://dx.doi.org/10.1016/j.jct.2012.04.009
9. Lian, B.; Yalkowsky, S. H.; Ind. Eng. Chem. Res. 2012, 51, 16750. DOI: http://dx.doi.org/10.1021/ie302574y
10. Junkes, B. da S.; Amboni, R. D. de M. C.; Yunes, R. A.; Heinzen, V. E. F.; J. Braz. Chem. Soc. 2004, 15, 183.
11. Gutman, I.; Tosovic, J.; J. Serb. Chem. Soc. 2013, 78, 805. DOI: http://dx.doi.org/10.2298/JSC121002134G
12. Liu, S. S.; Liu, H. L.; Xia, Z. N.; Cao, C. Z.; Li, Z. L.; J. Chem. Inf. Comput. Sci. 1999, 39, 951. DOI: http://dx.doi.org/10.1021/ci980097x
13. Yin, C. S.; Guo, W. M.; Lin, T.; Liu, S. S.; Fu, R. Q.; Pan, Z. X.; Wang, L. S.; J. Chin. Chem. Soc. 2001, 148, 739. DOI: http://dx.doi.org/10.1002/jccs.200100106
14. Liu, H. H.; Xiao, X.; Qin, J.; Liu, Y. M.; J. Chongqing Inst. Technol. (In Chinese) 2005, 19, 67.
15. Jiao, L.; Wang, X. F.; Bing, S.; Xue, Z. W.; Li, H.; J. Serb. Chem. Soc. 2014. DOI: http://dx.doi.org/10.2298/JSC140716087J
16. Jiao, L.; Xue, Z. W.; Wang, G. F.; Wang, X. F.; Li, H.; Chemom. Intell. Lab. Syst. 2014, 137, 91. DOI: http://dx.doi.org/10.1016/j.chemolab.2014.06.015
17. Jiao, L.; Wang, X. F.; Li, H.; Wang, Y. X.; J. Serb. Chem. Soc. 2014, 79, 965. DOI: http://dx.doi.org/10.2298/JSC130611152J
18. Mackay, D.; Shiu, W. Y.; Ma, K. C.; Lee, S. C.; Physical-Chemical Properties and Environmental Fate for Organic Chemicals, 2nd ed., CRC Press: Boca Raton, 2006.
19. Jiao, L.; Chemosphere 2010, 80, 671. DOI: http://dx.doi.org/10.1016/j.chemosphere.2010.04.013 PMID: 20452639
20. Galao, O. F.; Borsato, D.; Pinto, J. P.; Visentainer, J. V.; Carrao-Panizzi, M. C.; J. Braz. Chem. Soc. 2011, 1, 142. DOI: http://dx.doi.org/10.1590/S0103-50532011000100019
21. Fatemi, M. H.; Chahi, Z. G.; SAR QSAR Environ. Res. 2012, 23, 155. DOI: http://dx.doi.org/10.1080/1062936X.2011.645876 PMID: 22224473
22. Noorizadeh, H.; Farmany, A.; Noorizadeh, M.; Quim. Nova 2011, 34, 242. DOI: http://dx.doi.org/10.1590/S0100-40422011000200014
23. Nunes, C. A.; Lima, C. F.; Barbosa, L. C. A.; Quim. Nova 2011, 34, 279. DOI: http://dx.doi.org/10.1590/S0100-40422011000200020
24. Fatemi, M.; Ghorbannezhad, Z.; J. Serb. Chem. Soc. 2011, 76, 1003. DOI: http://dx.doi.org/10.2298/JSC101104091F
25. Jiao, L.; Li, H.; Chemom. Intell. Lab. Syst. 2010, 103, 90. DOI: http://dx.doi.org/10.1016/j.chemolab.2010.05.019
26. http://www.statsoft.com/textbook/neural-networks, accessed September, 2014.
27. Zhang, Y. X.; Li, H.; Hou, A. X.; Havel, J.; Chemom. Intell. Lab. Syst. 2006, 82, 165. DOI: http://dx.doi.org/10.1016/j.chemolab.2005.08.012
28. Yin, C. S.; Shen, Y.; Liu, S. S.; Yin, Q. S.; Guo, W. M.; Pan, Z. X.; Comput. Chem. 2001, 25, 239. DOI: http://dx.doi.org/10.1016/S0097-8485(00)00097-8 PMID: 11339406
29. Zhang, W. J.; Zhong, X. Q.; Liu, G. H.; Stochastic Environmental Research and Risk Assessment 2008, 22, 207. DOI: http://dx.doi.org/10.1007/s00477-007-0108-3
30. Martens, H. A.; Dardenne, P.; Chemom. Intell. Lab. Syst. 1998, 44, 99. DOI: http://dx.doi.org/10.1016/S0169-7439(98)00167-1
On-line version ISSN 1678-7064 Printed version ISSN 0100-4042
Química Nova
Publicações da Sociedade Brasileira de Química