
doi:10.5935/0100-4042.20150025

Sociedade Brasileira de Química

Predicting the boiling point of PCDD/Fs by the QSPR method based on the molecular distance-edge vector index

Long Jiao^I,*; Xiaofei Wang^I; Shan Bing^I; Zhiwei Xue^II; Hua Li^III

^ICollege of Chemistry and Chemical Engineering, Xi'an Shiyou University, Xi'an 710065, China
^IINo.203 Research Institute of Nuclear Industry, Xianyang, 712000, China
^IIICollege of Chemistry and Materials Science, Northwest University, Xi'an 710069, China

Received on 17/10/2014
Accepted on 17/12/2014
Published on the web on 05/03/2015

Correspondence address

*e-mail: mop@xsyu.edu.cn

ABSTRACT

The quantitative structure-property relationship (QSPR) for the boiling point (T_b) of polychlorinated dibenzo-p-dioxins and polychlorinated dibenzofurans (PCDD/Fs) was investigated. The molecular distance-edge vector (MDEV) index was used as the structural descriptor. The quantitative relationship between the MDEV index and T_b was modeled by multivariate linear regression (MLR) and by an artificial neural network (ANN). Leave-one-out cross validation and external validation were carried out to assess the prediction performance of the models. For the MLR method, the prediction root mean square relative error (RMSRE) of leave-one-out cross validation and external validation was 1.77 and 1.23, respectively. For the ANN method, the corresponding RMSRE values were 1.65 and 1.16. The results demonstrate a quantitative relationship between the MDEV index and the T_b of PCDD/Fs, and both MLR and ANN are practicable for modeling it. The developed MLR and ANN models were therefore used to predict the T_b of each PCDD/F.

Keywords: QSPR; molecular distance-edge vector index; PCDD/Fs; boiling point.

INTRODUCTION

Polychlorinated dibenzo-p-dioxins and polychlorinated dibenzofurans (PCDD/Fs) are two series of persistent organic pollutants that have been detected in almost all compartments of the global ecosystem. These chemicals have gained much attention because of their toxicity, environmental persistence, tendency to accumulate through the food chain, and risk to human health. PCDD/Fs are not produced intentionally and do not serve any useful purpose. They are formed as byproducts of many industrial and combustion processes. PCDD/Fs are semi-volatile compounds. After being released into the atmosphere, they are likely to transfer to other environmental compartments such as soil, water, sediments and their resident biota, where they can persist for years before degradation.^1-5

The boiling point (T_b) is an important property for studying the volatility of PCDD/Fs, which is correlated with the fate, transport, and transformation of PCDD/Fs in the environment. The boiling point is also a significant factor in determining physico-chemical properties of PCDD/Fs such as vapor pressure, octanol/water partition coefficient and aqueous solubility.^6-9 A quantitative study of the T_b is necessary to understand the environmental behavior of PCDD/Fs. Experimentally determining the T_b of PCDD/Fs remains difficult because of the complexity of the analytical methods, the high cost of experiments and the lack of standards.⁷ In addition, measuring the boiling point of PCDD/Fs is hazardous due to the high vapor pressures involved.⁶ To date, the T_b has not been experimentally determined for every PCDD/F congener.

The quantitative structure-property relationship (QSPR) method is safe, fast, convenient and cost-effective for predicting the properties of compounds. It is therefore worthwhile to develop an accurate and easy-to-use QSPR model for predicting the T_b of PCDD/Fs. A topological index is a kind of structural descriptor that is often used in QSPR research. It can efficiently describe the structure of a molecule without detailed molecular orbital calculations and, despite its mathematical simplicity, is able to differentiate molecules with different structures.^10,11 The aim of this work is to develop a QSPR model for the T_b of PCDD/Fs based on a topological index. The molecular distance-edge vector (MDEV) index^12-17 was used as the structural descriptor of PCDD/Fs. Multivariate linear regression (MLR) and a linear artificial neural network (L-ANN) were employed to model the quantitative relationship between the T_b and the MDEV index of PCDD/Fs.

EXPERIMENTAL

Data set

The MDEV index was calculated according to the approach presented in the following section. The MDEV indices of the 52 PCDD/Fs whose T_b values are known are listed in Table 1. The observed T_b values of these PCDD/Fs were taken from the literature^7,18 and are listed in Table 2.

Root mean square relative error (RMSRE) was used to indicate the prediction performance of the developed models. The RMSRE is defined as Equation 1:

where RE_i is the relative error of the ith sample, n is the number of samples, and T_b,pred and T_b,obs are the predicted and observed T_b, respectively.
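Written out from the definitions above, a consistent form of Equation 1 would be the following. This is a reconstruction under assumptions, not the authors' typesetting: in particular, the factor of 100 (relative error expressed in percent) is an assumption suggested by the magnitude of the RMSRE values reported in this work.

```latex
\mathrm{RMSRE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} RE_i^{2}},
\qquad
RE_i = \frac{T_{b,\mathrm{pred},i} - T_{b,\mathrm{obs},i}}{T_{b,\mathrm{obs},i}} \times 100
```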

MDEV index

To calculate the MDEV index of a molecule, the whole molecule is regarded as a topological graph. Each non-hydrogen atom is considered a point and each chemical bond is considered an edge. The relative electronegativity of each chlorine atom and benzene ring is defined as 1. Correspondingly, the MDEV index is defined as Equation 2:

In Equation 2, k and l denote the type of atom (k = 1 or l = 1 denotes the chlorine atom, and k = 2 or l = 2 denotes the benzene ring); i and j are the serial numbers of a chlorine atom or a benzene ring, where i belongs to the kth type and j to the lth type. The term d_ik,jl represents the shortest relative distance between the ith and jth atom. For example, d_i1,j1 indicates the shortest relative distance between the ith and jth chlorine atoms. The relative distance between two adjacent non-hydrogen atoms is defined as d = 1. According to Equation 2, the MDEV index of a PCDD/F molecule has three elements, M₁₁, M₁₂ and M₂₂. For instance, the MDEV index of 2, 3, 7-CDD is calculated as follows:

The MDEV index of 2, 4, 6-CDF is calculated as follows:
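As an illustration of the chlorine-chlorine element only, the following Python sketch counts the shortest bond paths between chlorine atoms on a dibenzo-p-dioxin skeleton and sums their inverse squares. It rests on several assumptions that are not confirmed by the text above: that M₁₁ takes the inverse-square-distance form used in the MDEV literature, that each pair is counted once, and that the distance is the number of bonds on the shortest path. The function names and atom labels are hypothetical. The treatment of the benzene ring (M₁₂ and M₂₂) is not specified here and is deliberately left out, so the result need not match the values in Table 1.

```python
from collections import deque
from itertools import combinations

# Heavy-atom skeleton of dibenzo-p-dioxin, standard ring numbering.
SKELETON_BONDS = [
    ("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "C4a"),
    ("C4a", "C10a"), ("C10a", "C1"),                 # first benzene ring
    ("C4a", "O5"), ("O5", "C5a"),                    # first ether bridge
    ("C5a", "C6"), ("C6", "C7"), ("C7", "C8"), ("C8", "C9"),
    ("C9", "C9a"), ("C9a", "C5a"),                   # second benzene ring
    ("C9a", "O10"), ("O10", "C10a"),                 # second ether bridge
]


def build_graph(chlorinated_positions):
    """Adjacency sets for the skeleton plus one Cl on each listed position."""
    bonds = list(SKELETON_BONDS)
    for pos in chlorinated_positions:
        bonds.append((f"C{pos}", f"Cl{pos}"))
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_distance(graph, start, goal):
    """Number of bonds on the shortest path (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("atoms are not connected")


def m11_sketch(chlorinated_positions):
    """ASSUMED form: sum over Cl-Cl pairs of 1 / d**2 (each pair counted once)."""
    graph = build_graph(chlorinated_positions)
    chlorines = [f"Cl{p}" for p in chlorinated_positions]
    return sum(
        1.0 / shortest_distance(graph, a, b) ** 2
        for a, b in combinations(chlorines, 2)
    )


if __name__ == "__main__":
    print(m11_sketch([2, 3, 7]))
```

Under these assumptions the three Cl-Cl paths in 2,3,7-CDD have lengths 3, 9 and 8 bonds, so the sketch returns 1/9 + 1/81 + 1/64.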

Artificial neural network

An ANN^14-17,19-29 is a multivariate calibration approach capable of modeling various complex functions. Its basic processing unit is the neuron (node). An ANN comprises a number of neurons organized in different layers. The linear artificial neural network (L-ANN)^25-29 is a kind of neural network that has no hidden layers, only an output layer of fully linear neurons (that is, neurons with a linear activation function). It is the simplest ANN and is usually used to develop linear models. It often serves as a benchmark against which the prediction performance of other methods is compared. Although a number of multivariable calibration problems cannot be solved, or cannot be solved well, by L-ANN, many others can. It is common to find that a problem perceived to be difficult and non-linear can actually be solved satisfactorily by L-ANN.

In an L-ANN, the neurons of the input and output layers are fully connected, while neurons within the same layer are not connected to each other. Figure 1 shows the basic architecture of the L-ANN.

Figure 1. The architecture of linear artificial neural network

In Figure 1, x_i (i = 1, 2, ..., n), y_j (j = 1, 2, ..., m) and w_ij are the input variables, the output variables and the elements of the connection weight matrix W, respectively. b_j is an element of the bias vector, which corresponds to the thresholds. The symbol f_act( ) denotes the activation function. Before the training procedure, the input and output variables are normalized. When the network is executed, it effectively multiplies the input variables by the weight matrix W and then adds the bias vector. Hence, the post synaptic potential (PSP) function of the neuron is described by Equation 5:

Generally, the activation function used in L-ANN is a linear function:
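Written out from the description above, Equations 5 and 6 would plausibly take the following form. This is a sketch under assumptions: the index convention for w_ij and the identity form of the linear activation function are assumed here, not taken from the source.

```latex
\begin{align}
\mathrm{PSP}_j &= \sum_{i=1}^{n} w_{ij}\,x_i + b_j \tag{5}\\
f_{act}(\mathrm{PSP}_j) &= \mathrm{PSP}_j \tag{6}
\end{align}
```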

Because the network contains no non-linear functions and no hidden neurons, L-ANN is good at solving linear problems. Training a linear network means finding the optimal value of the weight matrix W that minimizes the root mean squared error of the calibration set. To reach this goal, the known samples are divided into two subsets: a training set and a verification set. The network is trained by using the training set and is tested after each epoch by using the verification set. The training is terminated once the root mean squared error of the verification set begins to deteriorate. Over-fitting and over-learning are avoided in this way. Although the verification set is used to find the best network setting, the training algorithm does not use the verification set to adjust the network weights. The standard pseudo-inverse linear optimization algorithm²⁶ is usually used to train the network. This algorithm uses the singular value decomposition technique to calculate the pseudo-inverse of the matrix needed to set the weights in the linear output layer, so as to find the least mean squared solution. Essentially, it is guaranteed to find the optimal setting for the weight matrix in a linear layer.
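The pseudo-inverse step described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the MATLAB subroutines used in this work: the function names are hypothetical, a single output neuron is assumed, and the normalization of variables and the monitoring of a verification set are omitted. `numpy.linalg.pinv` computes the pseudo-inverse through a singular value decomposition.

```python
import numpy as np


def train_linear_network(X, y):
    """Set the weights and bias of one linear output neuron by pseudo-inverse.

    X has shape (n_samples, n_inputs); y has shape (n_samples,).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the bias is estimated with the weights.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    params = np.linalg.pinv(A) @ y   # least-squares solution via SVD
    return params[:-1], params[-1]   # (weights, bias)


def run_linear_network(X, weights, bias):
    """Linear activation: the output equals the post synaptic potential."""
    return np.asarray(X, dtype=float) @ weights + bias
```

For a 3-1 network, `X` would hold the three MDEV elements of each compound and `y` the corresponding T_b values.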

The main difference between MLR and L-ANN lies in the optimization algorithm. In MLR, the least-squares algorithm seeks the minimal sum of squared residuals of the training set. In L-ANN, the training procedure aims to minimize the root mean squared error of the verification set.²⁶ Thus, the prediction ability of L-ANN is usually better than that of MLR.

Leave-one-out cross-validation

Leave-one-out cross-validation^15-17,30 is a commonly used algorithm for estimating the predictive performance and robustness of a multivariable calibration model. Practical calibration experiments usually have to be based on a limited set of available samples. The idea behind leave-one-out cross validation is to predict the property value of each sample in turn with a calibration model developed from the other samples. When the algorithm is applied to a dataset of n samples, the calibration modeling is performed n times, each time using (n-1) samples for modeling and one sample for testing. Hence, the procedure can be divided into n segments. In each segment i (i = 1, ..., n), there are three steps: (1) taking sample i out as a temporary 'test set', which is not used to establish the calibration model; (2) developing a calibration model with the remaining (n-1) samples; (3) testing the established model with sample i, then computing and storing the prediction error of that sample. The advantage of leave-one-out cross validation over random sub-sampling is that each sample is used for validation exactly once. Although leave-one-out cross-validation is an effective and commonly used method, it still carries the risk of overestimating the predictive performance and robustness of a model. It is therefore common to use two or more validation methods when evaluating a calibration model, which effectively lessens the risk of overestimation.
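The three steps above can be sketched as follows in Python with NumPy. This is an illustrative sketch, not the authors' code: the function names are hypothetical, an ordinary least-squares model with an intercept stands in for the calibration model, and the RMSRE is assumed to be expressed in percent.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [coefficients..., intercept]."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return X @ coef[:-1] + coef[-1]


def leave_one_out_predictions(X, y):
    """Predict each sample with a model built from the other n-1 samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # step 1: set sample i aside
        coef = fit_mlr(X[keep], y[keep])                  # step 2: model from the rest
        preds[i] = predict_mlr(coef, X[i:i + 1])[0]       # step 3: predict sample i
    return preds


def rmsre_percent(y_pred, y_obs):
    """Root mean square relative error, ASSUMED to be expressed in percent."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    rel = (y_pred - y_obs) / y_obs * 100.0
    return float(np.sqrt(np.mean(rel ** 2)))
```

Calling `rmsre_percent(leave_one_out_predictions(X, y), y)` then gives a single cross-validation figure of the kind reported in the Results section.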

External validation

External validation^17,25,30 is an algorithm that is often used to assess the predictive ability of a calibration model. The working dataset is split into two subsets: a calibration set, which is used to develop the calibration model, and a test set, which is used to assess the predictive ability of the developed model. The test set is designed to give an independent assessment of the predictive performance of the model. It is not used in developing the model at all, and hence is independent of the calibration set. Generally, the calibration set and test set are randomly selected from the working dataset.
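A random split of this kind can be sketched in Python as below. The function name and the seed argument are hypothetical and serve only to make the illustration reproducible; the source does not state how the random selection was implemented.

```python
import numpy as np


def split_calibration_test(n_samples, n_test, seed=0):
    """Randomly partition sample indices into a calibration set and a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    test_idx = np.sort(order[:n_test])
    calibration_idx = np.sort(order[n_test:])
    return calibration_idx, test_idx
```

For example, `split_calibration_test(52, 10)` yields 42 calibration indices and 10 test indices with no overlap; the model is then fitted on the first group only and evaluated on the second.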

Software

All calculations were performed with subroutines developed in MATLAB (Ver. 7.0) on a personal computer equipped with an i5-2450M processor.

RESULTS AND DISCUSSION

The MDEV indices of the PCDD/Fs were calculated and are listed in Table 1 and Table 3. The MDEV index differs clearly from one PCDD/F molecule to another, which demonstrates that it can describe the structural differences among these compounds. It is therefore reasonable to use the MDEV index as the structural descriptor for developing the QSPR model of PCDD/Fs.

MLR model

Generally, a simple model should be chosen in preference to a complex one if the latter does not fit the data better. Thus, we first investigated whether MLR is able to model the quantitative relationship between the MDEV index and the T_b of PCDD/Fs. The MDEV index was used as the independent variable and the T_b as the dependent variable to develop the regression model. To assess the predictive ability of the developed model, two validation methods, leave-one-out cross validation and external validation, were conducted. The 52 samples shown in Table 1 were randomly split into two groups: Group I, which comprises 42 samples, and Group II, which comprises 10 samples.

Leave-one-out cross validation was applied to Group I. The result is presented in Table 2. As shown in Table 2, the predicted T_b is consistent with the observed T_b. For the 42 compounds, the RMSRE of prediction is 1.77. Moreover, the predicted T_b values were plotted versus the observed T_b values (Figure 2a), and the plot shows a linear relationship (y = 0.9904 x + 7.9881 with R = 0.9819) between the predicted and observed T_b.

Subsequently, external validation was carried out to further assess the predictive ability of the MLR model. In this procedure, the model was established by using all 42 compounds in Group I as the calibration set. The obtained regression equation is: T_b = -60.50 M₁₁ + 35.78 M₁₂ - 2.14 M₂₂ + 580.20. The R², standard error of the estimate (S.E.) and F value of the regression model are 0.9672, 10.426 and 368.8, respectively. The value of F [F > F_0.01 (N, N-3)] indicates that the MDEV index is significant to T_b, so it is reasonable to develop a regression model between the MDEV index and T_b. The S.E. is significantly smaller than the sample mean of T_b, which shows that the obtained regression equation fits the data well. Then, the T_b of the samples in Group II was predicted by using the obtained regression equation. The prediction result is shown in Table 2. As shown in the table, the predicted T_b is in good accordance with the observed T_b. For the 10 compounds, the prediction RMSRE is 1.23. The plot of predicted T_b versus observed T_b is shown in Figure 2a, which shows a linear relationship (y = 0.9957 x + 0.9347 with R = 0.9864) between the predicted and observed T_b.

Figure 2. Observed T_b versus the predicted T_b of the: (a) MLR model; (b) L-ANN model

The results of the leave-one-out cross validation and external validation demonstrate that the MDEV index of the investigated PCDD/Fs is quantitatively related to their T_b. In previous studies, the MDEV index has been used only as the structural descriptor for QSPR models of compounds that share the same basic structure, such as the boiling point model of alcohols,¹³ the gas/particle partition coefficient model of PCBs,¹⁷ etc.^14-16 The basic structure of polychlorinated dibenzo-p-dioxins is different from that of polychlorinated dibenzofurans. Thus, it is shown that the MDEV index can be used as the structural descriptor to establish a QSPR model for compounds with different basic structures. In addition, the validation results demonstrate that MLR is practicable for modeling the quantitative relationship between the MDEV index and the T_b of PCDD/Fs, so a linear QSPR model based on the MDEV index is able to predict the T_b of PCDD/Fs. An MLR model was therefore developed by using all 52 PCDD/Fs listed in Table 1. The obtained regression equation is: T_b = -59.68 M₁₁ + 35.49 M₁₂ - 4.99 M₂₂ + 583.46. The R², S.E. and F value of the regression model are 0.9679, 10.81 and 477.1, respectively. The T_b of the other 53 PCDDs and 107 PCDFs was then predicted by using this regression equation. The result is shown in Table 3. The T_b values of these PCDD/Fs have not yet been determined experimentally. Thus, our prediction results can be used as estimates of the T_b of these compounds.
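Applying the regression equation fitted on all 52 compounds is a direct evaluation, sketched here in Python. The coefficients are those quoted above; the function name is hypothetical, and the MDEV elements passed in must come from Table 1 or Table 3 or be calculated separately.

```python
def predict_tb_mlr(m11, m12, m22):
    """T_b from the MLR equation fitted on all 52 PCDD/Fs (coefficients as quoted)."""
    return -59.68 * m11 + 35.49 * m12 - 4.99 * m22 + 583.46
```

The result is in the same units as the T_b values used to fit the model.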

L-ANN model

L-ANN is another commonly used linear calibration method in QSPR studies. Thus, we investigated whether a better model can be established by using L-ANN. A 3-1 L-ANN (i.e. 3 input variables and 1 output variable in the network) was used to develop the calibration model. The MDEV index and T_b were used as the input and output variables, respectively. In each run of the ANN, ten samples were randomly selected and used as the verification set. Leave-one-out cross validation and external validation were carried out to assess the prediction performance of the developed model. Group I was again used for the leave-one-out cross validation. The result is listed in Table 2. As shown in the table, the predicted T_b is in good agreement with the observed T_b. For the 42 compounds, the RMSRE of prediction is 1.65. The predicted T_b values were plotted versus the observed T_b values (Figure 2b), and the plot shows a linear relationship (y = 0.9893 x + 9.0119 with R = 0.9847) between the predicted and observed T_b. Then, all 52 samples were used for the external validation. An L-ANN model was developed by using the 42 samples of Group I as the calibration set. In the training procedure, the verification set comprised ten randomly selected samples. The T_b of the samples in Group II was predicted by using the obtained network. The result of the external validation is also shown in Table 2. Again, the predicted T_b is in good agreement with the observed T_b. For the ten samples, the prediction RMSRE is 1.16. The plot of predicted T_b versus observed T_b (Figure 2b) shows a linear relationship (y = 0.9966 x + 0.9747 with R = 0.9875) between the predicted and observed T_b. The prediction accuracy of the L-ANN model is thus slightly higher than that of the MLR model, and L-ANN is slightly better than MLR at modeling the quantitative relationship between the MDEV index and the T_b of PCDD/Fs.

These results demonstrate that L-ANN is a practicable and promising method for predicting the T_b of PCDD/Fs. Thus, a 3-1 L-ANN model was developed by using all 52 PCDD/Fs listed in Table 1. In the training procedure, 13 samples were randomly selected and used as the verification set. The T_b of the other 53 PCDDs and 107 PCDFs was then predicted by using this model. The result is also listed in Table 3. This prediction can also serve as an estimate of the T_b of these compounds and, given the validation results, is expected to be slightly more accurate than the prediction of the MLR model.

CONCLUSIONS

A QSPR model for predicting the boiling point of PCDD/Fs was investigated. The MDEV index was used as the structural descriptor of PCDD/Fs. Both an MLR model and an L-ANN model were developed, and their predictive ability was assessed by leave-one-out cross validation and external validation. The validation results demonstrate that the MDEV index of PCDD/Fs is quantitatively related to their T_b and that both MLR and L-ANN are practicable for modeling this relationship. Because the MDEV index can be calculated easily, it is convenient to develop a QSPR model for the T_b of PCDD/Fs based on it. It is reasonable to predict the T_b of PCDD/Fs by using the established models. Thus, the T_b of each PCDD/F congener was predicted by using the developed models. The predicted T_b can be used as an estimate of the boiling point of PCDD/Fs.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (No. 21305108 and No. 21375105), the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2014JM2039) and the Innovative Research Team of Xi'an Shiyou University (No. 2013QNKYCXTD01).

REFERENCES

1. Kjeller, L.; Jones, K. C.; Ton, A. E. J.; Rappe, C.; Environ. Sci. Technol. 1996, 30, 1398. DOI: http://dx.doi.org/10.1021/es950708r

2. Moon, M. H.; Kim, H. J.; Kwon, S. Y.; Lee, S. J.; Chang, Y. S.; Lim, H.; Anal. Chem. 2004, 76, 3236. DOI: http://dx.doi.org/10.1021/ac049968u PMID: 15167807

3. McLachlan, M. S.; Sewart, A. P.; Bacon, J. R.; Jones, K. C.; Environ. Sci. Technol. 1996, 30, 2567. DOI: http://dx.doi.org/10.1021/es950932g

4. Atkinson, J. D.; Hung, P. C.; Zhang, Z.; Chang, M. B.; Yan, Z.; Rood, M. J.; Chemosphere 2015, 118, 136. DOI: http://dx.doi.org/10.1016/j.chemosphere.2014.07.055 PMID: 25150825

5. Do, L.; Liljelind, P.; Zhang, J.; Haglund, P.; J. Chromatogr. A 2013, 1311, 157. DOI: http://dx.doi.org/10.1016/j.chroma.2013.08.070 PMID: 24016718

6. Rordorf, B. F.; Chemosphere 1989, 18, 783. DOI: http://dx.doi.org/10.1016/0045-6535(89)90230-0

7. Admire, B.; Lian, B.; Yalkowsky, S. H.; Chemosphere 2014, DOI: http://dx.doi.org/10.1016/j.chemosphere.2014.06.053

8. Lian, B.; Yalkowsky, S. H.; J. Chem. Thermodyn. 2012, 54, 250. DOI: http://dx.doi.org/10.1016/j.jct.2012.04.009

9. Lian, B.; Yalkowsky, S. H.; Ind. Eng. Chem. Res. 2012, 51, 16750. DOI: http://dx.doi.org/10.1021/ie302574y

10. Junkes, B. da S.; Amboni, R. D. de M. C.; Yunes, R. A.; Heinzen, V. E. F.; J. Braz. Chem. Soc. 2004, 15, 183.

11. Gutman, I.; Tosovic, J.; J. Serb. Chem. Soc. 2013, 78, 805. DOI: http://dx.doi.org/10.2298/JSC121002134G

12. Liu, S. S.; Liu, H. L.; Xia, Z. N.; Cao, C. Z.; Li, Z. L.; J. Chem. Inf. Comput. Sci. 1999, 39, 951. DOI: http://dx.doi.org/10.1021/ci980097x

13. Yin, C. S.; Guo, W. M.; Lin, T.; Liu, S. S.; Fu, R. Q.; Pan, Z. X.; Wang, L. S.; J. Chin. Chem. Soc. 2001, 148, 739. DOI: http://dx.doi.org/10.1002/jccs.200100106

14. Liu, H. H.; Xiao, X.; Qin, J.; Liu, Y. M.; J. Chongqing Inst. Technol. (In Chinese) 2005, 19, 67.

15. Jiao, L.; Wang, X. F.; Bing, S.; Xue, Z. W.; Li, H.; J. Serb. Chem. Soc. 2014, DOI: http://dx.doi.org/10.2298/JSC140716087J

16. Jiao, L.; Xue, Z. W.; Wang, G. F.; Wang, X. F.; Li, H.; Chemom. Intell. Lab. Syst. 2014, 137, 91. DOI: http://dx.doi.org/10.1016/j.chemolab.2014.06.015

17. Jiao, L.; Wang, X. F.; Li, H.; Wang, Y. X.; J. Serb. Chem. Soc. 2014, 79, 965. DOI: http://dx.doi.org/10.2298/JSC130611152J

18. Mackay, D.; Shiu, W. Y.; Ma, K. C.; Lee, S. C.; Physical-Chemical Properties and Environmental Fate for Organic Chemicals, 2nd ed., CRC Press: Boca Raton, 2006.

19. Jiao, L.; Chemosphere 2010, 80, 671. DOI: http://dx.doi.org/10.1016/j.chemosphere.2010.04.013 PMID: 20452639

20. Galao, O. F.; Borsato, D.; Pinto, J. P.; Visentainer, J. V.; Carrao-Panizzi, M. C.; J. Braz. Chem. Soc. 2011, 1, 142. DOI: http://dx.doi.org/10.1590/S0103-50532011000100019

21. Fatemi, M. H.; Chahi, Z. G.; SAR QSAR Environ. Res. 2012, 23, 155. DOI: http://dx.doi.org/10.1080/1062936X.2011.645876 PMID: 22224473

22. Noorizadeh, H.; Farmany, A.; Noorizadeh, M.; Quim. Nova 2011, 34, 242. DOI: http://dx.doi.org/10.1590/S0100-40422011000200014

23. Nunes, C. A.; Lima, C. F.; Barbosa, L. C. A.; Quim. Nova 2011, 34, 279. DOI: http://dx.doi.org/10.1590/S0100-40422011000200020

24. Fatemi, M.; Ghorbannezhad, Z.; J. Serb. Chem. Soc. 2011, 76, 1003. DOI: http://dx.doi.org/10.2298/JSC101104091F

25. Jiao, L.; Li, H.; Chemom. Intell. Lab. Syst. 2010, 103, 90. DOI: http://dx.doi.org/10.1016/j.chemolab.2010.05.019

26. http://www.statsoft.com/textbook/neural-networks, accessed September, 2014.

27. Zhang, Y. X.; Li, H.; Hou, A. X.; Havel, J.; Chemom. Intell. Lab. Syst. 2006, 82, 165. DOI: http://dx.doi.org/10.1016/j.chemolab.2005.08.012

28. Yin, C. S.; Shen, Y.; Liu, S. S.; Yin, Q. S.; Guo, W. M.; Pan, Z. X.; Comput. Chem. 2001, 25, 239. DOI: http://dx.doi.org/10.1016/S0097-8485(00)00097-8 PMID: 11339406

29. Zhang, W. J.; Zhong, X. Q.; Liu, G. H.; Stochastic Environmental Research and Risk Assessment 2008, 22, 207. DOI: http://dx.doi.org/10.1007/s00477-007-0108-3

30. Martens, H. A.; Dardenne, P.; Chemom. Intell. Lab. Syst. 1998, 44, 99. DOI: http://dx.doi.org/10.1016/S0169-7439(98)00167-1

On-line version ISSN 1678-7064 Printed version ISSN 0100-4042
Química Nova
Publicações da Sociedade Brasileira de Química