Description of normalization methods

Auto Scaling (Unit Variance Scaling, UV) is one of the simplest methods for adjusting metabolic variances; it scales metabolic signals by the standard deviation of the metabolomics data1. This method scales all metabolites to unit variance, so that all metabolites become equally important and comparably scaled2. After auto scaling, the standard deviation of every metabolite is one and the data are analyzed on the basis of correlations1. The disadvantage of auto scaling is that analytical errors may be amplified due to dilution effects1. Auto scaling has been used to improve the diagnosis of bladder cancer using gas sensor arrays3 and to identify urinary nucleoside markers from urogenital cancer patients by mass spectrometry (MS)-based metabolomics4.
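The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes a hypothetical layout in which each row of `data` is a sample and each column is a metabolite, that the data are mean-centered before division (a common convention that the text does not state explicitly), and that no metabolite is constant across samples. The name `auto_scale` is illustrative only.

```python
from statistics import mean, stdev


def auto_scale(data):
    """Center each metabolite (column) and divide by its sample standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # assumes no column is constant
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in data]


# Two metabolites on very different scales end up on the same scale.
scaled = auto_scale([[1, 10], [2, 20], [3, 30]])
```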

CCMN (Cross-Contribution Compensating Multiple Standard Normalization, CRMN) is applicable to monitor systematic error from randomized and designed experiments using multiple internal standards5. CCMN compensates for systematic cross-contribution effects that can be traced back to a linear association with experimental design5, and is superior at purifying the signal of interest using multiple internal standards5. But care needs to be taken when normalizing the data using the factors of interest prior to carrying out unsupervised analysis6. CCMN is mainly aimed at MS-based metabolomics data and its inclusion will improve the precision of current metabolite profiling protocols7.

Contrast (Contrast Normalization) derives from the integration of MA-plots and logged Bland-Altman plots, and assumes the presence of non-linear biases1. The input data are logged and transformed into a contrast space by means of an orthonormal transformation matrix1. Because the method uses a log function, zeros and negative numbers cannot be processed directly, and non-positive numbers must first be converted to an extremely small positive value1. The contrast method has been applied to oligonucleotide arrays to normalize feature intensities8 and has also been employed in MS-based metabolic profiling to reveal the role of polychlorinated biphenyls in non-alcoholic fatty liver disease9.

Cubic Splines is one of the non-linear baseline methods, which assume the existence of non-linear relationships between the baseline and the individual spectra1. Like quantile normalization, cubic splines aims to make the distribution of the metabolite concentrations similar across all samples10. The geometric or arithmetic mean of the concentrations of each metabolite across all samples is regarded as the baseline sample10. A set of evenly distributed quantiles from both the baseline and the target samples is used to fit a smooth cubic spline10. Finally, a spline function generator uses the generated set of interpolated splines to fit the parameters of a natural cubic spline10. Cubic splines has been adopted to reduce variability in DNA microarray experiments by normalizing all signal channels to a target array11. Moreover, it has been applied in MS-based metabolomics profiling to improve the comprehensiveness of global metabolic profiling of body fluids12.

Similar to contrast normalization, Cyclic Loess (Cyclic Locally Weighted Regression) also originates from the combination of the MA-plot and the logged Bland-Altman plot and assumes the existence of non-linear bias1; it can estimate a regression surface using a multivariate smoothing procedure13. However, cyclic loess is one of the most time-consuming normalization methods, and the amount of time grows exponentially as the number of samples increases14. Cyclic loess has been applied in MS-based metabolomics profiling, where it was able to remove the systematic effect15.

EigenMS removes bias of unknown complexity from Liquid Chromatography coupled with Mass Spectrometry (LC/MS)-based metabolomics data, allowing for increased sensitivity in differential analysis. EigenMS normalization aims at preserving the original differences while removing the bias from the data16. It works in three steps17: (1) EigenMS preserves the true differences in the metabolomics data by estimating treatment effects with an ANOVA model; (2) singular value decomposition of the residuals matrix is used to determine bias trends in the data; (3) the number of bias trends is estimated via a permutation test and the effects of the bias trends are eliminated. EigenMS has been applied in MS-based quantitative label-free proteomics profiling16 and MS-based metabolomics analysis17.

Level Scaling transforms metabolic signal variation into variation relative to the average metabolic signal by scaling according to the mean signal, so the resulting values are percentage changes compared to the mean concentration18. This method is especially suitable when large relative variations are of great interest (e.g., when studying stress responses)18. Level scaling is used for the identification of biomarkers with a focus on relative response, but its disadvantage is the inflation of measurement errors18. Level scaling has been used to identify urinary nucleoside markers from urogenital cancer patients in MS-based metabolomics analysis4.
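A minimal Python sketch of level scaling follows. It assumes that rows of `data` are samples and columns are metabolites, that each value is centered on the metabolite mean before division by that mean, and that no metabolite has a mean of zero. The name `level_scale` is hypothetical.

```python
from statistics import mean


def level_scale(data):
    """Express each value as its deviation from the metabolite mean, relative to that mean."""
    means = [mean(col) for col in zip(*data)]  # assumes no mean is zero
    return [[(x - m) / m for x, m in zip(row, means)] for row in data]


# A value of -0.5 means "50% below the mean concentration of that metabolite".
scaled = level_scale([[1, 10], [3, 30]])
```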

Linear Baseline (Linear Baseline Scaling) maps each spectrum to the baseline based on the assumption of a constant linear relationship between each feature of a given spectrum and the baseline1. The baseline is the median of each feature across all spectra, and the scaling factor is computed as the ratio of the mean intensity of the baseline to the mean intensity of each spectrum1. The intensities of each spectrum are then multiplied by its particular scaling factor1. However, the assumption of a linear correlation among sample spectra may be oversimplified1. This method has been used to identify differential metabolomics profiles among five different senescence stages of banana19. Moreover, linear baseline scaling has been applied to normalize nuclear magnetic resonance (NMR)-based metabolomics data20 and MS-based metabolomics data15.
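The baseline and scaling factor described above can be sketched as follows. This is an illustrative sketch only; it assumes each row of `spectra` is one spectrum and each column one feature, and the name `linear_baseline_scale` is hypothetical.

```python
from statistics import mean, median


def linear_baseline_scale(spectra):
    """Scale each spectrum by mean(baseline) / mean(spectrum)."""
    # Baseline: median of each feature across all spectra.
    baseline = [median(col) for col in zip(*spectra)]
    baseline_mean = mean(baseline)
    return [[x * baseline_mean / mean(row) for x in row] for row in spectra]


# Three spectra that differ only by a constant factor collapse onto the baseline.
scaled = linear_baseline_scale([[1, 2, 3], [2, 4, 6], [4, 8, 12]])
```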

Log Transformation converts skewed metabolomics data to symmetric data by a non-linear transformation18. This method converts multiplicative relationships between metabolites into additive ones18. Log transformation removes heteroscedasticity perfectly when the relative standard deviation is constant18. Its disadvantage is that it is unable to deal with the value zero18. Furthermore, its effect on values with a large relative analytical standard deviation is problematic18. Log transformation has been used to compare plasma amino acid patterns in LC/MS-based metabolomics analysis21, and it has been applied to normalize the data in metabolomics analysis based on gas chromatography coupled with mass spectrometry (GC/MS)22.
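The following Python sketch illustrates both properties mentioned above: the change from multiplicative to additive relationships, and the failure on zero. The choice of base 2 and the name `log_transform` are assumptions made for illustration.

```python
import math


def log_transform(data):
    """Apply log2 to every value; reject zeros and negative values explicitly."""
    for row in data:
        if any(x <= 0 for x in row):
            raise ValueError("log transformation is undefined for non-positive values")
    return [[math.log2(x) for x in row] for row in data]


# A two-fold change (multiplicative) becomes a difference of one (additive).
logged = log_transform([[1, 2], [4, 8]])
```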

Mean Normalization normalizes the data by the mean value of all signals to eliminate background effects23. The intensity of each metabolite in a given sample is divided by the mean intensity of all variables in that sample24. In order to make the samples comparable, the means of the intensities for each experimental run are forced to be equal to one another15. For example, each sample is scaled such that the mean of all abundances in the sample equals one24. This method has been applied to normalize MS-based metabolomics data15.

Median Normalization is based on the assumption that the samples of a dataset are separated by a constant. It scales the samples so that they have the same median. For example, each sample can be scaled so that the median of its metabolite abundances equals one25. Median normalization, a commonly used method that needs no internal standards, is more practical than sum normalization, especially in situations where several saturated abundances may be associated with some of the factors of interest25. Median normalization has previously been used in MS-based proteomics analysis26 and metabolomics analysis15.
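Mean and median normalization differ only in the per-sample statistic used as the divisor, so one sketch can cover both. The code below is illustrative; it assumes each row of `data` is a sample with a non-zero mean or median, and the name `scale_samples` and its `center` parameter are hypothetical.

```python
from statistics import mean, median


def scale_samples(data, center=median):
    """Divide every value in a sample (row) by that sample's mean or median."""
    return [[x / center(row) for x in row] for row in data]


data = [[1, 2, 3], [10, 20, 60]]
by_median = scale_samples(data, center=median)  # every sample median becomes 1
by_mean = scale_samples(data, center=mean)      # every sample mean becomes 1
```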

MSTUS (MS Total Useful Signal) utilizes the total signal of the metabolites that are shared by all samples, assuming that the numbers of increased and decreased metabolic signals are roughly equivalent27, 28. Using MSTUS, the concentration of each metabolite is divided by the sum of the concentrations of all the measured metabolites in a given sample10. However, the validity of this hypothesis is questionable, since an increase in the concentration of one metabolite may not necessarily be accompanied by a decrease in that of another metabolite28, 29. MSTUS is a more recent technique, typically used to normalize NMR-based metabolomics data30 and LC/MS-based metabolomics data11.
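A rough Python sketch of the idea is given below. It assumes that rows of `data` are samples, that a value of zero marks a metabolite that was not detected in a sample, and that the "useful" signal is the sum over metabolites detected in every sample. These conventions and the name `mstus_normalize` are assumptions for illustration.

```python
def mstus_normalize(data):
    """Divide each sample by its total signal over metabolites shared by all samples."""
    # Columns with a positive value in every sample are treated as shared.
    shared = [j for j, col in enumerate(zip(*data)) if all(x > 0 for x in col)]
    if not shared:
        raise ValueError("no metabolite is detected in all samples")
    normalized = []
    for row in data:
        useful_signal = sum(row[j] for j in shared)
        normalized.append([x / useful_signal for x in row])
    return normalized


# The third metabolite is missing from the first sample, so it is excluded from the totals.
normalized = mstus_normalize([[2, 2, 0], [4, 4, 8]])
```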

NOMIS (Normalization using Optimal Selection of Multiple Internal Standards) finds an optimal normalization factor to remove unwanted systematic variation using variability information from multiple internal standard compounds31. The NOMIS method can select the best combinations of standard compounds for normalization using multiple linear regression31 and remove all correlations with the standards5. This method has a superior ability to reduce variability across the full spectrum of metabolites31. Moreover, the NOMIS method can be used in both supervised and unsupervised analysis6. The NOMIS method has been used to normalize LC/MS-based metabolomics data31.

Pareto Scaling uses the square root of the standard deviation of the data as the scaling factor32. Because the divisor is the square root of the standard deviation rather than the standard deviation itself, Pareto scaling reduces the weight of large fold changes in metabolite signals, but less strongly than auto scaling1. The dominant weight of extremely large fold changes may therefore remain1, so the disadvantage of Pareto scaling is its sensitivity to large fold changes18. Pareto scaling has been used to reduce the mask effect from abundant metabolites in an LC/MS-based metabolomics dataset33.
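A minimal Python sketch of Pareto scaling is shown below. It assumes that rows of `data` are samples and columns are metabolites, that the data are mean-centered before division, and that no metabolite is constant. The name `pareto_scale` is hypothetical.

```python
import math
from statistics import mean, stdev


def pareto_scale(data):
    """Center each metabolite and divide by the square root of its standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    roots = [math.sqrt(stdev(col)) for col in columns]  # assumes stdev > 0
    return [[(x - m) / r for x, m, r in zip(row, means, roots)] for row in data]


# After scaling, a metabolite with standard deviation 10 still has standard
# deviation sqrt(10), so high-variance metabolites keep part of their weight.
scaled = pareto_scale([[1, 10], [2, 20], [3, 30]])
```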

Power Scaling aims at correcting for heteroscedasticity and at pseudo scaling18. Power scaling shows a transformation pattern similar to that of the log transformation, but it is not able to make multiplicative effects additive18. Unlike log transformation, power scaling can handle zero values18. Power scaling reduces heteroscedasticity without problems with small values, but its disadvantage is that the choice of the square root is arbitrary18. Power scaling has been used to study the serum amino acid profiles and their variations in colorectal cancer patients in MS-based metabolomics34.
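As a small illustration, the square-root form of the transformation can be sketched as below. The `exponent` parameter and the name `power_transform` are hypothetical; the default of 0.5 reflects the square root mentioned above, and the sketch assumes non-negative input values.

```python
def power_transform(data, exponent=0.5):
    """Raise every value to the given power (square root by default)."""
    for row in data:
        if any(x < 0 for x in row):
            raise ValueError("power transformation here assumes non-negative values")
    return [[x ** exponent for x in row] for row in data]


# Unlike the log transformation, zeros pass through unchanged.
transformed = power_transform([[0, 4], [9, 16]])
```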

PQN (Probabilistic Quotient Normalization) transforms the metabolomics spectra according to an overall estimation of the most probable dilution35. This algorithm has been reported to be significantly more robust and accurate than the integral and the vector length normalizations35. The PQN procedure has three steps1: (1) perform an integral normalization of each spectrum, then select a reference spectrum such as the median spectrum; (2) for each variable, calculate the quotient between a given test spectrum and the reference spectrum, then take the median of all quotients; (3) divide all variables of the test spectrum by the median quotient. PQN is a robust method to account for dilution of complex biological mixtures in NMR metabolomics analysis35. Recently, PQN has also been used to reduce unwanted variance in a direct infusion MS metabolomics dataset36.
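The three steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each row of `spectra` is one spectrum, the median spectrum is taken as the reference, and variables whose reference value is zero are skipped when computing quotients. The name `pqn_normalize` is hypothetical.

```python
from statistics import median


def pqn_normalize(spectra):
    """Probabilistic quotient normalization with the median spectrum as reference."""
    # Step 1: integral normalization, then the median spectrum as reference.
    integral = [[x / sum(row) for x in row] for row in spectra]
    reference = [median(col) for col in zip(*integral)]
    normalized = []
    for row in integral:
        # Step 2: quotient per variable, then the median of all quotients.
        quotients = [x / r for x, r in zip(row, reference) if r > 0]
        dilution = median(quotients)
        # Step 3: divide the whole spectrum by the median quotient.
        normalized.append([x / dilution for x in row])
    return normalized


# The second sample has one strongly elevated metabolite, which distorts integral
# normalization; PQN still aligns its other three metabolites with the first sample.
result = pqn_normalize([[1, 2, 3, 4], [1, 2, 3, 14], [2, 4, 6, 8]])
```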

Quantile (Quantile Normalization) aims at achieving the same distribution of metabolic feature intensities across all samples; a quantile-quantile plot is used to visualize the similarity of the distributions1. The method is motivated by the idea that two data vectors have the same distribution if their quantile-quantile plot is a straight diagonal line37. Although a common, non-data-driven distribution could be used with quantile normalization, an agreed standard for it could not be reached37. Quantile normalization has been adopted for high density oligonucleotide array data37, for improving NMR-based metabolomics analysis1 and for reducing non-biological systematic variation in LC/MS-based metabolomics data38.
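One common way to force identical distributions is to replace each value by the mean of the values that share its rank across samples. The sketch below follows that approach. It assumes rows of `data` are samples with the same number of features and no missing values, and it does not treat ties specially. The name `quantile_normalize` is hypothetical.

```python
from statistics import mean


def quantile_normalize(data):
    """Give every sample (row) the same distribution: the mean of the sorted samples."""
    n_features = len(data[0])
    sorted_rows = [sorted(row) for row in data]
    # Target distribution: mean across samples at each rank.
    rank_means = [mean(r[k] for r in sorted_rows) for k in range(n_features)]
    normalized = []
    for row in data:
        order = sorted(range(n_features), key=row.__getitem__)
        new_row = [0.0] * n_features
        for rank, j in enumerate(order):
            new_row[j] = rank_means[rank]
        normalized.append(new_row)
    return normalized


# Both samples end up with the same set of values, in their original rank order.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 9]])
```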

Range Scaling is applied to put all measured intensities on an equal footing, which means that each measured intensity is divided by the range of that intensity over all samples39. The biological range (the difference between the minimal and the maximal concentration of a certain metabolite) is used as the scaling factor for range scaling18. The advantage of range scaling is that a relative concentration is generated for each variable after removing instrumental response factors39. Range scaling has the property that all levels of variation for the metabolites are treated equally39. Its disadvantage is its sensitivity to outliers, because only two values are used to estimate the biological range18. Range scaling has been used to fuse MS-based metabolomics data39.
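A minimal Python sketch of range scaling follows. It assumes rows of `data` are samples and columns are metabolites, that values are mean-centered before division, and that no metabolite is constant across samples. The name `range_scale` is hypothetical.

```python
from statistics import mean


def range_scale(data):
    """Center each metabolite and divide by its range (max - min) over all samples."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]  # assumes range > 0
    return [[(x - m) / r for x, m, r in zip(row, means, ranges)] for row in data]


# Only the minimum and maximum set the divisor, hence the sensitivity to outliers.
scaled = range_scale([[1, 10], [2, 20], [3, 30]])
```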

RUV-2 (Remove Unwanted Variation-2) is based on a linear model designed for identifying differentially abundant metabolites, which requires the factors of interest along with the factors of unwanted variation6. The advantages of the RUV-2 model include24: (1) the biological factors of interest are not removed along with the unwanted variation; (2) the method can be applied to datasets without internal standards; (3) all unwanted biological variation can be accommodated; (4) it allows for the systematic integration of datasets from different sources; (5) it removes both observed and unobserved unwanted variation. However, RUV-2 is not a global normalization method, since it does not produce a complete normalized dataset7, and it cannot be used prior to unsupervised analyses6. The RUV-2 method has been used for normalizing and integrating MS-based metabolomics data24.

RUV-random (Remove Unwanted Variation-Random) is based on a linear mixed effects model utilizing quality control metabolites to obtain normalized data in metabolomics experiments6. RUV-random method attempts to remove overall unwanted variation6. RUV-random accommodates unwanted biological variation and retains the essential biological variation of interest6. Moreover, the unwanted variation component from any undetected experimental or biological variability can be removed6. This method is applicable in both supervised and unsupervised analysis6. RUV-random is used for removing unwanted variation for MS-based metabolomics data6.

SIS (Single Internal Standard) provides a normalized data matrix by subtracting the log abundance of a single internal standard from the log abundances of the metabolites in each sample24, 40. The SIS method assumes that every metabolite in a sample is subject to the same amount of unwanted variation and that this variation can be measured by a single internal standard24. However, the use of a single internal standard may result in highly variable normalized values, which depend on the internal standard24. The SIS method has been used to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in GC/MS-based metabolomics analysis40.
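The subtraction on the log scale can be sketched as follows. The sketch assumes rows of `data` are samples, all abundances are positive, and the column position of the internal standard is passed as the hypothetical parameter `standard_index`; the standard itself is dropped from the output. The name `sis_normalize` is illustrative.

```python
import math


def sis_normalize(data, standard_index):
    """Subtract the log abundance of one internal standard from every log abundance."""
    normalized = []
    for row in data:
        log_standard = math.log(row[standard_index])
        normalized.append(
            [math.log(x) - log_standard for j, x in enumerate(row) if j != standard_index]
        )
    return normalized


# The second sample is a two-fold dilution of the first; after normalization
# against the standard in column 0 the two samples agree.
normalized = sis_normalize([[2, 4, 8], [1, 2, 4]], standard_index=0)
```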

Total Sum is a method that normalizes the dataset by the sum of squares25. After each sample is scaled using the sum normalization method, the sum of squares of all variables in the sample equals one6, 25. Total sum normalization relies on the self-averaging property6. A sample-specific constant assigns an appropriate weight to each sample, which attempts to minimize possible differences in concentration between samples6. Total sum normalization has been used to correct LC/MS-based metabolomics data41.
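Following the sum-of-squares description above, a minimal Python sketch is given below. It assumes rows of `data` are samples with at least one non-zero value; the name `total_sum_normalize` is hypothetical.

```python
import math


def total_sum_normalize(data):
    """Scale each sample (row) so that the sum of squares of its values equals one."""
    normalized = []
    for row in data:
        norm = math.sqrt(sum(x * x for x in row))  # sample-specific constant
        normalized.append([x / norm for x in row])
    return normalized


# Two samples that differ only in overall concentration become identical.
normalized = total_sum_normalize([[3, 4], [6, 8]])
```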

Vast Scaling (Variable Stability Scaling) is an extension of auto scaling that weights each variable according to a metric of its stability42. This method focuses on stable variables that do not show strong variation, using the standard deviation and the coefficient of variation as scaling factors18. Vast scaling can be used in unsupervised and supervised analysis, but it is not appropriate for large induced variation without group structure18. Moreover, vast scaling is used for enhancing multivariate models for classification and biomarker identification in metabolomics analysis42, and appears to be stable and robust for NMR and GC/MS-based metabolomics data2.
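A small Python sketch of vast scaling follows. It assumes a common formulation in which the auto-scaled value is multiplied by the inverse of the coefficient of variation (mean divided by standard deviation), with rows of `data` as samples and no constant metabolites. The name `vast_scale` is hypothetical.

```python
from statistics import mean, stdev


def vast_scale(data):
    """Auto scale each metabolite, then weight it by mean / standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # assumes stdev > 0
    return [
        [((x - m) / s) * (m / s) for x, m, s in zip(row, means, sds)]
        for row in data
    ]


# Both metabolites have the same coefficient of variation (0.5), so they receive
# the same weight of 2 on top of auto scaling.
scaled = vast_scale([[1, 10], [2, 20], [3, 30]])
```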

VSN (Variance Stabilization Normalization) is one of the non-linear methods aiming to keep the variance constant over the entire data range1, 43. Using the inverse hyperbolic sine, VSN approaches the logarithm for large values and thereby removes heteroscedasticity1. For small intensities, it behaves approximately like a linear transformation and leaves the variance unchanged1. VSN was originally developed for normalizing single and two-channel microarray data44, and is now also used to determine metabolic profiles of liver tissue during early cancer development by GC/MS22.
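The two regimes of the inverse hyperbolic sine can be checked numerically with the sketch below. It shows only the transformation itself: the offset `a` and scale `b` are hypothetical, fixed placeholders, whereas an actual VSN procedure estimates its calibration parameters from the data. The name `asinh_transform` is illustrative.

```python
import math


def asinh_transform(x, a=0.0, b=1.0):
    """Inverse hyperbolic sine of a linear function of x (a and b are placeholders)."""
    return math.asinh(a + b * x)


large, small = 1e6, 1e-4
# For large values the transform behaves like a logarithm: asinh(x) ~ log(2x).
log_like_gap = abs(asinh_transform(large) - math.log(2 * large))
# For small values it behaves like the identity: asinh(x) ~ x.
linear_gap = abs(asinh_transform(small) - small)
```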

As an extension of auto scaling, Vast Scaling scales the metabolic signals based on the coefficient of variation12. Vast scaling has been used to identify prognostic factors for breast cancer patients from magnetic resonance-based metabolomics34.

References:

1. Kohl, S.M. et al. State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics 8, 146-160 (2012).

2. Gromski, P.S., Xu, Y., Hollywood, K.A., Turner, M.L. & Goodacre, R. The influence of scaling metabolomics data on model classification accuracy. Metabolomics 11, 684-695 (2015).

3. Weber, C.M. et al. Evaluation of a gas sensor array and pattern recognition for the identification of bladder cancer from urine headspace. Analyst 136, 359-364 (2011).

4. Struck, W. et al. Liquid chromatography tandem mass spectrometry study of urinary nucleosides as potential cancer markers. J. Chromatogr. A 1283, 122-131 (2013).

5. Redestig, H. et al. Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data. Anal. Chem. 81, 7974-7980 (2009).

6. De Livera, A.M. et al. Statistical methods for handling unwanted variation in metabolomics data. Anal. Chem. 87, 3606-3615 (2015).

7. Jauhiainen, A. et al. Normalization of metabolomics data with applications to correlation maps. Bioinformatics 30, 2155-2161 (2014).

8. Astrand, M. Contrast normalization of oligonucleotide arrays. J. Comput. Biol. 10, 95-102 (2003).

9. Shi, X. et al. Metabolomic analysis of the effects of polychlorinated biphenyls in nonalcoholic fatty liver disease. J. Proteome Res. 11, 3805-3815 (2012).

10. Saccenti, E. Correlation Patterns in Experimental Data Are Affected by Normalization Procedures: Consequences for Data Analysis and Network Inference. J. Proteome Res. 16, 619-634 (2017).

11. Workman, C. et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3, research0048 (2002).

12. Contrepois, K., Jiang, L. & Snyder, M. Optimized Analytical Procedures for the Untargeted Metabolomic Profiling of Human Urine and Plasma by Combining Hydrophilic Interaction (HILIC) and Reverse-Phase Liquid Chromatography (RPLC)-Mass Spectrometry. Molecular & cellular proteomics : MCP 14, 1684-1695 (2015).

13. Cleveland, W.S. & Devlin, S.J. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association 83, 596-610 (1988).

14. Ballman, K.V., Grill, D.E., Oberg, A.L. & Therneau, T.M. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics 20, 2778-2786 (2004).

15. Ejigu, B.A. et al. Evaluation of normalization methods to pave the way towards large-scale LC-MS-based metabolomics profiling experiments. OMICS 17, 473-485 (2013).

16. Valikangas, T., Suomi, T. & Elo, L.L. A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings in bioinformatics (2016).

17. Karpievitch, Y.V., Nikolic, S.B., Wilson, R., Sharman, J.E. & Edwards, L.M. Metabolomics data normalization with EigenMS. PLoS One 9, e116221 (2014).

18. van den Berg, R.A., Hoefsloot, H.C.J., Westerhuis, J.A., Smilde, A.K. & van der Werf, M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).

19. Yuan, Y. et al. Metabolomic analyses of banana during postharvest senescence by 1H-high resolution-NMR. Food Chem. 218, 406-412 (2017).

20. Backshall, A., Sharma, R., Clarke, S.J. & Keun, H.C. Pharmacometabonomic profiling as a predictor of toxicity in patients with inoperable colorectal cancer treated with capecitabine. Clin. Cancer Res. 17, 3019-3028 (2011).

21. Klepacki, J. et al. Amino acids in a targeted versus a non-targeted metabolomics LC-MS/MS assay. Are the results consistent? Clinical biochemistry 49, 955-961 (2016).

22. Ibarra, R. et al. Metabolomic analysis of liver tissue from the VX2 rabbit model of secondary liver tumors. HPB Surg. 2014, 310372 (2014).

23. Andjelkovic, V. & Thompson, R. Changes in gene expression in maize kernel in response to water and salt stress. Plant cell reports 25, 71-79 (2006).

24. De Livera, A.M. et al. Normalizing and integrating metabolomics data. Anal. Chem. 84, 10768-10776 (2012).

25. De Livera, A.M., Olshansky, M. & Speed, T.P. Statistical analysis of metabolomics data. Metabolomics Tools for Natural Product Discovery: Methods and Protocols, 291-307 (2013).

26. Ting, L. et al. Normalization and statistical analysis of quantitative proteomics data generated by metabolic labeling. Molecular & cellular proteomics : MCP 8, 2227-2242 (2009).

27. Warrack, B.M. et al. Normalization strategies for metabonomic analysis of urine samples. J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. 877, 547-552 (2009).

28. Jacob, C.C., Dervilly-Pinel, G., Biancotto, G. & Le Bizec, B. Evaluation of specific gravity as normalization strategy for cattle urinary metabolome analysis. Metabolomics 10, 627-637 (2014).

29. Chen, Y. et al. Combination of injection volume calibration by creatinine and MS signals' normalization to overcome urine variability in LC-MS-based metabolomics studies. Anal. Chem. 85, 7659-7665 (2013).

30. Craig, A., Cloarec, O., Holmes, E., Nicholson, J.K. & Lindon, J.C. Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Anal. Chem. 78, 2262-2267 (2006).

31. Sysi-Aho, M., Katajamaa, M., Yetukuri, L. & Oresic, M. Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC bioinformatics 8, 93 (2007).

32. Eriksson, L. et al. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Anal. Bioanal. Chem. 380, 419-429 (2004).

33. Yang, J., Zhao, X., Lu, X., Lin, X. & Xu, G. A data preprocessing strategy for metabolomics to reduce the mask effect in data analysis. Front. Mol. Biosci. 2, 4 (2015).

34. Leichtle, A.B. et al. Serum amino acid profiles and their alterations in colorectal cancer. Metabolomics 8, 643-653 (2012).

35. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281-4290 (2006).

36. Kirwan, J.A., Weber, R.J., Broadhurst, D.I. & Viant, M.R. Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control. Scientific data 1, 140012 (2014).

37. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-193 (2003).

38. Lee, J. et al. Quantile normalization approach for liquid chromatography-mass spectrometry-based metabolomic data from healthy human volunteers. Analytical sciences : the international journal of the Japan Society for Analytical Chemistry 28, 801-805 (2012).

39. Smilde, A.K., van der Werf, M.J., Bijlsma, S., van der Werff-van der Vat, B.J. & Jellema, R.H. Fusion of mass spectrometry-based metabolomics data. Anal. Chem. 77, 6729-6736 (2005).

40. Gullberg, J., Jonsson, P., Nordstrom, A., Sjostrom, M. & Moritz, T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Analytical biochemistry 331, 283-295 (2004).

41. Vogl, F.C. et al. Evaluation of dilution and normalization strategies to correct for urinary output in HPLC-HRTOFMS metabolomics. Analytical and bioanalytical chemistry 408, 8483-8493 (2016).

42. Keun, H.C. et al. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Anal Chim Acta 490, 265-276 (2003).

43. Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A. & Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl 1, S96-104 (2002).

44. Kultima, K. et al. Development and evaluation of normalization methods for label-free relative quantification of endogenous peptides. Mol. Cell Proteomics 8, 2285-2295 (2009).