Description of normalization methods

Auto Scaling (Unit Variance Scaling, UV) is one of the simplest methods for adjusting metabolic variances; it scales metabolic signals by the standard deviation of the metabolomics data¹. The method scales all metabolites to unit variance, so that all metabolites become equally important and comparably scaled². After auto scaling, the standard deviation of every metabolite is one and the data are analyzed on the basis of correlations¹. The disadvantage of auto scaling is that analytical errors may be amplified due to dilution effects¹. Auto scaling has been used to improve the diagnosis of bladder cancer using gas sensor arrays³ and to identify urinary nucleoside markers from urogenital cancer patients by mass spectrometry (MS)-based metabolomics⁴.
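
The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that samples are rows and metabolites are columns, that each metabolite is mean-centered first, and that the sample standard deviation (n - 1 denominator) is used; the function name `auto_scale` and the example values are hypothetical.

```python
from statistics import mean, stdev

def auto_scale(data):
    """Center each metabolite (column) and divide by its standard deviation.

    Assumes no metabolite is constant across samples (stdev would be zero).
    """
    scaled_columns = []
    for col in zip(*data):
        m, s = mean(col), stdev(col)
        scaled_columns.append([(x - m) / s for x in col])
    # Transpose back to samples x metabolites.
    return [list(row) for row in zip(*scaled_columns)]

# Two metabolites on very different intensity scales (illustrative values).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = auto_scale(data)
```

After scaling, both metabolites have unit variance and therefore contribute equally, regardless of their original intensity range.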

CCMN (Cross-Contribution Compensating Multiple Standard Normalization, CRMN) uses multiple internal standards to monitor systematic error in randomized and designed experiments⁵. CCMN compensates for systematic cross-contribution effects that can be traced back to a linear association with the experimental design⁵, and is superior at purifying the signal of interest⁵. However, care needs to be taken when the data are normalized using the factors of interest before unsupervised analysis is carried out⁶. CCMN is mainly aimed at MS-based metabolomics data, and its inclusion will improve the precision of current metabolite profiling protocols⁷.

Contrast (Contrast Normalization) is derived from the integration of MA-plots and logged Bland-Altman plots and assumes the presence of non-linear biases¹. The input data are log-transformed and mapped into a contrast space by means of an orthonormal transformation matrix¹. However, the log function cannot process zeros or negative numbers, so non-positive values must first be converted to an extremely small positive value¹. Contrast normalization has been applied to normalize feature intensities in oligonucleotide arrays⁸ and has also been employed in MS-based metabolic profiling to reveal the role of polychlorinated biphenyls in non-alcoholic fatty liver disease⁹.

Cubic Splines is a non-linear baseline method that assumes non-linear relationships between the baseline and the individual spectra¹. Like quantile normalization, cubic splines aims to make the distribution of the metabolite concentrations similar across all samples¹⁰. The geometric or arithmetic mean of the concentrations of each metabolite across all samples is regarded as the baseline sample¹⁰. A set of evenly distributed quantiles from both the baseline and target samples is used to fit a smooth cubic spline¹⁰. Finally, a spline function generator uses the generated set of interpolated splines to fit the parameters of a natural cubic spline¹⁰. Cubic splines has been adopted to reduce variability in DNA microarray experiments by normalizing all signal channels to a target array¹¹. It has also been applied in MS-based metabolomics to improve the comprehensiveness of global metabolic profiling of body fluids¹².

Like contrast normalization, Cyclic Loess (Cyclic Locally Weighted Regression) originates from the combination of the MA-plot and the logged Bland-Altman plot and assumes the existence of non-linear bias¹; it can estimate a regression surface using a multivariate smoothing procedure¹³. However, cyclic loess is one of the most time-consuming normalization methods, and its running time grows exponentially as the number of samples increases¹⁴. Cyclic loess has been applied in MS-based metabolomics profiling, where it was able to remove systematic effects¹⁵.

EigenMS removes bias of unknown complexity from liquid chromatography coupled with mass spectrometry (LC/MS)-based metabolomics data, allowing for increased sensitivity in differential analysis. EigenMS normalization aims to preserve the original differences while removing the bias from the data¹⁶. It works in three steps¹⁷: (1) EigenMS preserves the true differences in the metabolomics data by estimating treatment effects with an ANOVA model; (2) singular value decomposition of the residuals matrix is used to determine bias trends in the data; (3) the number of bias trends is estimated via a permutation test and the effects of the bias trends are eliminated. EigenMS has been applied in MS-based quantitative label-free proteomics profiling¹⁶ and in MS-based metabolomics analysis¹⁷.

Level Scaling transforms metabolic signal variation into variation relative to the average metabolic signal by scaling according to the mean signal, so the resulting values are percentage changes compared to the mean concentration¹⁸. This method is especially suitable when large relative variations are of great interest (e.g., when studying stress responses)¹⁸. Level scaling is used to identify biomarkers on the basis of relative response, but its disadvantage is the inflation of measurement errors¹⁸. Level scaling has been used to identify urinary nucleoside markers from urogenital cancer patients in MS-based metabolomics analysis⁴.
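
As a rough illustration of scaling by the mean, the following Python sketch centers each metabolite and divides by its mean, so that values read as fractional changes relative to the mean. It assumes samples are rows and metabolites are columns; `level_scale` and the example values are hypothetical.

```python
from statistics import mean

def level_scale(data):
    """Express each value as its deviation from the metabolite mean, relative to that mean."""
    scaled_columns = []
    for col in zip(*data):
        m = mean(col)
        scaled_columns.append([(x - m) / m for x in col])
    # Transpose back to samples x metabolites.
    return [list(row) for row in zip(*scaled_columns)]

data = [[50.0, 2.0], [100.0, 4.0], [150.0, 6.0]]
result = level_scale(data)
# Both metabolites vary by -50 % to +50 % around their mean,
# although their absolute intensities differ by a factor of 25.
```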

Linear Baseline (Linear Baseline Scaling) maps each spectrum to the baseline based on the assumption of a constant linear relationship between each feature of a given spectrum and the baseline¹. The baseline is the median of each feature across all spectra, and the scaling factor is computed as the ratio of the mean intensity of the baseline to the mean intensity of each spectrum¹. The intensities of each spectrum are then multiplied by its particular scaling factor¹. However, the assumption of a linear correlation among sample spectra may be oversimplified¹. This method has been used to identify differential metabolomics profiles among five senescence stages of banana¹⁹. Moreover, linear baseline scaling has been applied to normalize nuclear magnetic resonance (NMR)-based metabolomics data²⁰ and MS-based metabolomics data¹⁵.
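
The baseline and scaling factor described above can be sketched as follows. This is a minimal Python illustration, assuming each spectrum is a row of feature intensities; `linear_baseline_scale` and the example spectra are hypothetical.

```python
from statistics import mean, median

def linear_baseline_scale(spectra):
    """Scale each spectrum so its mean intensity matches that of the median baseline."""
    # Baseline: median of each feature across all spectra.
    baseline = [median(feature) for feature in zip(*spectra)]
    baseline_mean = mean(baseline)
    scaled = []
    for spectrum in spectra:
        # Scaling factor: mean baseline intensity over mean spectrum intensity.
        factor = baseline_mean / mean(spectrum)
        scaled.append([x * factor for x in spectrum])
    return scaled

# Three spectra that differ only by a constant factor (illustrative values).
spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]
scaled = linear_baseline_scale(spectra)
```

In this example the spectra differ only by a constant factor, which is exactly the situation the linearity assumption describes, so all three are mapped onto the same baseline.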

Log-transform converts skewed metabolomics data into symmetric data by a non-linear transformation¹⁸. The method turns multiplicative relationships between metabolites into additive ones¹⁸. Log transformation perfectly removes heteroscedasticity when the relative standard deviation is constant¹⁸. Its disadvantage is that it cannot deal with the value zero¹⁸. Furthermore, its effect on values with a large relative analytical standard deviation is problematic¹⁸. Log transformation has been used to compare plasma amino acid patterns in LC/MS-based metabolomics analysis²¹ and to normalize data in metabolomics analysis based on gas chromatography coupled with mass spectrometry (GC/MS)²².
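
A short Python sketch can make the two properties above concrete: multiplicative differences become additive, and zero cannot be transformed. The base-10 logarithm, the name `log_transform` and the example values are assumptions made for illustration.

```python
import math

def log_transform(data):
    """Apply log10 to every value; raises ValueError for zero or negative values."""
    return [[math.log10(x) for x in sample] for sample in data]

data = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0]]
logged = log_transform(data)

# A tenfold (multiplicative) step becomes a constant additive step of about 1.
steps = [row[1] - row[0] for row in logged]

# Zero is outside the domain of the logarithm.
try:
    math.log10(0.0)
    zero_ok = True
except ValueError:
    zero_ok = False
```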

Mean Normalization normalizes the data by the mean value of all signals to eliminate background effects²³. The intensity of each metabolite in a given sample is divided by the mean intensity of all variables in that sample²⁴. To make the samples comparable, the mean intensities of all experimental runs are forced to be equal to one another¹⁵. For example, each sample is scaled such that the mean of all abundances in the sample equals one²⁴. This method has been applied to normalize MS-based metabolomics data¹⁵.
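
The example of scaling each sample to a mean of one can be sketched in Python as follows, assuming each sample is a row of metabolite abundances; `mean_normalize` and the values are hypothetical.

```python
from statistics import mean

def mean_normalize(data):
    """Divide every abundance in a sample by that sample's mean abundance."""
    normalized = []
    for sample in data:
        sample_mean = mean(sample)
        normalized.append([x / sample_mean for x in sample])
    return normalized

# The second sample is a threefold more concentrated version of the first.
data = [[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]
normalized = mean_normalize(data)
```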

Median Normalization is based on the assumption that the samples of a dataset are separated by a constant. It scales the samples so that they have the same median. For example, each sample can be scaled so that the median of its metabolite abundances equals one²⁵. Median normalization is commonly used, does not require internal standards, and is more practical than sum normalization, especially in situations where several saturated abundances may be associated with some of the factors of interest²⁵. Median normalization has previously been used in MS-based proteomics analysis²⁶ and metabolomics analysis¹⁵.
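
A minimal Python sketch of scaling each sample to a median of one is given below. It assumes each sample is a row of abundances with a non-zero median; `median_normalize` and the values are hypothetical.

```python
from statistics import median

def median_normalize(data):
    """Divide every abundance in a sample by that sample's median abundance."""
    normalized = []
    for sample in data:
        sample_median = median(sample)
        normalized.append([x / sample_median for x in sample])
    return normalized

# The last value of the second sample is unusually high (for example, saturated).
data = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
normalized = median_normalize(data)
```

Because the median ignores the size of the largest value, the first two metabolites of both samples end up on the same scale even though the third is inflated in the second sample.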

MSTUS (MS Total Useful Signal) utilizes the total signal of the metabolites that are shared by all samples, under the assumption that the numbers of increased and decreased metabolic signals are roughly equal^{27, 28}. Using MSTUS, the concentration of each metabolite is divided by the sum of the concentrations of all the measured metabolites in a given sample¹⁰. However, the validity of this assumption is questionable, since an increase in the concentration of one metabolite is not necessarily accompanied by a decrease in that of another^{28, 29}. MSTUS is a more recent technique, typically used to normalize NMR-based metabolomics data³⁰ and LC/MS-based metabolomics data¹¹.
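
One possible reading of the description above is sketched below in Python: the normalization factor of a sample is the summed signal of the features detected in every sample. Marking undetected features with `None`, as well as the name `mstus_normalize` and the values, are assumptions made for this illustration.

```python
def mstus_normalize(data):
    """Divide each sample by the total signal of features detected in all samples."""
    n_features = len(data[0])
    # Features detected (not None) in every sample form the "useful signal".
    shared = [j for j in range(n_features)
              if all(sample[j] is not None for sample in data)]
    normalized = []
    for sample in data:
        total_useful = sum(sample[j] for j in shared)
        normalized.append([None if x is None else x / total_useful
                           for x in sample])
    return normalized

# The third feature is missing from the first sample, so it is excluded
# from the normalization factor of both samples.
data = [[10.0, 30.0, None], [20.0, 60.0, 20.0]]
normalized = mstus_normalize(data)
```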

NOMIS (Normalization using Optimal Selection of Multiple Internal Standards) finds an optimal normalization factor to remove unwanted systematic variation, using variability information from multiple internal standard compounds³¹. The NOMIS method can select the best combination of standard compounds for normalization using multiple linear regression³¹ and removes all correlations with the standards⁵. This method has a superior ability to reduce variability across the full spectrum of metabolites³¹. Moreover, NOMIS can be used in both supervised and unsupervised analysis⁶. NOMIS has been used to normalize LC/MS-based metabolomics data³¹.

Pareto Scaling uses the square root of the standard deviation of the data as the scaling factor³². Pareto scaling is able to reduce the weight of large fold changes in metabolite signals, reportedly more significantly than auto scaling¹. However, the dominant weight of extremely large fold changes may still remain¹, so the disadvantage of Pareto scaling is its sensitivity to large fold changes¹⁸. Pareto scaling has been used to reduce the mask effect of abundant metabolites in an LC/MS-based metabolomics dataset³³.
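
The scaling factor described above can be sketched in Python as follows. The sketch assumes that samples are rows, that each metabolite is mean-centered before scaling, and that the sample standard deviation is used; `pareto_scale` and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def pareto_scale(data):
    """Center each metabolite and divide by the square root of its standard deviation."""
    scaled_columns = []
    for col in zip(*data):
        m, s = mean(col), stdev(col)
        scaled_columns.append([(x - m) / sqrt(s) for x in col])
    # Transpose back to samples x metabolites.
    return [list(row) for row in zip(*scaled_columns)]

data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = pareto_scale(data)
spread = [stdev(col) for col in zip(*scaled)]
```

In this example the standard deviations of the two metabolites differ by a factor of 200 before scaling and by a factor of about 14 (the square root of 200) afterwards, so the data are not brought to unit variance.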

Power Scaling aims at correcting for heteroscedasticity and at pseudo scaling¹⁸. Power scaling shows a transformation pattern similar to that of the log transformation, but it is not able to make multiplicative effects additive¹⁸. Unlike log transformation, power scaling can handle zero values¹⁸. Power scaling reduces heteroscedasticity without problems with small values, but its disadvantage is that the choice of the square root is arbitrary¹⁸. Power scaling has been used to study serum amino acid profiles and their variations in colorectal cancer patients in MS-based metabolomics³⁴.
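
A minimal Python sketch of the square-root transformation is shown below, mainly to illustrate that zero values pass through without special handling. Any subsequent centering is omitted here; `power_transform` and the values are hypothetical.

```python
import math

def power_transform(data):
    """Take the square root of every (non-negative) value."""
    return [[math.sqrt(x) for x in sample] for sample in data]

# Unlike the logarithm, the square root is defined at zero.
data = [[0.0, 4.0], [9.0, 16.0]]
transformed = power_transform(data)
```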

PQN (Probabilistic Quotient Normalization) transforms the metabolomics spectra according to an overall estimate of the most probable dilution³⁵. This algorithm has been reported to be significantly more robust and accurate than integral normalization and vector length normalization³⁵. The PQN procedure consists of three steps¹: (1) perform an integral normalization of each spectrum, then select a reference spectrum such as the median spectrum; (2) calculate, for each variable, the quotient between a given test spectrum and the reference spectrum, then estimate the median of all these quotients; (3) divide all variables of the test spectrum by the median quotient. PQN is a robust method to account for dilution of complex biological mixtures in NMR metabolomics analysis³⁵. More recently, PQN has also been used to reduce unwanted variance in a direct infusion MS metabolomics dataset³⁶.
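
The three steps above can be sketched in Python as follows. This is an illustrative reading rather than a reference implementation: it assumes each spectrum is a row of non-negative intensities, uses the median spectrum as the reference, and skips variables whose reference value is zero; `pqn_normalize` and the example spectra are hypothetical.

```python
from statistics import median

def pqn_normalize(spectra):
    """Probabilistic quotient normalization with the median spectrum as reference."""
    # Step 1: integral normalization (divide each spectrum by its total intensity).
    integral = []
    for spectrum in spectra:
        total = sum(spectrum)
        integral.append([x / total for x in spectrum])
    # Reference spectrum: per-variable median across the normalized spectra.
    reference = [median(variable) for variable in zip(*integral)]
    normalized = []
    for spectrum in integral:
        # Step 2: quotient per variable, then the median of all quotients.
        quotients = [x / r for x, r in zip(spectrum, reference) if r > 0]
        dilution = median(quotients)
        # Step 3: divide every variable by the median quotient.
        normalized.append([x / dilution for x in spectrum])
    return normalized

# The third spectrum has one strongly elevated signal; the others are unchanged.
spectra = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 34.0]]
normalized = pqn_normalize(spectra)
```

In this example, integral normalization alone would shrink the three unchanged signals of the third spectrum because of its single large peak; the median quotient restores them to the same level as in the other spectra.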

Quantile (Quantile Normalization) aims at achieving the same distribution of metabolic feature intensities across all samples, and the quantile-quantile plot is used in this method to visualize the similarity of distributions¹. Quantile normalization is motivated by the idea that the distributions of two data vectors are the same if their quantile-quantile plot is a straight diagonal line³⁷. While a common and non-data driven distribution is generated using quantile normalization, an agreed standard could not be reached³⁷. Quantile normalization has been adopted for high density oligonucleotide array data³⁷, to improve NMR-based metabolomics analysis¹ and to reduce non-biological systematic variation in LC/MS-based metabolomics data³⁸.
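
A common way to give all samples the same distribution is to replace each value by the mean of the values sharing its rank across samples. The Python sketch below follows that idea under simplifying assumptions: samples are rows of equal length, and ties are broken by position rather than averaged; `quantile_normalize` and the values are hypothetical.

```python
def quantile_normalize(data):
    """Give every sample the same distribution: the mean of the rank-ordered values."""
    n_features = len(data[0])
    sorted_samples = [sorted(sample) for sample in data]
    # Target distribution: mean across samples of the values at each rank.
    rank_means = [sum(s[r] for s in sorted_samples) / len(data)
                  for r in range(n_features)]
    normalized = []
    for sample in data:
        # Feature indices ordered from the smallest to the largest value.
        order = sorted(range(n_features), key=lambda j: sample[j])
        out = [0.0] * n_features
        for rank, j in enumerate(order):
            out[j] = rank_means[rank]
        normalized.append(out)
    return normalized

data = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(data)
```

After normalization every sample contains exactly the same set of values, so a quantile-quantile plot of any two samples lies on the diagonal.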

Range Scaling is applied to put all measured intensities on an equal footing, which means that each measured intensity is divided by the range of those intensities over all samples³⁹. The biological range (the difference between the minimal and the maximal concentration of a certain metabolite) is used as the scaling factor¹⁸. The advantage of range scaling is that a relative concentration is generated for each variable after removing instrumental response factors³⁹. Range scaling has the property that all levels of variation of the metabolites are treated equally³⁹. Its disadvantage is its sensitivity to outliers, because only two values are used to estimate the biological range¹⁸. Range scaling has been used to fuse MS-based metabolomics data³⁹.
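
The use of the biological range as scaling factor can be sketched in Python as follows, assuming samples are rows and each metabolite is mean-centered before division; `range_scale` and the values are hypothetical.

```python
from statistics import mean

def range_scale(data):
    """Center each metabolite and divide by its range (maximum minus minimum)."""
    scaled_columns = []
    for col in zip(*data):
        m = mean(col)
        value_range = max(col) - min(col)  # estimated from two values only
        scaled_columns.append([(x - m) / value_range for x in col])
    # Transpose back to samples x metabolites.
    return [list(row) for row in zip(*scaled_columns)]

data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = range_scale(data)
```

Since the range depends only on the minimum and the maximum, a single outlying sample changes the scaling factor of that metabolite for all samples.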

RUV-2 (Remove Unwanted Variation-2) is based on a linear model designed for identifying differentially abundant metabolites, which requires the factors of interest along with the factors of unwanted variation⁶. The advantages of the RUV-2 model include²⁴: (1) the biological factors of interest are not removed along with the unwanted variation; (2) the method can be applied to datasets without internal standards; (3) all unwanted biological variation can be accommodated; (4) it allows for the systematic integration of datasets from different sources; (5) it removes both observed and unobserved unwanted variation. However, RUV-2 is not a global normalization method, as it does not yield a complete normalized dataset⁷, and it cannot be used prior to unsupervised analyses⁶. RUV-2 has been used for normalizing and integrating MS-based metabolomics data²⁴.

RUV-random (Remove Unwanted Variation-Random) is based on a linear mixed effects model that utilizes quality control metabolites to obtain normalized data in metabolomics experiments⁶. RUV-random attempts to remove overall unwanted variation⁶. It accommodates unwanted biological variation and retains the essential biological variation of interest⁶. Moreover, the unwanted variation component arising from any undetected experimental or biological variability can be removed⁶. This method is applicable in both supervised and unsupervised analysis⁶. RUV-random has been used to remove unwanted variation from MS-based metabolomics data⁶.

SIS (Single Internal Standard) provides a normalized data matrix by subtracting the log abundance of a single internal standard from the log abundances of the metabolites in each sample^{24, 40}. The SIS method assumes that every metabolite in a sample is subject to the same amount of unwanted variation and that this variation can be measured by a single internal standard²⁴. However, the use of a single internal standard may result in highly variable normalized values, which depend on the chosen internal standard²⁴. The SIS method has been used to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in GC/MS-based metabolomics analysis⁴⁰.
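
The subtraction of log abundances described above can be sketched in Python as follows. The sketch assumes that each sample is a row of positive abundances, that the internal standard sits at a known column index, and that the natural logarithm is used; `sis_normalize`, `standard_index` and the values are hypothetical.

```python
import math

def sis_normalize(data, standard_index):
    """Subtract the log abundance of one internal standard from each log abundance."""
    normalized = []
    for sample in data:
        log_standard = math.log(sample[standard_index])
        # The internal standard column itself is dropped from the output.
        normalized.append([math.log(x) - log_standard
                           for j, x in enumerate(sample) if j != standard_index])
    return normalized

# Column 0 holds the internal standard; the second sample is twice as concentrated.
data = [[10.0, 20.0, 40.0], [20.0, 40.0, 80.0]]
normalized = sis_normalize(data, standard_index=0)
```

Every normalized value depends on the single standard measurement of its sample, which is why noise in that one measurement propagates to all metabolites.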

Total Sum is a method that normalizes the dataset by the sum of squares²⁵. After each sample is scaled using sum normalization, the sum of squares of all variables in the sample equals one^{6, 25}. Total sum normalization relies on the self-averaging property⁶. A sample-specific constant assigns an appropriate weight to each sample in an attempt to minimize possible differences in concentration between samples⁶. Total sum normalization has been used to correct LC/MS-based metabolomics data⁴¹.
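
Following the description above, in which the sum of squares of each sample equals one after scaling, a minimal Python sketch could look like this. It assumes each sample is a row of abundances; `sum_of_squares_normalize` and the values are hypothetical.

```python
import math

def sum_of_squares_normalize(data):
    """Scale each sample so that the sum of its squared abundances equals one."""
    normalized = []
    for sample in data:
        # Sample-specific constant: square root of the sum of squares.
        norm = math.sqrt(sum(x * x for x in sample))
        normalized.append([x / norm for x in sample])
    return normalized

# The second sample is a twofold more concentrated version of the first.
data = [[3.0, 4.0], [6.0, 8.0]]
normalized = sum_of_squares_normalize(data)
```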

Vast Scaling (Variable Stability Scaling) is an extension of auto scaling that weights each variable according to a metric of its stability⁴². The method focuses on stable variables that do not show strong variation, using the standard deviation and the coefficient of variation as scaling factors¹⁸. Vast scaling can be used in unsupervised and supervised analysis, but it is not appropriate for large induced variation without group structure¹⁸. Moreover, vast scaling is used to enhance multivariate models for classification and biomarker identification in metabolomics analysis⁴², and it appears to be stable and robust for NMR and GC/MS-based metabolomics data².
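
One way to combine the two scaling factors is to auto scale each metabolite and then multiply by the inverse of its coefficient of variation (mean divided by standard deviation). The Python sketch below follows that formulation as an assumption; `vast_scale` and the values are hypothetical.

```python
from statistics import mean, stdev

def vast_scale(data):
    """Auto scale each metabolite, then weight it by mean / standard deviation."""
    scaled_columns = []
    for col in zip(*data):
        m, s = mean(col), stdev(col)
        stability = m / s  # inverse coefficient of variation
        scaled_columns.append([(x - m) / s * stability for x in col])
    # Transpose back to samples x metabolites.
    return [list(row) for row in zip(*scaled_columns)]

data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = vast_scale(data)
spread = [stdev(col) for col in zip(*scaled)]
```

The first metabolite has the smaller coefficient of variation (0.5 versus about 0.67), so it is treated as more stable and keeps the larger spread after scaling.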

VSN (Variance Stabilization Normalization) is a non-linear method that aims to keep the variance constant over the entire data range^{1, 43}. To remove heteroscedasticity, VSN uses the inverse hyperbolic sine, which approaches the logarithm for large values¹. For small intensities, the transformation behaves approximately linearly, leaving the variance unchanged¹. VSN was originally developed for normalizing single and two-channel microarray data⁴⁴, and it has also been used to determine metabolic profiles of liver tissue during early cancer development by GC/MS²².
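
The two regimes of the inverse hyperbolic sine can be checked numerically. The Python sketch below only illustrates the transformation itself; to my understanding, the full VSN procedure additionally estimates calibration parameters from the data, which is omitted here. The name `arsinh_transform` and the test values are hypothetical.

```python
import math

def arsinh_transform(values):
    """Apply the inverse hyperbolic sine, arsinh(x) = ln(x + sqrt(x^2 + 1))."""
    return [math.asinh(x) for x in values]

small, large = 1e-3, 1e4
t_small, t_large = arsinh_transform([small, large])

# For small intensities the transformation is close to the identity (linear).
linear_error = abs(t_small - small)
# For large intensities it is close to the logarithm, here ln(2x).
log_error = abs(t_large - math.log(2 * large))
```

Unlike the plain logarithm, the transformation is defined at zero, where it returns zero.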

As an extension of auto scaling, Vast Scaling scales the metabolic signals based on the coefficient of variation¹². Vast scaling has been used to identify prognostic factors for breast cancer patients in magnetic resonance-based metabolomics³⁴.

References:

1. Kohl, S.M. et al. State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics 8, 146-160 (2012).

2. Gromski, P.S., Xu, Y., Hollywood, K.A., Turner, M.L. & Goodacre, R. The influence of scaling metabolomics data on model classification accuracy. Metabolomics 11, 684-695 (2015).

3. Weber, C.M. et al. Evaluation of a gas sensor array and pattern recognition for the identification of bladder cancer from urine headspace. Analyst 136, 359-364 (2011).

4. Struck, W. et al. Liquid chromatography tandem mass spectrometry study of urinary nucleosides as potential cancer markers. J. Chromatogr. A 1283, 122-131 (2013).

5. Redestig, H. et al. Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data. Analytical chemistry 81, 7974-7980 (2009).

6. De Livera, A.M. et al. Statistical methods for handling unwanted variation in metabolomics data. Analytical chemistry 87, 3606-3615 (2015).

7. Jauhiainen, A. et al. Normalization of metabolomics data with applications to correlation maps. Bioinformatics 30, 2155-2161 (2014).

8. Astrand, M. Contrast normalization of oligonucleotide arrays. J. Comput. Biol. 10, 95-102 (2003).

9. Shi, X. et al. Metabolomic analysis of the effects of polychlorinated biphenyls in nonalcoholic fatty liver disease. J. Proteome Res. 11, 3805-3815 (2012).

10. Saccenti, E. Correlation Patterns in Experimental Data Are Affected by Normalization Procedures: Consequences for Data Analysis and Network Inference. Journal of proteome research 16, 619-634 (2017).

11. Workman, C. et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3, research0048 (2002).

12. Contrepois, K., Jiang, L. & Snyder, M. Optimized Analytical Procedures for the Untargeted Metabolomic Profiling of Human Urine and Plasma by Combining Hydrophilic Interaction (HILIC) and Reverse-Phase Liquid Chromatography (RPLC)-Mass Spectrometry. Molecular & cellular proteomics : MCP 14, 1684-1695 (2015).

13. Cleveland, W.S. & Devlin, S.J. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association 83, 596-610 (1988).

14. Ballman, K.V., Grill, D.E., Oberg, A.L. & Therneau, T.M. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics 20, 2778-2786 (2004).

15. Ejigu, B.A. et al. Evaluation of normalization methods to pave the way towards large-scale LC-MS-based metabolomics profiling experiments. OMICS 17, 473-485 (2013).

16. Valikangas, T., Suomi, T. & Elo, L.L. A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings in bioinformatics (2016).

17. Karpievitch, Y.V., Nikolic, S.B., Wilson, R., Sharman, J.E. & Edwards, L.M. Metabolomics data normalization with EigenMS. PLoS One 9, e116221 (2014).

18. van den Berg, R.A., Hoefsloot, H.C.J., Westerhuis, J.A., Smilde, A.K. & van der Werf, M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).

19. Yuan, Y. et al. Metabolomic analyses of banana during postharvest senescence by 1H-high resolution-NMR. Food Chem. 218, 406-412 (2017).

20. Backshall, A., Sharma, R., Clarke, S.J. & Keun, H.C. Pharmacometabonomic profiling as a predictor of toxicity in patients with inoperable colorectal cancer treated with capecitabine. Clin. Cancer Res. 17, 3019-3028 (2011).

21. Klepacki, J. et al. Amino acids in a targeted versus a non-targeted metabolomics LC-MS/MS assay. Are the results consistent? Clinical biochemistry 49, 955-961 (2016).

22. Ibarra, R. et al. Metabolomic analysis of liver tissue from the VX2 rabbit model of secondary liver tumors. HPB Surg. 2014, 310372 (2014).

23. Andjelkovic, V. & Thompson, R. Changes in gene expression in maize kernel in response to water and salt stress. Plant cell reports 25, 71-79 (2006).

24. De Livera, A.M. et al. Normalizing and integrating metabolomics data. Analytical chemistry 84, 10768-10776 (2012).

25. De Livera, A.M., Olshansky, M. & Speed, T.P. Statistical analysis of metabolomics data. Metabolomics Tools for Natural Product Discovery: Methods and Protocols, 291-307 (2013).

26. Ting, L. et al. Normalization and statistical analysis of quantitative proteomics data generated by metabolic labeling. Molecular & cellular proteomics : MCP 8, 2227-2242 (2009).

27. Warrack, B.M. et al. Normalization strategies for metabonomic analysis of urine samples. J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. 877, 547-552 (2009).

28. Jacob, C.C., Dervilly-Pinel, G., Biancotto, G. & Le Bizec, B. Evaluation of specific gravity as normalization strategy for cattle urinary metabolome analysis. Metabolomics 10, 627-637 (2014).

29. Chen, Y. et al. Combination of injection volume calibration by creatinine and MS signals' normalization to overcome urine variability in LC-MS-based metabolomics studies. Anal. Chem. 85, 7659-7665 (2013).

30. Craig, A., Cloarec, O., Holmes, E., Nicholson, J.K. & Lindon, J.C. Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Analytical chemistry 78, 2262-2267 (2006).

31. Sysi-Aho, M., Katajamaa, M., Yetukuri, L. & Oresic, M. Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC bioinformatics 8, 93 (2007).

32. Eriksson, L. et al. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Anal. Bioanal. Chem. 380, 419-429 (2004).

33. Yang, J., Zhao, X., Lu, X., Lin, X. & Xu, G. A data preprocessing strategy for metabolomics to reduce the mask effect in data analysis. Front. Mol. Biosci. 2, 4 (2015).

34. Leichtle, A.B. et al. Serum amino acid profiles and their alterations in colorectal cancer. Metabolomics 8, 643-653 (2012).

35. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281-4290 (2006).

36. Kirwan, J.A., Weber, R.J., Broadhurst, D.I. & Viant, M.R. Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control. Scientific data 1, 140012 (2014).

37. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-193 (2003).

38. Lee, J. et al. Quantile normalization approach for liquid chromatography-mass spectrometry-based metabolomic data from healthy human volunteers. Analytical sciences : the international journal of the Japan Society for Analytical Chemistry 28, 801-805 (2012).

39. Smilde, A.K., van der Werf, M.J., Bijlsma, S., van der Werff-van der Vat, B.J. & Jellema, R.H. Fusion of mass spectrometry-based metabolomics data. Anal. Chem. 77, 6729-6736 (2005).

40. Gullberg, J., Jonsson, P., Nordstrom, A., Sjostrom, M. & Moritz, T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Analytical biochemistry 331, 283-295 (2004).

41. Vogl, F.C. et al. Evaluation of dilution and normalization strategies to correct for urinary output in HPLC-HRTOFMS metabolomics. Analytical and bioanalytical chemistry 408, 8483-8493 (2016).

42. Keun, H.C. et al. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Anal Chim Acta 490, 265-276 (2003).

43. Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A. & Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl 1, S96-104 (2002).

44. Kultima, K. et al. Development and evaluation of normalization methods for label-free relative quantification of endogenous peptides. Mol. Cell Proteomics 8, 2285-2295 (2009).