Zhejiang U | College of Pharmaceutical Sciences
IDRB: Research Projects

Our research projects in computer-aided drug design, computational biology and bioinformatics are listed below:

  1. Comparative study of nature-derived FDA-approved drugs and Traditional Chinese Medicine (TCM) to reveal the mechanism of TCM. The mechanism of most TCM remains unclear. This study compares FDA-approved drugs and TCM from a phylogenetic perspective to characterize the distribution patterns of the two drug systems, which is expected to deepen our understanding of the mechanism of TCM.
  2. Deriving stable microarray cancer-differentiating signatures by machine learning and feature-elimination methods, and evaluating the consensus scoring of multiple random samplings and the consistency of gene ranking. The identified signatures reflect disease mechanisms and can provide indicators for disease diagnosis. My current interest lies in identifying biomarkers for breast cancer and major depression.
  3. Identifying next-generation innovative therapeutic targets for specific disease types, such as obesity, major depression and cancer. Several methods are applied collectively: A. genetic sequence similarity analysis between drug-binding domains; B. computation of the number of similar human proteins, affiliated human pathways and human tissues of a target; C. structural comparison between drug-binding domains; D. target classification based on physicochemical characteristics detected by machine learning.
  4. Leading and conducting the development of bioinformatics databases that collect information from biology, pharmacy, chemistry and related fields. We are also interested in constructing innovative software for drug discovery and bioinformatics, which involves the design and implementation of an integrated bioinformatics software system for the exploration of novel therapeutic targets and agents.
  5. Conducting biostatistical studies on the distribution of molecules with therapeutic effects, especially approved drugs and drugs in clinical trials, across all biological species, and identifying key species for ecological protection. Conducting comprehensive biostatistical studies on therapeutic targets in clinical trials, and comparing them against the targets of approved drugs. Studying correlated groups of genes by using graph theory to filter complex gene correlation networks. The genetic variations identified indicate complex inter- and intra-individual differences.
  1. Drug discovery prospect from untapped species: indications from approved natural product drugs
    Due to the extensive bioprospecting efforts of the past and technology factors, there have been questions about the drug discovery prospect from untapped species. We analyzed recent trends of approved drugs derived from previously untapped species, which show no sign of untapped drug-productive species being near extinction and suggest a high probability of deriving new drugs from new species in existing drug-productive species families and clusters. Case histories of recently approved drugs reveal useful strategies for deriving new drugs from the scaffolds and pharmacophores of the natural product leads of these untapped species. New technologies such as cryptic gene-cluster exploration may generate novel natural products with highly anticipated potential impact on drug discovery.

  2. Clustered patterns of species origins of nature-derived FDA approved and clinical-trial drugs and clues for future bioprospecting
    Many drugs are nature derived. Low drug productivity has renewed interest in natural products as drug-discovery sources. Nature-derived drugs are composed of dozens of molecular scaffolds generated by specific secondary-metabolite gene clusters in selected species. It can be hypothesized that drug-like structures probably are distributed in selective groups of species. We compared the species origins of 939 approved and 369 clinical-trial drugs with those of 119 preclinical drugs and 19,721 bioactive natural products. In contrast to the scattered distribution of bioactive natural products, these drugs are clustered into 144 of the 6,763 known species families in nature, with 80% of the approved drugs and 67% of the clinical-trial drugs concentrated in 17 and 30 drug-prolific families, respectively. Four lines of evidence from historical drug data, 13,548 marine natural products, 767 medicinal plants, and 19,721 bioactive natural products suggest that drugs are derived mostly from preexisting drug-productive families. Drug-productive clusters expand slowly by conventional technologies. The lack of drugs outside drug-productive families is not necessarily the result of under-exploration or late exploration by conventional technologies. New technologies that explore cryptic gene clusters, pathways, interspecies crosstalk, and high-throughput fermentation enable the discovery of novel natural products. The potential impact of these technologies on drug productivity and on the distribution patterns of drug-productive families is yet to be revealed.

  3. Construction of Therapeutic Target Database (TTD): a resource for facilitating target-oriented drug discovery
    Knowledge and investigation of therapeutic targets (responsible for drug efficacy) and the targeted drugs facilitate target and drug discovery and validation. Therapeutic Target Database (TTD, http://bidd.nus.edu.sg/group/ttd/ttd.asp) has been developed to provide comprehensive information about efficacy targets and the corresponding approved, clinical trial and investigative drugs. Since its last update, major improvements and updates have been made to TTD. In addition to the significant increase of data content (from 1894 targets and 5028 drugs to 2025 targets and 17,816 drugs), we added target validation information (drug potency against target, effect against disease models and effect of target knockout, knockdown or genetic variations) for 932 targets, and 841 quantitative structure activity relationship models for active compounds of 228 chemical types against 121 targets. Moreover, we added the data from our previous drug studies including 3681 multi-target agents against 108 target pairs, 116 drug combinations with their synergistic, additive, antagonistic, potentiative or reductive mechanisms, 1427 natural product-derived approved, clinical trial and pre-clinical drugs and cross-links to the clinical trial information page in the ClinicalTrials.gov database for 770 clinical trial drugs. These updates are useful for facilitating target discovery and validation, drug lead discovery and optimization, and the development of multi-target drugs and drug combinations.
    This work has been highlighted and reported by:
    • "FACULTYof1000" as "the top 2% of published articles in biology and medicine" and "a most useful resource for scientists and companies working on drug discovery and validation, drug lead discovery and optimization, and the development of multi-target drugs and drug combinations".
    • Prof. Chris Southan in his blog as "Therapeutic Target Database in PubChem".

  4. Construction of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequences
    Sequence-derived structural and physicochemical features have been extensively used for analyzing and predicting structural, functional, expression and interaction profiles of proteins and peptides. PROFEAT has been developed as a web server for computing commonly used features of proteins and peptides from amino acid sequence. To facilitate more extensive studies of proteins and peptides, numerous improvements and updates have been made to PROFEAT. We added new functions for computing descriptors of protein-protein and protein-small molecule interactions, segment descriptors for local properties of protein sequences, and topological descriptors for peptide sequences and small molecule structures. We also added new feature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, total amino acid properties and atomic-level topological descriptors) as well as for small molecules (atomic-level topological descriptors). Overall, PROFEAT computes 11 feature groups of descriptors for proteins and peptides, and a feature group of more than 400 descriptors for small molecules plus the derived features for protein-protein and protein-small molecule interactions. Our computational algorithms have been extensively tested and used in a number of published works for predicting proteins of specific structural or functional classes, protein-protein interactions, peptides of specific functions and quantitative structure activity relationships of small molecules. PROFEAT is accessible free of charge at http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.

  5. Identification of next generation innovative therapeutic targets by genetic, structural, physicochemical and system profile of successful targets
    Low target discovery rate has been linked to inadequate consideration of multiple factors that collectively contribute to druggability. These factors include sequence, structural, physicochemical, and systems profiles. Methods individually exploring each of these profiles for target identification have been developed, but they have not been collectively used. We evaluated the collective capability of these methods in identifying promising targets from 1019 research targets based on the multiple profiles of up to 348 successful targets. The collective method combining at least three profiles identified 50, 25, 10, and 4% of the 30, 84, 41, and 864 phase III, II, I, and nonclinical trial targets as promising, including eight to nine targets of positive phase III results. This method dropped 89% of the 19 discontinued clinical trial targets and 97% of the 65 targets failed in high-throughput screening or knockout studies. Collective consideration of multiple profiles demonstrated promising potential in identifying innovative targets.

  6. Analysis of mechanisms of drug combinations from interaction and network perspectives
    Understanding the molecular mechanisms underlying synergistic, potentiative and antagonistic effects of drug combinations could facilitate the discovery of novel efficacious combinations and multi-targeted agents. In this article, we describe an extensive investigation of the published literature on drug combinations for which the combination effect has been evaluated by rigorous analysis methods and for which relevant molecular interaction profiles of the drugs involved are available. Analysis of the 117 drug combinations identified reveals general and specific modes of action, and highlights the potential value of molecular interaction profiles in the discovery of novel multicomponent therapies.

  7. Homology-free prediction of functional class of proteins and peptides by Support Vector Machines (SVM)
    Protein and peptide sequences contain clues for functional prediction. A challenge is to predict the function of sequences that show low or no homology to proteins or peptides of known function. A machine learning method, support vector machines (SVM), has recently been explored for predicting the functional class of proteins and peptides from sequence-derived properties irrespective of sequence similarity, and has shown impressive performance for predicting a wide range of protein and peptide classes including certain low- and non-homologous sequences. This method serves as a new and valuable addition to complement the extensively used alignment-based, clustering-based, and structure-based functional prediction methods. This article evaluates the strategies, current progress, reported prediction performances, available software tools, and underlying difficulties in using SVM for predicting the functional class of proteins and peptides.

  8. Analysis on the trends of anticancer targets exploration and the strategies used to enhance the efficacy of drug targeting
    A number of therapeutic targets have been explored for developing anticancer drugs. Continuous efforts have been directed at the discovery of new targets as well as the improvement of therapeutic efficacy of agents directed at explored targets. There are 84 and 488 targets of marketed and investigational drugs for the treatment of cancer or cancer related illness. Analysis of these targets, particularly those of drugs in clinical trials and US patents, provides useful information and perspectives about the trends, strategies and progress in targeting key cancer-related processes and in overcoming the difficulties in developing efficacious drugs against these targets. The efficacy of anticancer drugs directed at these targets is frequently compromised by counteractive molecular interactions and network crosstalk, negative and adverse secondary effects of drugs, and undesired ADMET profiles. Multi-component therapies directed at multiple targets and improved drug targeting methods are being explored for alleviating these efficacy-reducing processes. Investigation of the modes of actions of these combinations and targeting methods offers clues to aid the development of more effective anticancer therapies.

IDRB: Databases

Our experience in database construction has led to several bioinformatics and pharmacoinformatics databases, listed below:

TTD: Therapeutic Target Database

    Database URL: https://db.idrblab.org/ttd/

    Extensive efforts have been directed at the discovery, investigation and clinical monitoring of targeted therapeutics. These efforts may be facilitated by convenient access to the genetic, proteomic, interactive and other aspects of the therapeutic targets. We therefore developed the Therapeutic Target Database (TTD) to provide information about known and explored therapeutic protein and nucleic acid targets, the targeted diseases, pathway information and the corresponding drugs directed at each of these targets. TTD is known as one of the most popular pharmaceutical databases worldwide, and includes links to relevant databases containing information about target function, sequence, 3D structure, ligand binding properties, enzyme nomenclature and drug structure, therapeutic class, and clinical development status.

    Our Publication(s) Describing This Database:

  1. Y. H. Li, C. Y. Yu, X. X. Li, P. Zhang, J. Tang, Q. X. Yang, T. T. Fu, X. Y. Zhang, X. J. Cui, G. Tu, Y. Zhang, S. Li, F. Y. Yang, Q. Sun, C. Qin, X. Zeng, Z. Chen, Y. Z. Chen*, F. Zhu*. Therapeutic target database update 2018: enriched resource for facilitating bench-to-clinic research of targeted therapeutics. Nucleic Acids Res (impact factor of the publication year: 11.561, Biology Zone 1 TOP journal). 46(D1): 1121-1127 (2018).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 0.16% in 2019.
    Highlights by Experts in Subject Area:
    • Introduced by OMICTOOLS as "useful for facilitating patient focused research, discovery and clinical investigations of the targeted therapeutics".

  2. H. Yang, C. Qin, Y. H. Li, L. Tao, J. Zhou, C. Y. Yu, F. Xu, Z. Chen, F. Zhu*, Y. Z. Chen*. Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Res (impact factor of the publication year: 9.202, Biology Zone 1 TOP journal). 44(D1): 1069-1074 (2016).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 0.66% in 2019.
    • The Percentile in Subject Area shown in InCites™ was 0.71% in 2018.
    • The Percentile in Subject Area shown in InCites™ was 0.87% in 2017.

  3. F. Zhu, Z. Shi, C. Qin, L. Tao, X. Liu, F. Xu, L. Zhang, Y. Song, X. H. Liu, J. X. Zhang, B. C. Han, P. Zhang, Y. Z. Chen*. Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res (impact factor of the publication year: 8.026, Biology Zone 1 TOP journal). 40(D1): 1128-1136 (2012).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 0.60% in 2019.
    • The Percentile in Subject Area shown in InCites™ was 0.62% in 2018.
    • The Percentile in Subject Area shown in InCites™ was 0.31% in 2017.
    Highlights by Experts in Subject Area:
    • Highlighted by "FACULTYof1000" as "the top 2% of published articles in biology and medicine" and "a most useful resource for scientists and companies working on drug discovery and validation, drug lead discovery and optimization, and the development of multi-target drugs and drug combinations".
    • Highlighted by Prof. Chris Southan in his blog as "Therapeutic Target Database in PubChem".

  4. F. Zhu, B. C. Han, P. Kumar, X. H. Liu, X. H. Ma, X. N. Wei, L. Huang, Y. F. Guo, L. Y. Han, C. J. Zheng, Y. Z. Chen*. Update of TTD: therapeutic target database. Nucleic Acids Res (impact factor of the publication year: 7.479, Biology Zone 1 TOP journal). 38(D1): 787-791 (2010).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 2.95% in 2017.

VARIDT: VARIability of Drug Transporter Database

    Database URL: https://db.idrblab.org/varidt/

    The absorption, distribution and excretion of drugs are largely determined by their transporters (DTs), the variability of which has thus attracted considerable attention. There are three aspects of variability: epigenetic regulation and genetic polymorphism, species/tissue/disease-specific DT abundances, and exogenous factors modulating DT activity. The variability data of each aspect are essential for clinical study, and collective consideration of multiple aspects is becoming essential in precision medicine. However, no database has been constructed to provide comprehensive data on all aspects of DT variability. Herein, the Variability of Drug Transporter Database (VARIDT) was introduced to provide such data. First, 177 and 146 DTs were confirmed, for the first time, as transporting approved drugs and clinical/preclinical drugs, respectively. Second, for the confirmed DTs, VARIDT comprehensively collected all aspects of their variability (23,947 DNA methylations, 7,317 noncoding RNA/histone regulations, 1,278 genetic polymorphisms, differential abundance profiles of 257 DTs in 21,781 patients/healthy individuals, expression of 245 DTs in 67 tissues of human/model organisms, and 1,225 exogenous factors altering the activity of 148 DTs), which allows mutual connection between any of the aspects. Owing to the large amount of accumulated data, VARIDT makes it possible to generalize characteristics that reveal disease etiology and optimize clinical treatment, and is freely accessible at: https://db.idrblab.org/varidt/.

    Our Publication(s) Describing This Database:

  1. J. Y. Yin, W. Sun, F. C. Li, J. J. Hong, X. X. Li, Y. Zhou, Y. J. Lu, M. Z. Liu, X. Zhang, N. Chen, X. P. Jin, J. Xue, S. Zeng*, L. S. Yu*, F. Zhu*. VARIDT 1.0: variability of drug transporter database. Nucleic Acids Res (impact factor of the publication year: 11.147, Biology Zone 1 TOP journal). doi: 10.1093/nar/gkz779 (2019).
IDRB: Software

Our experience in software and server development has led to several bioinformatics and pharmacoinformatics servers, listed below:

NOREVA: NORmalization and EVAluation of MS-based metabolomics data

    Server URL: https://idrblab.org/noreva/

    Diverse forms of unwanted signal variation in mass spectrometry-based metabolomics data adversely affect the accuracy of metabolic profiling. A variety of normalization methods have been developed to address this problem. However, their performance varies greatly and depends heavily on the nature of the studied data. Moreover, given the complexity of the actual data, it is not feasible to assess the performance of methods by a single criterion. We therefore developed NOREVA to enable performance evaluation of various normalization methods from multiple perspectives. NOREVA integrates five well-established criteria (each with a distinct underlying theory) to ensure a more comprehensive evaluation than any single criterion. It provides the most complete set of available normalization methods, with the unique features of removing overall unwanted variations based on quality control metabolites and allowing quality control sample-based correction sequentially followed by data normalization. The originality of NOREVA and the reliability of its algorithms were extensively validated by case studies on five benchmark datasets. In sum, NOREVA is distinguished by its capability of identifying the well-performing normalization method by taking multiple criteria into consideration, and can be an indispensable complement to other available tools. NOREVA can be freely accessed at http://server.idrb.cqu.edu.cn/noreva/.
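    The idea of normalizing data and then scoring the result with a criterion can be illustrated with a minimal sketch. The code below is not NOREVA's implementation: median scaling is a generic normalization chosen for illustration, and the mean coefficient of variation is an assumed stand-in for one evaluation criterion. The function names are hypothetical.

```python
from statistics import mean, median, pstdev

def median_normalize(samples):
    """Scale each sample so that its median intensity equals the median of all sample medians."""
    sample_medians = [median(s) for s in samples]
    target = median(sample_medians)
    return [[x * target / m for x in s] for s, m in zip(samples, sample_medians)]

def mean_cv(samples):
    """Mean coefficient of variation per metabolite across samples (assumed criterion; lower is better)."""
    cvs = [pstdev(values) / mean(values) for values in zip(*samples)]
    return mean(cvs)

# Three samples (rows) with the same metabolite profile but different overall signal.
raw = [
    [100.0, 200.0, 300.0],
    [200.0, 400.0, 600.0],  # twice the overall signal
    [50.0, 100.0, 150.0],   # half the overall signal
]
normalized = median_normalize(raw)
print(mean_cv(raw), mean_cv(normalized))
```

    In this toy case the samples differ only by a global scaling factor, so median scaling removes all between-sample variation. Real data would need several criteria, as the description above argues.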

    Our Publication(s) Describing This Server:

  1. B. Li, J. Tang, Q. X. Yang, S. Li, X. J. Cui, Y. H. Li, Y. Z. Chen, W. W. Xue, X. F. Li, F. Zhu*. NOREVA: normalization and evaluation of MS-based metabolomics data. Nucleic Acids Res (impact factor of the publication year: 10.162, Biology Zone 1 TOP journal). 45(W1): 162-170 (2017).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 1.27% in 2019.
    • The Percentile in Subject Area shown in InCites™ was 2.98% in 2018.
    Highlights by Experts in Subject Area:
    • Introduced by OMICTOOLS as "provided valuable guidance to the selection of suitable algorithm in metabolomics".
    • Discussed in StackExchange as "works fine" and "corrections for batches without QC options".

  2. B. Li, J. Tang, Q. X. Yang, X. J. Cui, S. Li, S. J. Chen, Q. X. Cao, W. W. Xue, N. Chen, F. Zhu*. Performance evaluation and online realization of data-driven normalization methods used in LC/MS based untargeted metabolomics analysis. Sci Rep (impact factor of the publication year: 5.228, Multidisciplinary Zone 2 journal). 6: 38881 (2016).
    ESI Highly Cited Paper:
    • The Percentile in Subject Area shown in InCites™ was 2.05% in 2019.

ANPELA: ANalysis and PErformance-assessment of the LAbel-free proteome quantification

    Server URL: https://idrblab.org/anpela/

    Label-free quantification (LFQ) with a specific and sequentially integrated workflow of acquisition technique, quantification tool and processing method has emerged as a popular technique in metaproteomic research, providing a comprehensive landscape of the adaptive response of microbes to external stimuli and their interactions with other organisms or host cells. The performance of a specific LFQ workflow is highly dependent on the studied data. Hence, it is essential to discover the most appropriate one for a specific data set. However, such discovery is challenging due to the large number of possible workflows and the multifaceted nature of the evaluation criteria. Herein, a web server ANPELA (https://idrblab.org/anpela/) was developed and validated as the first tool enabling performance assessment of the whole LFQ workflow (collective assessment by five well-established criteria with distinct underlying theories), and it enables the identification of the optimal LFQ workflow(s) by a comprehensive performance ranking. ANPELA not only automatically detects the diverse formats of data generated by all quantification tools but also provides the most complete set of processing methods among the available web servers and stand-alone tools. Systematic validation using metaproteomic benchmarks revealed ANPELA's capabilities in (1) discovering well-performing workflow(s), (2) enabling assessment from multiple perspectives and (3) validating LFQ accuracy using spiked proteins. ANPELA has a unique ability to evaluate the performance of the whole LFQ workflow and enables the discovery of the optimal LFQs by the comprehensive performance ranking of all 560 workflows. Therefore, it has great potential for applications in metaproteomic and other studies requiring LFQ techniques, as many features are shared among proteomic studies.

    Our Publication(s) Describing This Server:

  1. J. Tang, J. B. Fu, Y. X. Wang, Y. C. Luo, Q. X. Yang, B. Li, G. Tu, J. J. Hong, X. J. Cui, Y. Z. Chen, L. X. Yao, W. W. Xue, F. Zhu*. Simultaneous improvement in the precision, accuracy and robustness of label-free proteome quantification by optimizing data manipulation chains. Mol Cell Proteomics (impact factor of the publication year: 5.236, Biology Zone 2 journal). 18(8): 1683-1699 (2019).
  2. J. Tang, J. B. Fu, Y. X. Wang, B. Li, Y. H. Li, Q. X. Yang, X. J. Cui, J. J. Hong, X. F. Li, Y. Z. Chen, W. W. Xue, F. Zhu*. ANPELA: analysis and performance-assessment of the label-free quantification workflow for metaproteomic studies. Brief Bioinform (impact factor of the publication year: 9.101, Biology Zone 1 TOP journal). doi: 10.1093/bib/bby127 (2019).
CNN-T4SE: CNN-based annotation of bacterial Type IV Secretion system Effectors

    Server URL: https://idrblab.org/cnnt4se/

    The type IV bacterial secretion system (SS) is reported to be one of the most ubiquitous SSs in nature, and can induce serious conditions by secreting type IV SS effectors (T4SEs) into host cells. Recent studies mainly focus on annotating new T4SEs from the huge amount of sequencing data, and various computational tools have therefore been developed to accelerate T4SE annotation. However, these tools are reported to depend heavily on the selected methods, and their annotation performance needs to be further enhanced. Herein, a convolutional neural network (CNN) technique was used to annotate T4SEs by integrating multiple protein encoding strategies. First, the annotation accuracies of nine encoding strategies integrated with CNN were assessed and compared with those of the popular T4SE annotation tools based on an independent benchmark. Second, the false discovery rates (FDRs) of various models were systematically evaluated by (1) scanning the genome of Legionella pneumophila subsp. ATCC 33152 and (2) predicting real-world non-T4SEs validated using published experiments. Based on the above analyses, the encoding strategies (a) position-specific scoring matrix (PSSM), (b) protein secondary structure & solvent accessibility (PSSSA) and (c) one-hot encoding scheme (Onehot) were identified as well-performing when integrated with CNN. Finally, a novel strategy that collectively considers the three well-performing models (CNN-PSSM, CNN-PSSSA and CNN-Onehot) was proposed, and a new tool (CNN-T4SE, https://idrblab.org/cnnt4se/) was constructed to facilitate T4SE annotation. All in all, this study conducted a comprehensive analysis of the performance of a collection of encoding strategies when integrated with CNN, which could facilitate the suppression of T4SS in infection and limit the spread of antimicrobial resistance.
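    Two of the ideas above, one-hot encoding of a protein sequence and collective consideration of several models, can be sketched briefly. This is an illustration only: the fixed padding length, the handling of unknown residues and the majority-vote rule are assumptions, since the text does not specify how CNN-T4SE encodes sequences or combines its three models.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, length):
    """Return a length x 20 matrix; rows beyond the sequence or for unknown residues stay all-zero."""
    matrix = []
    for pos in range(length):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(sequence) and sequence[pos] in AA_INDEX:
            row[AA_INDEX[sequence[pos]]] = 1
        matrix.append(row)
    return matrix

def majority_vote(predictions):
    """Assumed combination rule: call a T4SE when more than half of the models say so."""
    return sum(predictions) * 2 > len(predictions)

encoded = one_hot_encode("AC", length=3)
# Hypothetical outputs of three models for one protein.
call = majority_vote([True, True, False])
print(encoded[0][:3], encoded[2][:3], call)
```

    A matrix of this shape is the kind of input a CNN would consume. Requiring agreement among models is one plausible way to reduce false discoveries, at the cost of missing effectors that only one model detects.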

    Our Publication(s) Describing This Server:

  1. J. J. Hong, Y. C. Luo, M. J. Mou, J. B. Fu, Y. Zhang, W. W. Xue, T. Xie, L. Tao*, Y. Lou*, F. Zhu*. Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery. Brief Bioinform (impact factor of the publication year: 9.101, Biology Zone 1 TOP journal). doi: 10.1093/bib/bbz120 (2019).
PROFEAT: Computing structural and physicochemical features of proteins and peptides

    Server URL: https://idrblab.org/profeat/

    Studies of biological, disease, and pharmacological networks are facilitated by systems-level investigations using computational tools. In particular, the network descriptors developed in other disciplines have found increasing applications in the study of protein, gene regulatory, metabolic, disease, and drug-targeted networks. Public web servers provide facilities for computing network descriptors, but many descriptors are not covered, including those used or useful for biological studies. We upgraded the PROFEAT web server (http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi) to compute up to 329 network descriptors and protein-protein interaction descriptors. PROFEAT network descriptors comprehensively describe the topological and connectivity characteristics of unweighted (uniform binding constants and molecular levels), edge-weighted (varying binding constants), node-weighted (varying molecular levels), edge-node-weighted (varying binding constants and molecular levels), and directed (oriented processes) networks. The usefulness of the network descriptors is illustrated by literature-reported studies of the biological networks derived from genome, interactome, transcriptome, metabolome, and diseasome profiles.
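    To make the notion of a network descriptor concrete, the sketch below computes two standard graph-theoretic quantities, node degree and network density, for an unweighted, undirected network. The choice of these two quantities and the function name are assumptions made for illustration; the text does not list which of its 329 descriptors PROFEAT computes or how.

```python
from collections import defaultdict

def degree_and_density(edges):
    """Return per-node degree and network density for an undirected network without self-loops."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    degrees = {node: len(adj) for node, adj in neighbors.items()}
    n = len(degrees)
    edge_count = sum(degrees.values()) // 2  # each edge contributes to two degrees
    density = 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0
    return degrees, density

# Hypothetical protein-protein interaction network: a triangle plus one pendant node.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
degrees, density = degree_and_density(edges)
print(degrees, density)
```

    Edge-weighted or node-weighted variants would replace the neighbor counts with sums of binding constants or molecular levels, in line with the network types listed above.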

    Our Publication(s) Describing This Server:

  1. H. B. Rao&, F. Zhu&, G. B. Yang, Z. R. Li*, Y. Z. Chen. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequences. Nucleic Acids Res (impact factor of the publication year: 7.836, Biology Zone 1 TOP journal). 39(W1): 385-390 (2011).
SVMProt: SVM-based Protein functional family prediction

    Server URL: https://idrblab.org/svmprot/

    Knowledge of protein function is important for biological, medical and therapeutic studies, but the function of many proteins is still unknown. There is a need for improved functional prediction methods. Our SVM-Prot web server employs a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complements similarity-based and other methods in predicting diverse classes of proteins, including distantly related proteins and homologous proteins of different functions. Since its publication in 2003, we have made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performance due to the use of more enriched training datasets and a greater variety of protein descriptors, (4) a newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that are similar in sequence to a query protein, and (5) a newly added batch submission option for supporting the classification of multiple proteins. Moreover, two more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added to facilitate collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

    Our Publication(s) Describing This Server:

  1. Y. H. Li, J. Y. Xu, L. Tao, X. F. Li, S. Li, X. Zeng, S. Y. Chen, P. Zhang, C. Qin, C. Zhang, Z. Chen, F. Zhu*, Y. Z. Chen. SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PLoS ONE (impact factor of the publication year: 3.234, Biology Zone 3 journal). 11(8): e0155290 (2016).
MMEASE: Meta-Metabolomics by Enhanced Annotation, marker Selection and Enrichment

    Server URL: https://idrblab.org/mmease/

    Large-scale and long-term metabolomic studies have attracted widespread attention in biomedical research, yet remain challenging despite recent technical progress. In particular, ineffective integration of experiments and limited capacity in metabolite annotation are known issues. Herein, we constructed an online tool, MMEASE, that enables the integration of multiple analytical experiments with enhanced metabolite annotation and enrichment analysis (https://idrblab.org/mmease/). MMEASE is unique in being capable of (1) integrating multiple analytical blocks; (2) providing enriched annotation for more than 330 thousand metabolites; and (3) conducting enrichment analysis using various categories/sub-categories. All in all, MMEASE aims to supply a comprehensive service for long-term and large-scale metabolomics, which may provide valuable guidance to current biomedical studies.

    Our Publication(s) Describing This Server: (to be available)

SSizer: Assessment and Determination of Sample Size Required for Biological Studies

    Server URL: https://idrblab.org/ssizer/

    Comparative biomedical studies typically require a large number of samples to achieve statistically significant results. A frequently encountered question is how many samples are sufficient for a particular study. This question has traditionally been assessed using statistical power, but that assessment alone may not guarantee the full and reproducible discovery of markers that truly discriminate biological groups (BMC Bioinformatics. 11: 447, 2010; Nat Rev Neurosci. 14: 365-76, 2013). Two further types of statistical index have thus been introduced to assess sample size from different perspectives, by considering diagnostic accuracy (Metabolomics. 9: 280-99, 2013) and robustness (Cancer Res. 74: 4612-21, 2014). Because these index types are complementary, a comprehensive evaluation based on all of them is necessary for a more accurate assessment. However, no such tool has been available. Herein, the online tool SSizer was developed and validated to assess whether a user-input biomedical dataset is sufficient for a given study, and three index types are provided together for the first time to achieve a comprehensive assessment. These indexes include: (I) statistical power, analyzing the level of difference between two comparative groups (Radiology. 227: 309-13, 2003), (II) overall diagnostic and classification accuracies on independent data (Metabolomics. 9: 280-99, 2013), and (III) robustness among the lists of biomarkers identified from different datasets (Cancer Res. 74: 4612-21, 2014). Moreover, a sample simulation based on user-input data is performed to expand the data and then determine the sample size required for a given study (Anal Chem. 88: 5179-88, 2016). In sum, SSizer is unique in its capacity to comprehensively evaluate whether the sample size is sufficient and to determine the required number of samples for a user-input dataset, and can therefore facilitate current biomedical studies including metabolomics, proteomics, and so on. SSizer is accessible free of charge at https://idrblab.org/ssizer/.
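
    For the power-based index (I), the textbook normal approximation for comparing two group means gives a quick feel for how sample size scales with effect size. The sketch below is an assumption-laden illustration: the formula, the function name `samples_per_group`, and the default significance level and power are standard conventions, not the procedure implemented in SSizer.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Samples needed per group for a two-sided, two-group mean comparison.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized difference between group means.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = samples_per_group(0.5)  # standardized difference of 0.5
```

    Under this approximation a standardized difference of 0.5 needs 63 samples per group at 5% significance and 80% power, and halving the effect size roughly quadruples the requirement. As the paragraph above notes, meeting such a power target does not by itself ensure accurate classification or a robust marker list.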

    Our Publication(s) Describing This Server: (to be available)

MetaFS: Performance Assessment for Biomarker Discovery in Microbiome Studies

    Server URL: https://idrblab.org/metafs/

    Metaproteomic data suffer from two unavoidable issues: high dimensionality and sparsity. Data reduction methods can identify the relevant subset of significantly differential features and reduce data redundancy, and feature selection (FS) approaches are often applied to obtain this subset. So far, a variety of FS methods have been developed for metaproteomic studies. However, because the performance of an FS method depends heavily on the characteristics of the data in a given study, a well-suited FS method must be chosen carefully to obtain reliable and reproducible results. Moreover, it is critical to evaluate the performance of each FS method against comprehensive criteria, because a single criterion is not sufficient to reflect the overall quality of an FS method. Therefore, we constructed the online tool MetaFS, which provides 13 types of FS methods and conducts a comprehensive evaluation of these methods using four widely accepted and independent criteria. Furthermore, the function and reliability of MetaFS were systematically tested and validated via two case studies. In summary, MetaFS can serve as a tool for discovering the FS method that performs well overall for selecting potential biomarkers in microbiome studies. The online tool is freely available at https://idrblab.org/metafs/.
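
    The four criteria are not named above. As one example of the kind of criterion that can be used to compare FS methods, the sketch below scores how reproducible a method's selected features are across resampled datasets, using the mean pairwise Jaccard overlap. The criterion, the function names `jaccard` and `selection_stability`, and the example protein identifiers are assumptions for illustration and are not taken from MetaFS.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of selected features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(feature_lists):
    """Mean pairwise Jaccard overlap across feature lists from repeated runs."""
    pairs = list(combinations(feature_lists, 2))
    if not pairs:
        raise ValueError("need at least two feature lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical features selected by one FS method on three resampled datasets.
runs = [
    ["prot1", "prot2", "prot3"],
    ["prot1", "prot2", "prot4"],
    ["prot1", "prot2", "prot3"],
]
stability = selection_stability(runs)
```

    A score near 1 means the method keeps selecting the same features regardless of which samples it sees, while a score near 0 means the selected set is largely driven by sampling noise. Such a score would be read alongside other criteria, such as classification accuracy, rather than on its own.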

    Our Publication(s) Describing This Server: (to be available)

IDRB: Innovative Drug Research and Bioinformatics Group

All rights are reserved by: Innovative Drug Research and Bioinformatics Group (IDRB)
College of Pharmaceutical Sciences, Zhejiang University
Hangzhou, P.R. China, 310058.
Contact number: (86 - 571)88208444

Last Update: