Big Data Inventory
The International Federation of Musculoskeletal Research Societies (IFMRS) established the IFMRS Big Data Work Group in 2013 to assist members in accessing and utilizing existing databases for musculoskeletal research.
The means to accomplish this goal include the following:
- Provide an online list of websites and how to access them.
- Assemble a list of databases categorized according to disease, tissue, cells, mouse models, and patient and population data.
- Identify educational opportunities and training for members.
- Identify ways to promote collaboration.
- Identify potential support and funding for collaborative projects.
As a first step, an inventory of websites has been generated to assist members in accessing and utilizing existing databases for musculoskeletal research. The IFMRS Big Data Website Inventory is divided into the following four categories:
- Transcriptomics: Coding
- Transcriptomics: Non-coding RNA
This inventory will continue to be modified with new information on a regular basis.
The next step in the project will be to assemble a list of databases categorized according to disease, tissue, cells, mouse models, and patient and population data. Categories may include basic research databases such as cell lines, translational research databases such as animal models, and clinical research databases, including GWAS for diseases such as osteoporosis, sarcopenia and osteoarthritis.
NOTE: The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
This is the major public resource of human epigenomic data. Data sets include stem cells and primary ex vivo tissues, representing the normal counterparts of tissues and organ systems frequently involved in human diseases. Built around next-generation sequencing technologies, the available data types include DNA methylation, histone modification, chromatin accessibility and small RNA transcripts.
This database includes human reference epigenomes and the results of their integrative and comparative analyses.
This is a web-based portal and supports the navigation of the Human Epigenome Atlas (above) and its interactive visualization, integration, comparison and analysis.
This is integrated into the UCSC Genome Browser.
This is the data repository site, allowing epigenomic data exploration, viewing and download for a diverse collection of data sets.
This is a comprehensive database for human histone modification. It incorporates >40 location-specific histone modifications in human and provides resources of histone modification regulation in multiple human cancer types. It also includes a genome-wide visual tool for viewing the histone modification data in the context of existing genomic annotations.
The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
This is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.
The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA.
This database is an archive of studies that have investigated the association between genotype and phenotype. dbGaP will serve as the NIH GWAS data repository.
GWAS Central provides a centralized compilation of summary level findings from genetic association studies, both large and small. We actively gather datasets from public domain projects, and encourage direct data submission from the community.
The Database of Genotypes and Phenotypes (dbGaP) was developed by the National Center for Biotechnology Information (a division of the National Library of Medicine of the NIH), in accordance with (http://gds.nih.gov/), to archive and share the results of studies that have investigated the association between genotype and phenotype. dbGaP serves as the NIH GWAS data repository. (Text sourced from NCBI)
Authorized access to the website housing the real data suitable for follow-up analyses of associations between genotypes and phenotypes is located at: https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login
Additional information on dbGaP can be found at: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap
Quality-controlled, manually curated, literature-derived collection of all published GWAS assaying >100,000 SNPs and all SNP-trait associations with p-values < 1.0 × 10⁻⁵. Allows searches and queries by journal, first author, trait, chromosomal region, gene, SNP, odds ratio and p-value threshold. The catalogue provides the iconic GWAS ideogram of all SNP-trait associations with p-values ≤ 5.0 × 10⁻⁸ (http://www.ebi.ac.uk/fgpt/gwas/).
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009 Jun 9;106(23):9362-7.
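The two p-value cut-offs mentioned for the GWAS catalogue (inclusion below 1.0 × 10⁻⁵, ideogram display at or below 5.0 × 10⁻⁸) can be illustrated with a small filter. This is only a sketch: the `Association` record, the `select` helper and the placeholder SNP identifiers and p-values are invented for illustration and do not come from the catalogue or its software.

```python
from dataclasses import dataclass
from typing import List, Optional

CATALOG_INCLUSION_P = 1.0e-5   # associations must be below this to be listed
GENOME_WIDE_P = 5.0e-8         # associations at or below this appear on the ideogram


@dataclass
class Association:
    """Hypothetical record for one SNP-trait association."""
    snp: str
    trait: str
    p_value: float


def select(associations: List[Association],
           trait: Optional[str] = None,
           genome_wide_only: bool = False) -> List[Association]:
    """Keep associations that meet the inclusion cut-off and optional filters."""
    kept = []
    for a in associations:
        if not a.p_value < CATALOG_INCLUSION_P:
            continue
        if genome_wide_only and not a.p_value <= GENOME_WIDE_P:
            continue
        if trait is not None and a.trait.lower() != trait.lower():
            continue
        kept.append(a)
    return kept


# Made-up placeholder data, not real catalogue entries.
records = [
    Association("rs_placeholder_1", "bone mineral density", 3e-9),
    Association("rs_placeholder_2", "bone mineral density", 4e-6),
    Association("rs_placeholder_3", "fracture", 2e-4),
    Association("rs_placeholder_4", "fracture", 5e-8),
]

listed = select(records)                          # meets the inclusion cut-off
ideogram = select(records, genome_wide_only=True)  # genome-wide significant only
```

Keeping the two thresholds as named constants makes it easy to swap in the cut-offs used by another resource.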
Interface for searching, filtering, and visualizing the results from GWAS meta-analyses of multiple musculoskeletal phenotypes. MySQL HIPAA-compliant relational database of aggregate statistics and functional links to online public sources (such as NCBI Entrez, UCSC genome browser etc.). Special focus on pleiotropy assessing common associations between aging traits.
GRASP v1.0 is a deeply extracted database of GWAS results; it contains 46.2 million SNP-phenotype associations from 1,390 GWAS. GWAS results are re-annotated with 16 annotation sources, including RNA-editing sites, lincRNAs and PTMs. The phenotypes comprise eQTLs (71.5%), metabolite QTLs (21.2%), methylation QTLs (4.4%), and diseases, biomarkers and other traits (2.8%).
Leslie R, O’Donnell CJ, Johnson AD. GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database. Bioinformatics 2014;30(12):i185-i194
GWASdb combines collections of trait/disease-associated SNPs (TASs) from current GWAS with their comprehensive functional annotations and disease classifications. The database provides the following: (i) in addition to all TASs that attained genome-wide significance (P-value < 5 × 10⁻⁸), manually curated TASs that are marginally significant (P-value < 10⁻³); (ii) extensive functional annotations and predictions for those TASs across multiple domains; (iii) manual mapping of TASs by phenotype according to Disease Ontology (DO) and Human Phenotype Ontology (HPO).
Li MJ, Wang P, Liu X, et al. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res. Jan 2012;40(Database issue):D1047-1054
Online resource for clinicians and researchers interested in the genetics of osteoporosis. Provides an updated and comprehensive list of validated associations between gene polymorphisms and osteoporosis traits, based on STREGA (STrengthening the REporting of Genetic Association Studies) criteria. Links to NCBI dbSNP and Gene resources are embedded.
Largest international collaboration of research groups worldwide studying the genetic basis of osteoporosis. GEFOS applies meta-analysis of GWAS, with replication in follow-up studies, to identify common risk gene variants for osteoporosis. The full results of the GWAS meta-analyses are released to the public domain for subsequent mining and to facilitate new investigations using the data.
Estrada K, Styrkarsdottir U, Evangelou E, et al. Genome-wide meta-analysis identifies 56 bone mineral density loci and reveals 14 loci associated with risk of fracture. Nat Genet. 2012;44(5):491-501
International collaboration aiming to provide a catalogue of common human genetic variation. All of the data generated by the project, including SNP frequencies, genotypes and haplotypes, are placed in the public domain and are available for download. The main use of the HapMap catalogue is as a reference for imputation of GWAS datasets for multiple ethnic groups around the world.
The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58. 2010.
Catalogue of human rare genetic variation from whole-genome sequencing of ~2000 individuals from 20 populations worldwide; containing close to 80 million variants. Includes SNPs and other types of structural variation. Catalogue typically used as reference for imputation of GWAS datasets.
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010, 467: 1061-1073
Includes SNPs, microsatellites, and small-scale insertions and deletions providing population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.
Human Genome Structural Variation white paper
Flexible web-based open source database designed to collect and display variants in the DNA sequence. Focus on classifying deleterious potential (i.e. a disease-causing variant or mutation), and used to diagnose and advise patients carrying a genetic disease. Patient information is usually only accessible for registered users. Contains more than 124,000 unique variants in 5175 genes from 162,000 patients.
Fokkema IF, Taschner PE, Schaafsma GC, Celli J, Laros JF, den Dunnen JT (2011). LOVD v.2.0: the next generation in gene variant databases. Hum Mutat. 2011 May;32(5):557-63.
Integrated database developed by the Study Groups of Millennium Genome Project (MGP) with emphasis on Alzheimer’s disease, Cancer, Diabetes Mellitus, Hypertension, Bronchial Asthma and Pharmacogenetics (SGMGP). Polymorphism information from JSNP and other SNPs, including those discovered in MGP, can be queried by gene, disease or proteome database.
This database integration program organizes life science data as part of the integrated database enterprise of JST. GWAS-DB promotes the permanent management of genome-wide association analysis data and supports disease gene research by sharing information among researchers. The project is the successor to the “Integrated DB Project” of MEXT and runs for three years from 2011.
GWAS-DB promotes the management of genome-wide association analysis data and supports disease gene research by sharing information among researchers. It holds databases on Human variation, HLS, SNP control, CNV control, CNV association and Re-sequencing resources.
Koike, et al., Genome-wide association database developed in the Japanese Integrated Database Project, J. Hum. Genet. (2009) 54, 543-546.
Tool for exploring annotations of the noncoding genome at variants on haplotype blocks, such as candidate regulatory SNPs at disease-associated loci. Using ENCODE experiments on gene regions, linked SNPs and small indels can be visualized along with their predicted chromatin state, their sequence conservation across mammals, and their effect on regulatory motifs.
HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. (PMID:22064851).
Tool to identify DNA features and regulatory elements in non-coding regions of the human genome using data from ENCODE and GEO. The tool annotates SNPs with known and predicted regulatory elements in the intergenic regions. Known and predicted regulatory DNA elements include regions of DNase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulate transcription.
Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, Cherry JM, Snyder M. Annotation of functional variation in personal genomes using RegulomeDB. Genome Research 2012, 22(9):1790-1797. PMID: 22955989.
GEO is the largest public repository for functional genomics data. This database contains MIAME-compliant array- and sequence-based data. Tools are provided to help users query and download experiments and curated gene expression profiles. GEO contains more than 2500 dataset entries for the MeSH term “osteoblast” and more than 600 dataset entries for the term “chondrocyte”. A wide array of data types (microarray, RNA-seq, ChIP-seq) is available for bone-related tissues and cell lines. In addition, numerous experiments evaluate knockout models and cytokine and drug treatments.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5.
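Dataset counts such as those quoted for “osteoblast” and “chondrocyte” can in principle be checked programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` service and does not contact the network. It assumes that GEO DataSets is exposed as database `gds` and that the `[MeSH Terms]` field tag is accepted there; `geo_search_url` is a hypothetical helper name, and the counts returned today will differ from those given above.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term: str, field: str = "MeSH Terms", retmax: int = 0) -> str:
    """Build an esearch URL for GEO DataSets (assumed database name: 'gds').

    With retmax=0 the response carries only the hit count, not the record ids.
    """
    params = {
        "db": "gds",
        "term": f"{term}[{field}]",
        "retmax": retmax,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


osteoblast_url = geo_search_url("osteoblast")
chondrocyte_url = geo_search_url("chondrocyte")
```

Opening either URL in a browser, or fetching it with any HTTP client, should return a small JSON document whose count field reports the number of matching entries.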
The ENCODE Consortium is an international collaboration of research groups funded by NHGRI. The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. The UCSC Genome Browser website offers numerous genomic and transcriptome data sets for download or on-site visualization and analysis. Several osteoblast datasets are currently available; however, other bone-associated lineages are poorly represented.
UCSC Genome Browser: Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006.
The Genotype-Tissue Expression project (GTEx) aims to create a comprehensive public atlas of gene expression and regulation across multiple human tissues. The resource will provide valuable insights into the mechanisms of gene regulation, aid in the interpretation of genome-wide association studies, and enable studies of expression quantitative trait loci (eQTLs), alternative splicing, and the tissue specificity of gene regulatory mechanisms. At this time there are no specific data for bone-related tissues; however, adipose and skeletal muscle are represented.
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B, Moser M, Karasik E, Gillard B, Ramsey K, Sullivan S, Bridge J, Magazine H, Syron J, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013 May 29;45(6):580-5. doi: 10.1038/ng.2653. PMID: 23715323
Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, and aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of vertebrates and other eukaryotic organisms. Ensembl functions as a full-featured resource similar to NCBI or UCSC Genome Browser, providing reference annotations, genome visualization tools and experimental data from numerous models and tissues.
Paul Flicek, M. Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, et al. Ensembl 2014. Nucleic Acids Research 2014;42(Database issue):D749-D755. doi: 10.1093/nar/gkt1196
BioMart data-mining tool: Large datasets can also be retrieved using the BioMart data-mining tool, which provides a web interface for downloading datasets using complex queries. (Text sourced from Ensembl and Wikipedia websites)
Kasprzyk A. BioMart: driving a paradigm change in biological data management. Database (Oxford). 2011 Nov 13;2011:bar049. doi: 10.1093/database/bar049. Print 2011. PubMed PMID: 22083790; PubMed Central PMCID: PMC3215098.
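Besides the web interface, BioMart servers accept queries written as small XML documents. The following sketch shows roughly what such a query could look like for two bone-related genes. It only constructs the query text and request URL and does not contact any server. The endpoint address, the dataset name `hsapiens_gene_ensembl`, and the filter and attribute names are assumptions that should be checked against the current Ensembl BioMart configuration; `build_biomart_query` and `build_request_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed address of the Ensembl BioMart web service.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string for one dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",        # tab-separated output
        "header": "1",             # include a header row
        "uniqueRows": "1",         # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_request_url(xml_query):
    """Wrap the XML query in a GET request URL for the mart service."""
    return MARTSERVICE_URL + "?" + urlencode({"query": xml_query})


# Assumed dataset, filter and attribute names; gene symbols chosen for illustration.
xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"external_gene_name": "RUNX2,SOX9"},
    attributes=["ensembl_gene_id", "external_gene_name", "chromosome_name"],
)
request_url = build_request_url(xml_query)
```

Building the XML with `xml.etree.ElementTree` rather than string concatenation keeps attribute values correctly escaped when filters contain special characters.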
The GeneCards human gene database extracts and integrates a carefully selected subset of gene related transcriptomic, genetic, proteomic, functional and disease information, from dozens of relevant sources. It provides robust user-friendly access to up-to-date knowledge. GeneCards overcomes barriers of data format heterogeneity, and uses standard nomenclature and approved gene symbols. GeneCards presents a complete summary for each gene, and provides the means to obtain a deep understanding of biology and medicine. Information is featured in 20 GeneCards sections.
A free extensible and customizable gene annotation portal. This gene-centric resource provides resources for learning about gene and protein function. Specific datasets include limited numbers of osteoarthritis models (3), osteoblast models (4), osteosarcoma models (7) and chondrocyte models (4).
Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW 3rd, Su AI (2009) BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. 10(11):R130.
FANTOM is an international research consortium established by Dr. Hayashizaki and his colleagues in 2000 to assign functional annotations to the full-length cDNAs that were collected during the Mouse Encyclopedia Project at RIKEN. FANTOM has since developed and expanded over time to encompass the fields of transcriptome analysis. The object of the project is moving steadily up the layers in the system of life, progressing thus from an understanding of the ‘elements’ – the transcripts – to an understanding of the ‘system’ – the transcriptional regulatory network, in other words the ‘system’ of an individual life form.
Website of the bioinformatics section of the Department of Genetics at the University Medical Center Groningen (UMCG), the Netherlands, providing software tools and background information for the analysis of gene expression datasets, data quality control and gene network prioritization. It contains five tools:
- Gene Network: Access functional predictions for genes based on an 80,000-sample gene co-regulation network.
- MixupMapper: a tool that uses eQTLs to predict genotypes from gene expression levels. These predicted genotypes can subsequently be used to identify sample mix-ups by comparing the predicted genotypes to the real genotypes of samples.
- Prioritizer: software for the identification of causative alleles from GWAS results. It predicts possible candidate genes from a Bayesian co-expression network on the basis of the assumption that causative genes are functionally related.
- TriTyper: software for the detection and imputation of tri-allelic SNPs (SNPs with for example indels).
- Blood eQTL Browser: for mining the results of a large-scale blood eQTL meta-analysis (both cis- and trans-eQTLs).
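The idea behind the MixupMapper entry above, predicting genotypes from expression levels and comparing them with the measured genotypes, can be sketched with a toy example. This is not the MixupMapper algorithm: the fixed expression thresholds, the mismatch count and all sample data below are simplifying assumptions made up for illustration.

```python
def predict_genotype(expression, low, high):
    """Toy predictor: map an expression level to 0, 1 or 2 allele copies."""
    if expression < low:
        return 0
    if expression < high:
        return 1
    return 2


def best_matching_sample(predicted, genotypes):
    """Return the id of the genotyped sample with the fewest mismatches."""
    def mismatches(sample_id):
        return sum(p != g for p, g in zip(predicted, genotypes[sample_id]))
    return min(sorted(genotypes), key=mismatches)


def find_mixups(expression_profiles, genotypes, thresholds):
    """Map each expression sample whose best genotype match is another sample."""
    mixups = {}
    for sample_id, levels in expression_profiles.items():
        predicted = [predict_genotype(x, low, high)
                     for x, (low, high) in zip(levels, thresholds)]
        match = best_matching_sample(predicted, genotypes)
        if match != sample_id:
            mixups[sample_id] = match
    return mixups


# Made-up data: three samples, three eQTL genes, one (low, high) cut-off pair per gene.
thresholds = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
genotypes = {"A": [0, 1, 2], "B": [2, 1, 0], "C": [1, 1, 1]}
expression_profiles = {
    "A": [0.5, 1.5, 2.5],   # predicts [0, 1, 2], consistent with A
    "B": [1.4, 1.6, 1.5],   # predicts [1, 1, 1], looks like C
    "C": [2.4, 1.2, 0.3],   # predicts [2, 1, 0], looks like B
}

mixups = find_mixups(expression_profiles, genotypes, thresholds)
```

In this toy data the expression samples labelled B and C match each other's genotypes, which is the pattern expected when two samples have been swapped.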
Transcriptomics: Non-coding RNA
miRBase is a frequently updated repository for all miRNA information. Each miRNA entry is species specific and includes old nomenclature as well as the current standard and official symbol. The genomic location, sequence, stem-loop structure, and a brief description of miRNA identification and known biological functions are provided with associated references and links to external databases. Importantly, for each mature miRNA processed from the pre-miRNA stem-loop, miRBase provides links to both validated and predicted target databases.
Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011 Jan;39(Database issue):D152-7. doi: 10.1093/nar/gkq1027. Epub 2010 Oct 30. PubMed PMID: 21037258; PubMed Central PMCID: PMC3013655.
DIANA Tools provides a robust set of search tools, software and databases for the analysis of expression regulation, miRNA regulatory elements and targets, and evaluation of ncRNAs in various diseases. Tools for miRNA target prediction include: microT/microT-CDS and TarBase. The mirPath and DIANA-mirExTra tools provide pathway analysis and expression data analysis functions that highlight ontology terms, which may be bone-associated categories.
I. S. Vlachos, N. Kostoulas, T. Vergoulis, G. Georgakilas, M. Reczko, M. Maragkakis, M. D. Paraskevopoulou, K. Prionidis, T. Dalamagas, A. G. Hatzigeorgiou. DIANA miRPath v.2.0: investigating the combinatorial effect of microRNAs in pathways. Nucleic Acids Research 2012 (Web server issue).
miRTarBase has accumulated more than fifty thousand miRNA-target interactions (MTIs), which are collected by manually surveying pertinent literature after data mining of the text systematically to filter research articles related to functional studies of miRNAs. Generally, the collected MTIs are validated experimentally by reporter assay, western blot, microarray and next-generation sequencing experiments.
Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, Jian TY, Lin FM, Chang TH, Weng SL, Liao KW, Liao IE, Liu CC, Huang HD. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014 Jan;42(Database issue):D78-85. doi: 10.1093/nar/gkt1266. Epub 2013 Dec 4. PubMed PMID: 24304892; PubMed Central PMCID: PMC3965058.
TargetScan (v6.2, June 2012) has five species-specific versions: Human, Mouse, Worm, Fly, and Fish. The Human and Mouse versions allow target searching across 10 species (Human, Mouse, Rat, Dog, Chicken, Chimpanzee, Rhesus, Cow, Opossum, and Frog). Targets are predicted based on seed match, species conservation, and surrounding RNA context. Target 3′ UTRs with predicted miRNAs are displayed graphically, with specific sequence information, conservation, and scores listed below.
Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005 Jan 14;120(1):15-20. PubMed PMID: 15652477.
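The seed-match criterion can be illustrated in a few lines. The sketch below takes nucleotides 2 to 8 of a mature miRNA as the seed and searches a 3′ UTR for its reverse complement. It is a deliberately simplified illustration and not TargetScan's method: conservation and surrounding context are ignored, the function names are hypothetical, and the short UTR string is invented. The miRNA used is the let-7a sequence.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str, start: int = 2, end: int = 8) -> str:
    """Return the mRNA site complementary to the miRNA seed (positions 2-8)."""
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]   # reverse complement


def find_seed_matches(mirna: str, utr: str) -> list:
    """Return 0-based start positions of seed-complementary sites in the UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("T", "U")
    positions = []
    idx = utr.find(site)
    while idx != -1:
        positions.append(idx)
        idx = utr.find(site, idx + 1)
    return positions


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"           # mature let-7a sequence
TOY_UTR = "AAACUACCUCGGGCUACCUCA"          # invented UTR fragment with two sites

site = seed_site(LET7A)
hits = find_seed_matches(LET7A, TOY_UTR)
```

A real prediction pipeline would follow this matching step with filters for conservation across species and for the sequence context around each site.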
The microRNA.org website is a comprehensive resource of microRNA target predictions and expression profiles. Target predictions are based on a development of the miRanda algorithm which incorporates current biological knowledge on target rules and on the use of an up-to-date compendium of mammalian microRNAs. The target sites predicted by miRanda are scored for likelihood of mRNA downregulation using mirSVR, a regression model that is trained on sequence and contextual features of the predicted miRNA::mRNA duplex. Expression profiles are derived from a comprehensive sequencing project of a large set of mammalian tissues and cell lines of normal and disease origin.
starBase is designed for decoding interaction networks of lncRNAs, miRNAs, competing endogenous RNAs (ceRNAs), RNA-binding proteins (RBPs) and mRNAs from large-scale CLIP-Seq (HITS-CLIP, PAR-CLIP, iCLIP, CLASH) data and tumor samples (14 cancer types, >6000 samples). starBase can also decipher protein-RNA and miRNA-target interactions, such as protein-lncRNA, protein-sncRNA, protein-mRNA, protein-pseudogene, miRNA-lncRNA, miRNA-mRNA, miRNA-circRNA, miRNA-pseudogene and miRNA-sncRNA interactions, and ceRNA networks from 108 CLIP-Seq (HITS-CLIP, PAR-CLIP, iCLIP, CLASH) datasets.
Li JH, Liu S, Zhou H, Qu LH, Yang JH. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42:D92-D97.
lncRNAdb is a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs).
Lncipedia.org is a novel integrated database of 32,183 human annotated lncRNA transcripts obtained from different sources. In addition to basic transcript information and structure, several statistics are calculated for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites.
Pieter-Jan Volders; Kenny Helsens; Xiaowei Wang; Bjorn Menten; Lennart Martens; Kris Gevaert; Jo Vandesompele; Pieter Mestdagh. LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Research 2012; doi: 10.1093/nar/gks915