
Biotechnology in the mirror
Bioinformatics

The EMBL-EBI: big science in bioinformatics
 

Louisa Wood, Cath Brooksbank, Graham Cameron and Janet Thornton, European Bioinformatics Institute, Hinxton, UK

In the last five years there have been spectacular improvements in the speed, capacity, and affordability of genome sequencing. High-throughput methods and the tools developed to extract meaningful findings from the data are revolutionising biology, with impacts throughout the fields of medicine, agriculture and environmental science. These applications will enable humankind to gain a deeper understanding of human variability (www.1000genomes.org), unravel the links between genetic variation and disease (www.wtccc.org.uk), select for high yield and disease resistance in agricultural crops, and catalogue biodiversity (www.barcodinglife.org). The European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory, provides free biomolecular data resources of universal relevance to biological and medical research. Here, we present an overview of EMBL-EBI’s core data resources, which provide integrated services and new data types to a growing and diversifying user base.

Serving life science researchers worldwide
EMBL-EBI’s ‘core’ databases (Figure 1) fall roughly into two categories: those describing the molecular components of biological systems (e.g. nucleotide and protein sequences, macromolecular structures and small molecules) and those describing their ‘behaviours’ or the outcomes of those behaviours (e.g. transcription, translation, and interaction). In addition to the core databases, EMBL-EBI hosts a large number of specialist data resources. For the first time, the EBI holds data (sequence-derived and array-based) that could potentially be used to identify individuals. This has required the creation of a permanent repository, the European Genome-phenome Archive (EGA, see below), and new procedures for granting access to data, to ensure data security and compliance with consent agreements.


Figure 1. The range of EMBL-EBI’s core data resources.

Demand for access to the EBI’s comprehensive data is high and continues to grow: the EMBL-EBI website receives an average of 3.5 million web requests each day. Approximately 300,000 unique users visit the website every month, and close to a million jobs per month are run using EMBL-EBI’s web services.

From molecules to systems – a tour of EMBL-EBI data resources
Descriptions and web addresses for data resources presented in this article are listed at the end (Table 1), with review articles referenced in the text.

Nucleotide sequence and whole genomes
Ultra-high throughput sequencing technologies are leading to previously unimaginable amounts of data being deposited in the public nucleotide sequence databases.

The EMBL-EBI, in collaboration with the Wellcome Trust Sanger Institute, produces the Ensembl genome browser (Flicek et al., 2010). Ensembl contains over 50 chordate genomes, facilitating genome navigation, analysis and inter-species comparison. In 2009, EMBL-EBI extended the Ensembl system to the rest of the taxonomic tree and launched Ensembl Genomes (Kersey et al., 2010) in the form of five new sites: Ensembl Bacteria, Ensembl Protists, Ensembl Fungi, Ensembl Plants and Ensembl Metazoa.

The recently launched European Nucleotide Archive (ENA; Leinonen et al., 2010) consolidates existing major sequence resources, namely, the European Trace Archive, the EMBL Nucleotide Sequence Database (EMBL-Bank) and the Sequence Read Archive. The ENA comprises three parts: ENA-Annotation for detailed functional annotation of coding sequence; ENA-Assembly for storage of sequence assemblies; and ENA-Reads for storage of sequence trace information (both capillary trace sequences and next-generation reads).

Participants in medical or genetic research projects have typically provided consent for their data to be used in research but not for open public distribution. The European Genome–phenome Archive (EGA; www.ebi.ac.uk/Information/Brochures/pdf/EGA_May10.pdf) provides a secure archiving, processing and dissemination service that respects the original informed consent agreements while allowing data access to researchers. As of mid-2010, the EGA contains data from experiments including case-control studies, cancer sequencing and population studies, representing more than 50,000 individuals. The EGA can integrate the data with other available EMBL-EBI resources, for example by providing full genomic annotation via Ensembl for those variants that show significant association with the studied phenotype, or links to ArrayExpress for accessing expression data deposited from the same cohort members.

Gene expression and microarray data
Genome-wide gene expression assays, originally using microarrays and more recently high-throughput sequencing, can either answer specific questions (e.g. which genes are differentially expressed in healthy versus diseased liver) or provide reference data sets (e.g. by comparing gene expression patterns in different tissues, or at different developmental stages). Large-scale expression data sets can be used to answer questions unrelated to the purpose of the original study. For example, an analysis that reveals differentially expressed genes characteristic of a particular type of cancer may also reveal candidate genes for therapeutics development, or shed light on regulatory mechanisms perturbed in that form of cancer.

The ArrayExpress Archive (Parkinson et al., 2009) is an open-access, standards-compliant repository for data from high-throughput transcriptomics assays. Data from over 10,000 studies are available from this archive. EMBL-EBI’s Gene Expression Atlas (GXA; Kapushesky et al., 2010) provides a simplified interface for querying these data. Users can pose gene-centric queries, to find out under which conditions (or where in the organism) a gene of interest is differentially expressed. Alternatively, they can pose condition-centric queries, to find out which genes are differentially expressed in a particular condition or site. Both types of query can be combined to focus on particular genes and their role in a specific condition; for example, GXA makes it straightforward to search for members of the Wnt signalling pathway that are expressed in colorectal adenocarcinoma.
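The difference between the two query styles can be illustrated with a small sketch. It does not use GXA itself or reflect how GXA is implemented: the records, gene names, condition names and function names below are all made-up placeholders, and the in-memory indexes simply show how the same set of differential-expression calls can be looked up by gene, by condition, or by both.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One differential-expression call (hypothetical record layout)."""
    gene: str
    condition: str
    direction: str  # "up" or "down"


# Made-up placeholder records for illustration only; not real GXA results.
CALLS = [
    Call("GENE_A", "condition_1", "up"),
    Call("GENE_A", "condition_2", "down"),
    Call("GENE_B", "condition_1", "down"),
    Call("GENE_C", "condition_2", "up"),
]


def build_indexes(calls):
    """Index the same calls twice: once by gene, once by condition."""
    by_gene = defaultdict(list)
    by_condition = defaultdict(list)
    for call in calls:
        by_gene[call.gene].append(call)
        by_condition[call.condition].append(call)
    return by_gene, by_condition


def gene_centric(by_gene, gene):
    """Conditions under which the given gene is differentially expressed."""
    return sorted({c.condition for c in by_gene.get(gene, [])})


def condition_centric(by_condition, condition):
    """Genes that are differentially expressed in the given condition."""
    return sorted({c.gene for c in by_condition.get(condition, [])})


def combined(by_condition, condition, genes_of_interest):
    """Restrict a condition-centric query to a chosen set of genes."""
    wanted = set(genes_of_interest)
    return sorted(
        {c.gene for c in by_condition.get(condition, []) if c.gene in wanted}
    )


by_gene, by_condition = build_indexes(CALLS)
print(gene_centric(by_gene, "GENE_A"))
print(condition_centric(by_condition, "condition_1"))
print(combined(by_condition, "condition_2", ["GENE_A", "GENE_B"]))
```

In this sketch, a combined query such as the Wnt pathway example above would correspond to passing a list of pathway members as `genes_of_interest` together with the condition of interest.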

Protein sequence, families, domains and proteomics
A natural progression from completely sequenced genomes is the characterisation of the full complement of proteins that they encode. A first draft of the human proteome, comprising 20,325 protein-coding sequences, was released in September 2008.

UniProt (The UniProt Consortium, 2010) is the globally recognised ‘gold-standard’ data resource for information about proteins. The UniProt Knowledgebase provides manually curated information on well-characterised proteins (UniProtKB/Swiss-Prot), and automatically annotated information on protein sequences mostly sourced from the ENA (UniProtKB/TrEMBL).

UniProt provides full cross-linking with PRIDE (Vizcaíno et al., 2010), EMBL-EBI’s standards-compliant resource for mass spectrometry based proteomics. This allows PRIDE data to be used to annotate UniProt protein entries.

Protein families and domains are invaluable pointers that help biologists to find distantly related proteins and to predict their functions. InterPro (Hunter et al., 2009) is an integrated documentation resource for protein families, domains and functional sites. By uniting member databases using different methods and types of biological information, InterPro provides a powerful integrated diagnostic tool for protein sequence classification.

Structures
Three-dimensional structures give us mechanistic insight into how macromolecules work, and help to explain how their functions are modified by mutation or interaction with small molecules. As structural genomics efforts begin to bear fruit, efficient access to standardised ways of viewing and describing protein structures, as provided by the Protein Data Bank in Europe (PDBe; Velankar et al., 2010), is essential.

Small molecules
Understanding the roles of biologically relevant ‘small molecules’ (not directly encoded by the genome) is an important component of elucidating all the processes of life. The ChEBI database (de Matos et al., 2010) provides standardised descriptions of molecules that enable other databases to annotate their entries consistently, and bridges the gap between small molecules and the macromolecules that they interact with in living systems.

The human genome sequence provided a complete molecular ‘parts list’ for researchers interested in improving human health. A key task now is to catalogue how the gene products interact with drugs and drug-like molecules. ChEMBL (www.ebi.ac.uk/Information/Brochures/pdf/ChEMBL_May10.pdf) is a chemogenomics resource for drug-like molecules that brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

Interactions, pathways and systems
Molecular interactions provide a valuable resource for the elucidation of cellular function. IntAct provides a central, public repository of such interactions, including protein–protein, protein–small molecule and protein–nucleic acid interactions (Aranda et al., 2010).

Life on the molecular level is an intricate network of biochemical reactions and pathways. Biologists have been elucidating fragments of this network for a century, but a vast amount of the knowledge is scattered and largely inaccessible to computational investigation. Reactome (Matthews et al., 2009) is a free, online, open-source, curated pathway database, presenting information authored by expert biological researchers and cross-referenced to a wide range of other bioinformatics databases.

Strength through collaboration
All our major data resources are the products of international collaborations and interactions. We work with other data providers to ensure that our data repositories, and those of our collaborators, are comprehensive and up to date. For example:

  • The ENA is produced as part of the International Nucleotide Sequence Database Collaboration involving GenBank in the USA and the DNA Data Bank of Japan.
  • The ArrayExpress Archive imports data on a weekly basis from the Gene Expression Omnibus at the National Center for Biotechnology Information in the USA.
  • UniProt is produced by the UniProt Consortium, a collaboration between EMBL-EBI, the Swiss Institute of Bioinformatics and the Protein Information Resource.
  • The PDBe is the European partner of the worldwide Protein Data Bank (wwPDB; Berman et al., 2007), which maintains a shared repository of bio-macromolecular structure data.
  • Reactome is produced in collaboration between EMBL-EBI, the Ontario Institute for Cancer Research, New York University Medical Center and Cold Spring Harbor Laboratory.

We also actively participate in international efforts to develop data standards in bioinformatics; these facilitate data exchange, integration and reuse.

Future challenges and opportunities
The genomic era has revolutionised biomedical research, by enabling the research community to ask questions on a genome-wide scale. Already the new DNA sequencing methods are providing the technology to sequence individual genomes, to quantify expression, to study cancer progression and to measure a patient’s responses to therapy. Realising the benefits of this knowledge to health and human wellbeing will depend crucially on applying computational methods to the vast repositories of data.

Biological experiments are now generating data at rates comparable to astrophysics or particle physics experiments. In addition to its impact on research, the genomic era has necessitated new mechanisms for data sharing, access, provision and analysis. Access to data is essential if research is to be translated into wide-scale application of the opportunities described above. Biological data resources will lie at the heart of new discoveries and their applications, and Europe must build a new infrastructure for biological data to support this endeavour. EMBL-EBI is coordinating ELIXIR, a new preparatory-phase project to create a pan-European infrastructure to support data resources; bio-compute centres; data integration, software tools and services; and training and standards development (www.elixir-europe.org). We firmly believe that this infrastructure must remain rooted in the principles of open access and international collaboration that have enabled post-genomic research to progress at such an impressive pace.

Referenced EMBL-EBI resources


Table 1. Summary of EMBL-EBI core resources

References

Aranda, B. et al. (2010) The IntAct Molecular Interaction Database in 2010. Nucleic Acids Res., 38, D525–D531.
http://nar.oxfordjournals.org/cgi/content/short/38/suppl_1/D525

Berman, H. et al. (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res., 35, D301–D303.
http://nar.oxfordjournals.org/cgi/content/full/gkl971

de Matos, P. et al. (2010) Chemical Entities of Biological Interest (ChEBI): an update. Nucleic Acids Res., 38, D249–D254.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D249

Flicek, P. et al. (2010) Ensembl’s tenth year. Nucleic Acids Res., 38, D557–D562.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D557

Hunter, S. et al. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res., 37, D211–D215.
http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D211

Kapushesky, M. et al. (2010) Gene Expression Atlas at the European Bioinformatics Institute. Nucleic Acids Res., 38, D690–D698.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D690

Kersey, P.J. et al. (2010) Ensembl Genomes: Extending Ensembl across the Taxonomic Space. Nucleic Acids Res., 38, D563–D569.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D563

Leinonen, R. et al. (2010) Improvements to services at the European Nucleotide Archive. Nucleic Acids Res., 38, D39–D45.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D39

Matthews, L. et al. (2009) Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res., 37, D619–D622.
http://nar.oxfordjournals.org/cgi/content/full/gkn863?ijkey=sVeuauFiBaN9VhL&keytype=ref

Parkinson, H. et al. (2009) ArrayExpress update—from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res., 37, D868–872.
http://nar.oxfordjournals.org/cgi/content/short/37/suppl_1/D868

The UniProt Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res., 38, D142–D148.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D142

Velankar, S. et al. (2010) PDBe: Protein Data Bank in Europe. Nucleic Acids Res., 38, D308–D317.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D308

Vizcaíno, J.A. et al. (2010) The Proteomics Identifications (PRIDE) database: 2010 update. Nucleic Acids Res., 38, D736–D742.
http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_1/D736



