Elsevier

Methods in Enzymology

Volume 411, 2006, Pages 370-386

[20] Data Storage and Analysis in ArrayExpress

https://doi.org/10.1016/S0076-6879(06)11020-4

Abstract

ArrayExpress is a public resource for microarray data that has two major goals: to serve as an archive providing access to microarray data supporting publications and to build a knowledge base of gene expression profiles. ArrayExpress consists of two tightly integrated databases: ArrayExpress repository, which is an archive, and ArrayExpress data warehouse, which contains reannotated data and is optimized for queries. As of December 2005, ArrayExpress contains gene expression and other microarray data from almost 35,000 hybridizations, comprising over 1200 studies, covering 70 different species. Most data are related to peer‐reviewed publications. Password‐protected access to prepublication data is provided for reviewers and authors. Data in the repository can be queried by various parameters such as species, authors, or words used in the experiment description. The data warehouse provides a wide range of queries, including ones based on gene and sample properties, and provides capabilities to retrieve data combined from different studies. The ArrayExpress resource also includes Expression Profiler (EP)—a microarray data mining, analysis, and visualization tool—and MIAMExpress—an online data submission tool. This chapter describes all major ArrayExpress components from the user perspective: how to submit to, retrieve from, and analyze data in ArrayExpress.

Introduction

Since the first genome‐wide microarray gene expression studies were published in 1997 (e.g., DeRisi et al., 1997), microarrays have become a standard technology in life sciences research. The amount of data generated in a single microarray experiment considerably exceeds that generated by any traditional technology, or by DNA sequencing. Not only do microarrays produce large amounts of data, but these data are also complex. A series of nontrivial data processing steps has to be applied to raw microarray data to obtain biologically meaningful results. It has been widely acknowledged that interpreting the results of a microarray experiment requires both raw and processed data, as well as metadata describing the biological samples and the experimental and data transformation procedures. These requirements are summarized in the MIAME guidelines (Ball et al., 2004a), which have been adopted by a growing number of scientific journals. However, publishing, maintaining, and providing access to MIAME‐compliant microarray data on an author's or a journal's web site is not a trivial task; professionally developed and maintained public repositories are more appropriate for it (Brazma et al., 2000). Storing these data centrally also allows all data to be accessed on the same web site through a standard interface. The European Bioinformatics Institute (EBI) established ArrayExpress, a MIAME‐supportive public repository for microarray data, in 2002 (Brazma 2003, Parkinson 2005, Sarkans 2005).

As the number of laboratories using microarrays increases, data submission tools improve, and journals become more insistent on submission to public repositories, the volume of data in ArrayExpress is growing rapidly. Its size has tripled during the last 12 months, and as of December 2005 the repository contains almost 35,000 hybridizations comprising over 1200 studies related to 70 species (Fig. 1). The available studies cover a wide variety of experiment types, such as gene expression related to compound treatments, disease states, organism part comparisons, or developmental studies (Fig. 2). For instance, the experiment with accession number E‐TOXM‐16 investigates whether genotoxic carcinogens, at doses known to induce liver tumours in the rat bioassay, deregulate a common set of genes in a short‐term in vivo study. Raw and normalized data are provided. The experiment uses 137 hybridizations on 126 different samples on the Affymetrix array RG_U34A and combines the experimental factors compound, dose, and time. The experiment E‐UMCU‐12 studies entry into and exit from quiescence in the yeast Saccharomyces cerevisiae, using a 9‐day glucose starvation stationary phase culture. It provides time series data for 34 time points as raw, normalized, and normalized smoothed data. Among other gene expression data sets in the database are human and mouse tissue expression data (e.g., E‐AFMX‐4, E‐AFMX‐5) and Arabidopsis thaliana development and differentiation expression data (e.g., E‐AFMX‐8). Slightly over 20% of the gene expression experiments provide time course data. Roughly a third of the experiments have been performed on the Affymetrix platform.

Although most data relate to gene expression, a number of experiments use array comparative genomic hybridization (Erickson and Spana, 2006) or DNA‐binding site identification (so‐called ChIP‐on‐chip experiments) (Negre 2006, Scacheri 2006). An example of a data set that does not measure gene expression is the ChIP‐on‐chip data for most yeast transcription factors (E‐WMIT‐1, E‐WMIT‐2, E‐WMIT‐10).

ArrayExpress is one of the three international repositories recommended by the Microarray Gene Expression Data (MGED) society (Ball et al., 2004b) for storing microarray data related to publications [the other two being GEO (Barrett 2006, Edgar 2002) and CIBEX (Ikeo et al., 2003)]. This defines the role of ArrayExpress as a primary archive and obliges it to accept all microarray data related to peer‐reviewed publications without any changes unless approved by the submitter. The second goal of ArrayExpress is to build a knowledge base of gene expression that provides easy access to high‐quality, well‐annotated data characterizing the expression profiles of all genes in different organisms under different conditions. To meet both goals, a database separate from the ArrayExpress repository, namely the ArrayExpress data warehouse, has been developed. It contains a subset of MIAME‐compliant, reannotated data and provides more powerful queries, including ones based on gene names and properties. It also allows retrieval of data combined from different studies. As of December 2005 the ArrayExpress warehouse contained about 5% of the data from the repository, but this percentage is expected to grow substantially in 2006.
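
The idea of retrieving data combined from different studies can be pictured with a small sketch. It is an illustration only, not the warehouse's actual implementation: it assumes that genes in every study have already been mapped to shared identifiers, and the function name `combine_studies`, the gene and sample names, and all numeric values are hypothetical.

```python
def combine_studies(studies):
    """Join per-study expression matrices on gene identifier (an outer join).

    `studies` is a list of (sample_names, rows) pairs, where `rows` maps a
    gene identifier to one value per sample. Genes missing from a study are
    padded with None so that every combined row has the same length.
    """
    # Union of all gene identifiers seen in any study, in a stable order.
    all_genes = sorted({gene for _, rows in studies for gene in rows})
    # Column header: the samples of each study, in the order the studies are given.
    header = [name for names, _ in studies for name in names]

    combined = {}
    for gene in all_genes:
        row = []
        for names, rows in studies:
            # A gene absent from this study contributes one None per sample.
            row.extend(rows.get(gene, [None] * len(names)))
        combined[gene] = row
    return header, combined


# Hypothetical data: two small studies sharing only "geneA".
study1 = (["study1:s1", "study1:s2"], {"geneA": [1.2, 1.5], "geneB": [0.5, 0.7]})
study2 = (["study2:s1"], {"geneA": [2.0], "geneC": [3.1]})

header, combined = combine_studies([study1, study2])
for gene, values in combined.items():
    print(gene, values)
```

Padding with `None` rather than dropping genes keeps the matrix rectangular while making missing measurements explicit; a real analysis would also have to decide whether values from different platforms and normalization procedures are comparable at all.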

Two additional tools are available to the user as part of ArrayExpress: Expression Profiler, an online microarray data analysis tool linked to the database, and MIAMExpress, a web‐based microarray data annotation and submission tool (Fig. 3).

The target user community for ArrayExpress includes three major groups.

  • Microarray experimentalists who are interested in experimental designs, array designs and protocols, and data from published experiments. This group will primarily use the repository, as well as data submission tools for submitting their own data.

  • Biologists who are interested in expression patterns of particular genes. This group is primarily served by the data warehouse through gene attribute‐based queries.

  • Biologists and bioinformaticians who are interested in genome‐wide studies. They can use the repository to load published data sets into their own analysis tools or analyze them online using Expression Profiler, and they can use the data warehouse to retrieve gene expression data matrices combined from different experiments.

The following sections describe (1) how data can be queried and retrieved from the ArrayExpress repository and data warehouse, (2) how data can be analyzed in Expression Profiler, and (3) how data can be submitted to ArrayExpress.

How to Query and Retrieve Data from the ArrayExpress Repository

Reflecting its archival role, the ArrayExpress repository organizes data by experiment, that is, a collection of hybridizations related to a particular study and often to a publication. Each experiment can be retrieved by its accession number. Array designs and protocols have their own accession numbers, which enable experimentalists to reuse arrays and protocols submitted earlier (possibly by a different submitter), thus facilitating standardization.
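
As a rough illustration of working with experiment accession numbers, the sketch below parses and groups the accessions quoted in this chapter. The pattern (the letter E, a four‐letter code, and a number) is an assumption inferred only from those examples and is not an official specification; accessions for array designs and protocols are not covered, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred only from accessions quoted in this chapter
# (e.g., E-TOXM-16, E-UMCU-12, E-AFMX-4); not an official format definition.
ACCESSION_RE = re.compile(r"E-([A-Z]{4})-(\d+)")


def parse_accession(accession):
    """Split an experiment accession into (code, number) or raise ValueError."""
    normalized = accession.strip()
    # Typeset text often uses Unicode hyphens (U+2010, U+2011); map them to "-".
    for hyphen in ("\u2010", "\u2011"):
        normalized = normalized.replace(hyphen, "-")
    match = ACCESSION_RE.fullmatch(normalized)
    if match is None:
        raise ValueError(f"not a recognized experiment accession: {accession!r}")
    return match.group(1), int(match.group(2))


def group_by_code(accessions):
    """Group accession numbers by their four-letter code."""
    groups = defaultdict(list)
    for accession in accessions:
        code, number = parse_accession(accession)
        groups[code].append(number)
    return {code: sorted(numbers) for code, numbers in groups.items()}


examples = ["E-TOXM-16", "E-UMCU-12", "E-AFMX-4", "E-AFMX-5", "E-AFMX-8", "E-WMIT-1"]
grouped = group_by_code(examples)
print(grouped)
```

Validating against a strict pattern and raising an error for anything else is a deliberate choice here: an identifier from another repository is rejected instead of being silently misread.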

The repository

How to Query Data in the ArrayExpress Data Warehouse

The ArrayExpress data warehouse is a separate database that contains a subset of reannotated data from the ArrayExpress repository. Experiments in the repository are reviewed carefully by the ArrayExpress curators and selected for the warehouse on the basis of the quality of annotation, the presence of raw and normalized data, and MIAME compliance. The array annotation is improved and updated using the Ensembl genome database if the respective array features have been mapped in Ensembl. For genomes

Data Analysis with Expression Profiler

Expression Profiler is a web‐based tool that provides access to many basic exploratory analysis and visualization modules for microarray data (Kapushesky et al., 2004). Data can be loaded into EP (http://www.ebi.ac.uk/expressionprofiler/) from ArrayExpress or from any source, including the user's own desktop PC. EP presents a graphical user interface to the most popular components of BioConductor (Gentleman 2004, Reimers 2006), in addition to providing several unique tools implemented within

How to Submit Data to ArrayExpress

Data can be submitted to ArrayExpress either online using the submission tool MIAMExpress (http://www.ebi.ac.uk/miamexpress/) or as MAGE‐ML files (Spellman et al., 2002) from external databases or applications. To use MIAMExpress, one needs only a web browser. No prior knowledge of the MIAME guidelines is required, as the tool guides the submitter through a series of web forms and provides context‐sensitive help. Large submissions can be made via tab2mage, a spreadsheet‐based upload submission system.

Future

One of the immediate future goals of ArrayExpress is to populate the data warehouse with more data from the repository. We estimate that 50% of data submitted to the repository will eventually be loaded into the warehouse. At the same time we are working to extend the functionality of the data warehouse, and new features, such as selecting genes by similar expression profiles (which is already possible in Expression Profiler), will be added. The data warehouse will be closely integrated with

Acknowledgments

EMBL, the European Commission (FELICS), International Life Sciences Institute (ILSI), and the National Institutes of Health (NIH) support ArrayExpress development.

References (29)

  • A. Brazma et al. ArrayExpress: A public repository for microarray gene expression data at the EBI. Nucleic Acids Res. (2003)
  • A. Brazma et al. One‐stop shop for microarray data. Nature (2000)
  • A.C. Culhane et al. Between‐group analysis of microarray data. Bioinformatics (2002)
  • J.L. DeRisi et al. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science (1997)