Software Tool Article

Kvik: three-tier data exploration tools for flexible analysis of genomic data in epidemiological studies

[version 1; peer review: 2 approved with reservations]
PUBLISHED 30 Mar 2015

This article is included in the Container Virtualization in Bioinformatics collection.

Abstract

Kvik is an open-source system that we developed for explorative analysis of functional genomics data from large epidemiological studies. Creating such studies requires a significant amount of time and resources. It is therefore usual to reuse the data from one study for several research projects. Often each project requires implementing new analysis code, integration with specific knowledge bases, and specific visualizations. Existing data exploration tools do not provide all the required functionality for such multi-study data exploration. We have therefore developed the Kvik framework which makes it easy to implement specialized data exploration tools for specific projects. Applications in Kvik follow the three-tier architecture commonly used in web applications, with REST interfaces between the tiers. This makes it easy to adapt the applications to new statistical analyses, metadata, and visualizations. Kvik uses R to perform on-demand data analyses when researchers explore the data. In this note, we describe how we used Kvik to develop the Kvik Pathways application to explore gene expression data from healthy women with high and low plasma ratios of essential fatty acids using biological pathway visualizations. Researchers interact with Kvik Pathways through a web application that uses the JavaScript libraries Cytoscape.js and D3. We use Docker containers to make deployment of Kvik Pathways simple.

Keywords

Functional genomics, Epidemiological studies, Data exploration, On-demand data analysis, Open-source software, Kvik

Introduction

Visual explorative analysis is essential to an understanding of biological functions in large-scale ‘omics’ datasets. However, enabling the inclusion of ‘omics’ data in large epidemiological studies requires collecting samples from thousands of people at different biological levels over a long period of time. It is therefore usual to reuse the data for different research questions and projects. Although an existing tool may be useful for one project, no tool provides the required functionality for several different projects.

We have therefore implemented Kvik, a framework that makes it easy to develop new applications to explore different research questions and data. We have identified five requirements for such applications:

  • Interactive: The applications should provide interactive exploration of datasets through visualizations and integration with relevant information.

  • Familiar: They should use familiar visual representations to present information to researchers.

  • Simple to use: Researchers should not need to install software to explore their data through the applications.

  • Flexible: Applications should provide support for data from several study designs. This requires the framework to adapt to the statistical analyses used by the applications.

  • Lightweight: Applications and statistical analyses should be separated so that researchers can explore data without needing the computational power to run the analyses themselves.

There are several tools for exploring biological data in the context of pathways, such as VisANT [1] (available online at visant.bu.edu), VANTED [2] (available online at vanted.ipk-gatersleben.de), and enRoute [3] and Entourage [4] (both available online at caleydo.org). However, these tools do not provide the adaptability needed for exploration of multi-exposure datasets. Many existing tools place the visualization, data analysis and storage on the user’s computer, which then has to be powerful. In addition, the tools are often stand-alone applications that require users to install them and keep both application and data up to date. In this article we describe how we used Kvik to implement Kvik Pathways, a tool for exploring gene expression in the context of biological pathways. It meets the above requirements as follows:

  • Interactive: Kvik Pathways provides interactive pathway visualizations and information from the popular Kyoto Encyclopedia of Genes and Genomes (KEGG) [5] database (available online at kegg.jp).

  • Simple to use: Kvik Pathways uses HTML5 and modern JavaScript libraries to provide an interactive application that runs in any modern web browser.

  • Familiar: Kvik Pathways uses the familiar pathway representations from KEGG and graphical user interfaces found in modern web applications.

  • Flexible: It uses the R programming language for statistical analyses (r-project.org) so that researchers can tailor analyses to fit the specific research question in each project.

  • Lightweight: Kvik Pathways uses a powerful backend provided by the Kvik framework to perform statistical analyses.

Both Kvik and Kvik Pathways are open-sourced at github.com/fjukstad/kvik. We provide an online version of Kvik Pathways at kvik.cs.uit.no and a Docker image at registry.hub.docker.com/u/fjukstad/kvik to run Kvik Pathways in a local Docker instance or on a cloud service such as Amazon Web Services (aws.amazon.com) or Google Compute Engine (cloud.google.com/compute).

Methods

Kvik Pathways allows users to interactively explore a molecular dataset, such as gene expression, through a web application. It provides pathway visualizations and detailed information about genes and pathways from the KEGG databases (Figure 1). The Kvik framework provides a flexible statistics back-end where researchers can specify the analyses they want to run to generate data to be used for later visualization. For example, in Kvik Pathways we retrieve fold change for single genes every time a pathway is viewed in the application. This function is run ad-hoc on the back-end servers and generates output that is displayed in the pathways in the client’s web browser. All of these functions are implemented in a simple R script and can make use of all available libraries in R, such as Bioconductor (bioconductor.org).


Figure 1. Screenshot of the renin-angiotensin pathway (KEGG pathway id hsa04614) in Kvik Pathways.

The user has selected the gene CPA3 for further exploration.

Researchers modify this R script to, for example, select a normalization method, or to tune the false discovery rate (FDR) used to adjust the p-values that Kvik Pathways uses to highlight significantly differentially expressed genes. Since Kvik Pathways is implemented as a web application and the analyses are run ad-hoc, researchers get an updated application by simply refreshing the Kvik Pathways webpage.

Implementation

We used the Cytoscape.js (cytoscape.github.com/cytoscape.js) library to generate the interactive pathway visualizations, and D3 (d3js.org) for Document Object Model (DOM) manipulation such as generating bar charts with SVG elements. We integrate these with the popular Bootstrap front-end framework (getbootstrap.com) to provide a familiar and aesthetically pleasing user interface.

Kvik Pathways has a three-tiered architecture of independent layers (Figure 2). The browser layer consists of the web application for exploring gene expression data and biological pathways. A front-end layer provides static content such as HTML pages and stylesheets, as well as an interface to the data sources with dynamic content such as gene expression data or pathway maps to the web application. The back-end layer contains information about pathways and genes, as well as computational and storage resources to process genomic data such as the NOWAC data repository. The Kvik framework provides the components in the back-end layer.


Figure 2. The three-tiered architecture of Kvik Pathways.

In our setup the Data Engine in the back-end layer provides an interface to the NOWAC data repository stored on a secure server on our local Stallo Supercomputer (Table 1 provides the interfaces). In Kvik Pathways all gene expression data is stored on the computer that runs the Data Engine. The Data Engine runs an R session accessible over remote procedure calls (RPCs) from the browser layer using RPy2 (rpy.sourceforge.net) to interface with R.

Table 1. The REST interface to the Data Engine.

All URLs are relative to the hostname where the Data Engine server runs. On our public installation the Data Engine runs on kvik.cs.uit.no:8888. For example, use kvik.cs.uit.no:8888/genes/ to retrieve all available genes in our dataset.

URL                   Description
/fc/[genes...]        Calculate and retrieve fold-change for the specified genes
/pvalues/[genes...]   Calculate and retrieve p-values for the specified genes
/exprs/[genes...]     Get the raw gene expression values from the dataset
/genes                Get a list of all genes in the dataset
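
As a rough illustration of how a client might address the endpoints in Table 1, the Python sketch below only builds request URLs and sends nothing over the network. The helper name `data_engine_url`, the `http://` scheme, the `+` separator between gene names and the example gene symbols are assumptions made for illustration. Table 1 does not specify how several genes are encoded in the path.

```python
from urllib.parse import quote

# Host and port of the public installation as given in the text;
# the "http://" scheme is an assumption.
BASE_URL = "http://kvik.cs.uit.no:8888"

# Endpoints from Table 1 that take a list of genes.
GENE_ENDPOINTS = {"fc", "pvalues", "exprs"}


def data_engine_url(endpoint, genes=(), base=BASE_URL, sep="+"):
    """Build a Data Engine URL (hypothetical helper, not part of Kvik).

    `sep` is an assumption: Table 1 does not say how genes are joined.
    """
    if endpoint == "genes":
        return f"{base}/genes/"
    if endpoint not in GENE_ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    if not genes:
        raise ValueError("at least one gene is required")
    # Percent-encode each gene name so it is safe inside a URL path.
    path = sep.join(quote(g, safe="") for g in genes)
    return f"{base}/{endpoint}/{path}"


print(data_engine_url("genes"))
print(data_engine_url("fc", ["CPA3", "ACE"]))
```

Keeping the separator as a parameter makes it easy to adjust the sketch once the actual path encoding used by the Data Engine is known.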

To create pathway visualizations the Kvik backend retrieves and parses the KEGG Markup Language (KGML) representation and pathway image from KEGG databases through its REST API (rest.kegg.jp). This KGML representation of a pathway is an XML file that contains a list of nodes (genes, proteins or compounds) and edges (reactions or relations). Kvik parses this file and generates a JSON representation that Kvik Pathways uses to create pathway visualizations. Kvik Pathways uses the JavaScript visualization library Cytoscape.js (js.cytoscape.org) to create a pathway visualization from the list of nodes and edges and overlay the nodes on the pathway image. To reduce latency when using the KEGG REST API, we cache every request locally on our servers. We use the average fold change between the groups in the sample set to color the genes within the pathway maps. To highlight p-values, the pathway visualization shows an additional colored frame around the node. We visualize fold change values for individual samples as a bar chart in a side panel. This bar chart gives researchers a global view of the fold change in the entire dataset.
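
The KGML-to-JSON step described above can be sketched in Python as follows. This is a minimal illustration, not Kvik's own parser: the function name `kgml_to_elements`, the hand-written KGML fragment (its identifiers and coordinates are illustrative) and the output layout are assumptions. The layout follows the `data`/`position` convention that Cytoscape.js accepts for elements, but the JSON that Kvik actually produces is not specified in this article.

```python
import json
import xml.etree.ElementTree as ET

# A tiny hand-written KGML fragment (illustrative only, not fetched from KEGG).
KGML = """<pathway name="path:hsa04614" org="hsa" number="04614">
  <entry id="1" name="hsa:1359" type="gene">
    <graphics name="CPA3" type="rectangle" x="100" y="200" width="46" height="17"/>
  </entry>
  <entry id="2" name="cpd:C00873" type="compound">
    <graphics name="C00873" type="circle" x="180" y="200" width="8" height="8"/>
  </entry>
  <relation entry1="1" entry2="2" type="PCrel"/>
</pathway>"""


def kgml_to_elements(kgml_text):
    """Convert KGML text into lists of nodes and edges (assumed layout)."""
    root = ET.fromstring(kgml_text)
    nodes, edges = [], []
    for entry in root.findall("entry"):
        node = {"data": {"id": entry.get("id"),
                         "kegg": entry.get("name"),
                         "type": entry.get("type")}}
        graphics = entry.find("graphics")
        if graphics is not None:
            node["data"]["label"] = graphics.get("name")
            x, y = graphics.get("x"), graphics.get("y")
            # Coordinates let the client overlay nodes on the pathway image.
            if x is not None and y is not None:
                node["position"] = {"x": float(x), "y": float(y)}
        nodes.append(node)
    for rel in root.findall("relation"):
        edges.append({"data": {"source": rel.get("entry1"),
                               "target": rel.get("entry2"),
                               "type": rel.get("type")}})
    return {"pathway": root.get("name"), "nodes": nodes, "edges": edges}


elements = kgml_to_elements(KGML)
print(json.dumps(elements, indent=2))
```

Only `entry` and `relation` elements are handled here; a fuller converter would also cover `reaction` elements and entries without point coordinates.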

Operation

Kvik Pathways runs in all modern web browsers and does not require any third-party software.

Use case

We used Kvik Pathways to repeat the analyses in a previously published project [6] (doi: 10.1371/journal.pone.0067270) that compared gene expression in blood from healthy women with high and low plasma ratios of essential fatty acids. Gene expression differences between groups were assessed using t-tests (p-values adjusted with the Benjamini-Hochberg method). There were 184 differentially expressed genes significant at the 5% level. When this gene list was originally explored, functional information was retrieved from GeneCards and other repositories, and the list was analyzed for overlap with known pathways using MSigDB (available online at broadinstitute.org/gsea/msigdb). The researchers had to manually maintain an overview of single genes, gene networks or pathways, and gather functional information gene by gene while assessing differences in gene expression levels. With this approach, researchers are limited by manual capacity, and the results may be prone to researcher bias.
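
For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following Python sketch shows the standard step-up procedure on a handful of made-up p-values. It is an illustration of the general method only: the gene labels and p-values are invented, the t-test step is omitted, and the original analysis was carried out in R rather than with this code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up example: four hypothetical genes with raw p-values.
gene_labels = ["geneA", "geneB", "geneC", "geneD"]
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)

# Genes that stay below the 5% level after adjustment.
significant = [g for g, q in zip(gene_labels, adj) if q < 0.05]
print(list(zip(gene_labels, adj)))
print(significant)
```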

Initially, Kvik Pathways was implemented to explore gene expression data from a not yet published dataset. To use Kvik Pathways to explore the data from the analyses in [6], we only needed to make small modifications to the R script used by the Data Engine (the modified R script is found at github.com/fjukstad/kvik/blob/master/dataengine/data-engine.r). Instead of loading the unpublished dataset, we could load the dataset from [6] and reuse the functions that are accessible over RPC. Currently this script is less than 30 lines, consisting of four functions to retrieve data and a simple initialization step that reads the dataset. These functions are get(genes), genes(), fc(genes) and pvalues(genes): get(genes) retrieves all information available for the given genes, genes() returns a list of all of the genes in the dataset, fc(genes) returns the fold change for the selected genes, and pvalues(genes) returns the p-values for the given genes. After the R script in the Data Engine was updated, researchers using Kvik Pathways only had to reload a web page to get the updated application.
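
To make the shape of this interface concrete, here is a small Python analogue of three of the four functions. The real Data Engine script is written in R, so this is only a sketch under stated assumptions: the toy dataset, the group labels `high` and `low`, and the definition of fold change as the difference in mean log2 expression between the groups are all illustrative choices, not taken from the script. `pvalues(genes)` is left out because it would need a t-test implementation.

```python
from statistics import mean

# Toy dataset: log2 expression values per gene and group (values are made up).
DATASET = {
    "CPA3": {"high": [7.1, 7.3, 6.9], "low": [6.0, 6.2, 6.1]},
    "ACE": {"high": [5.0, 5.2, 5.1], "low": [5.1, 5.0, 5.3]},
}


def genes():
    """Return the names of all genes in the dataset."""
    return sorted(DATASET)


def get(selected):
    """Return everything stored for the selected genes."""
    return {g: DATASET[g] for g in selected}


def fc(selected):
    """Return the fold change per gene.

    Assumption: values are on a log2 scale, so the fold change is the
    difference between the group means.
    """
    return {g: mean(DATASET[g]["high"]) - mean(DATASET[g]["low"])
            for g in selected}


print(genes())
print(fc(["CPA3", "ACE"]))
```

Swapping in another study then amounts to changing how `DATASET` is loaded, which mirrors the small change to the initialization step described above.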

As an example of practical use of Kvik Pathways, we chose one of the significant pathways from the overlap analysis, the renin-angiotensin pathway (Supplementary table S5 in [6]). The pathway contains 17 genes, and in the pathway map we could instantly identify the two genes that drive this result. The color of the gene nodes in the pathway map indicates the fold change, and the statistical significance level is indicated by the color of the node’s frame. We use this visual image of a biological process to see how these two genes (and their expression levels) are related to other genes in that pathway, which gives a biologically more meaningful context than merely seeing the two genes on a list.
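
The visual encoding described above (fill color for fold change, frame color for significance) could be computed along the following lines. This Python sketch is an assumption throughout: the blue-white-red palette, the saturation limit of 2.0, the 0.05 threshold and the function names are illustrative and are not taken from Kvik Pathways. The keys `background-color`, `border-color` and `border-width` are style properties that Cytoscape.js understands.

```python
def fold_change_color(fc, limit=2.0):
    """Map a fold change to a blue-white-red hex color (assumed palette).

    Values at or beyond +/- `limit` are drawn fully saturated.
    """
    t = max(-1.0, min(1.0, fc / limit))
    fade = int(round(255 * (1 - abs(t))))
    if t >= 0:
        r, g, b = 255, fade, fade  # up-regulated: towards red
    else:
        r, g, b = fade, fade, 255  # down-regulated: towards blue
    return f"#{r:02x}{g:02x}{b:02x}"


def node_style(fc, p_value, alpha=0.05):
    """Return style properties for one gene node (hypothetical helper)."""
    is_significant = p_value < alpha
    return {
        "background-color": fold_change_color(fc),
        # A dark, thicker frame marks genes below the significance threshold.
        "border-color": "#000000" if is_significant else "#cccccc",
        "border-width": 3 if is_significant else 1,
    }


print(node_style(1.2, 0.003))
print(node_style(-0.1, 0.6))
```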

Summary

Kvik Pathways is an open-source system for explorative analyses of functional genomics data from epidemiological studies. It combines modern visualization libraries with a flexible back-end that uses R to perform on-demand statistical analyses and can be extended to new analyses and research projects. Epidemiological researchers have used Kvik Pathways to analyze gene expression data. Kvik Pathways is open-sourced at github.com/fjukstad/kvik and is available as a Docker image at registry.hub.docker.com/u/fjukstad/kvik.

Data availability

Data used in the use case is available in the Gene Expression Omnibus (ncbi.nlm.nih.gov/geo), under accession number GSE15289.

Software availability

Source code as at the time of publication

https://github.com/F1000Research/kvik/releases/tag/1.0

Archived source code as at the time of publication

http://dx.doi.org/10.5281/zenodo.16375

Software license

The MIT license

How to cite this article
Fjukstad B, Standahl Olsen K, Jareid M et al. Kvik: three-tier data exploration tools for flexible analysis of genomic data in epidemiological studies [version 1; peer review: 2 approved with reservations] F1000Research 2015, 4:81 (https://doi.org/10.12688/f1000research.6238.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 30 Apr 2015
Zhenjun Hu, Bioinformatics Graduate Program and Department of Biomedical Engineering, Boston University, Boston, MA, USA 
Approved with Reservations
The manuscript presents Kvik as an open-source system for explorative analysis of functional genomics data from large epidemiological studies. The authors seem have excellent ideas, but the implementation of the tool is far behind these ideas. I would like to ...
How to cite this report
Hu Z. Reviewer Report For: Kvik: three-tier data exploration tools for flexible analysis of genomic data in epidemiological studies [version 1; peer review: 2 approved with reservations]. F1000Research 2015, 4:81 (https://doi.org/10.5256/f1000research.6693.r8533)
  • Author Response 16 Jun 2015
    Bjørn Fjukstad, Department of Computer Science, UiT - The Arctic University of Norway, Tromsø, 9037, Norway
    We would first like to thank the reviewer Zhenjun Hu for his thorough feedback and comments.
    1. Since we have open-sourced the application we believe that Kvik Pathways can provide knowledge in itself,
    ...
Reviewer Report 23 Apr 2015
Paul Klemm, Department Simulation and Graphics, Faculty of Computer Science, Otto-von-Guericke University Magdeburg,  Magdeburg, Germany 
Approved with Reservations
The work presented by Fjukstad et al. pursues in pushing the notion of open science in epidemiology. It describes Kvik, a web-based tool for analyzing genomic pathways. I really like the ideas behind it and value the detailed implementation section ...
How to cite this report
Klemm P. Reviewer Report For: Kvik: three-tier data exploration tools for flexible analysis of genomic data in epidemiological studies [version 1; peer review: 2 approved with reservations]. F1000Research 2015, 4:81 (https://doi.org/10.5256/f1000research.6693.r8165)
  • Author Response 16 Jun 2015
    Bjørn Fjukstad, Department of Computer Science, UiT - The Arctic University of Norway, Tromsø, 9037, Norway
    We would first like to thank the reviewer Paul Klemm for his thorough feedback and comments.

    Difference between the NIK paper and the Application Note

    When we wrote the NIK paper, Kvik ...