Keywords
minimum spanning tree, gene expression data, euclidean distance, Hamiltonian cycle
This article is included in the Cytoscape gateway.
Graph theory is widely used for network analysis in various fields [1]. Extracting subnetworks of various kinds is one way to identify functional modules within complex networks [2]. A tree is a subnetwork with minimal connections. Specifically, in graph theory, a tree is a graph with exactly one path between every two nodes. In other words, any connected graph without simple cycles is a tree. Given a connected graph that is not a tree, one can extract a tree from it by eliminating edges that close cycles. A spanning tree contains all the nodes of the graph and has (N-1) edges, where N is the number of nodes in the given graph. Extracting a spanning tree becomes interesting when the edges of the given graph have weights. The weight of a spanning tree is the sum of the weights of its edges. A minimum (maximum) spanning tree is a spanning tree whose weight is the smallest (largest) among all spanning trees of the graph. There may be several minimum spanning trees of the same weight; in particular, if all the edge weights of a given graph are the same, every spanning tree of that graph is a minimum spanning tree. If every edge has a distinct weight, the minimum spanning tree is unique.
In this paper, we present CySpanningTree, a Cytoscape 3 [3] app for extracting a spanning tree from a given graph. Once the user has imported a dataset, clicking the “Create spanning tree” button of the app creates a new spanning tree network in the network panel of Cytoscape. Spanning trees have long been used in applications such as constructing a minimum-cost road network between cities, as a heuristic for the travelling salesman problem (TSP), in the spanning tree network optimization protocol in networking, and in clustering gene expression data. Three of these applications are demonstrated in the use cases section.
CySpanningTree is a Java implementation of Prim’s [4] and Kruskal’s [5] algorithms for extracting a minimal spanning tree (MST), built with the Cytoscape 3 API and Java 7. The MST of a given graph might not be unique; however, within a given Cytoscape session, the tie-breaking approach for selecting edges of equal weight is deterministic. The user therefore gets the same spanning tree throughout a Cytoscape session unless the network is reloaded.
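The following is a minimal Python sketch of Kruskal’s algorithm with one possible deterministic tie-breaking rule. It is illustrative only and is not the app’s Java code: the function name `kruskal_mst`, the integer node labels and the example edge list are assumptions, and the rule used here (equal-weight edges keep their input order) is just one way to obtain repeatable output.

```python
def kruskal_mst(num_nodes, edges):
    """Return the MST edges of a connected graph.

    edges is a list of (u, v, weight) tuples with nodes labelled 0..num_nodes-1.
    """
    parent = list(range(num_nodes))  # union-find forest, one set per node

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    # sorted() is stable, so edges of equal weight keep their input order.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:        # the edge joins two components: keep it
            parent[root_u] = root_v
            mst.append((u, v, w))
    return mst


# Hypothetical graph with three edges of equal weight.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 2.0)]
tree = kruskal_mst(4, edges)
```

With three equal-weight edges forming a triangle, any two of them give a valid MST; the stable sort simply fixes which two are chosen on every run.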
This tool also has a “Create Hamiltonian cycle” button, which invokes the computation of a Hamiltonian cycle [6]. To compute this cycle, the app first finds an MST using Prim’s algorithm and then performs a pre-order traversal on it. This pre-order traversal is a modified version of the depth-first search algorithm and yields a Hamiltonian path. The last node of this path is then connected to the first node to form a cycle. Users are advised to run the Hamiltonian cycle algorithm on a fully connected graph, because consecutive nodes of the traversal are not necessarily joined by an edge in a sparser graph.
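The traversal step can be sketched as follows. This is a hedged illustration in Python, not the app’s implementation: the function name `preorder_cycle` and the example tree are hypothetical, and the tree is assumed to have been computed beforehand (for example by Prim’s algorithm).

```python
def preorder_cycle(tree_edges, root):
    """Visit the nodes of a tree in pre-order and close the walk into a cycle."""
    neighbours = {}
    for u, v in tree_edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)

    order, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours in reverse so they are visited in insertion order.
        stack.extend(reversed(neighbours.get(node, [])))
    # Connect the last node back to the first one to close the cycle.
    return order + [root]


# Hypothetical spanning tree rooted at "A".
mst_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E")]
cycle = preorder_cycle(mst_edges, "A")
```

In this example the cycle steps from D to E and from E to C, neither of which is a tree edge. Those steps exist as edges only if the input graph is fully connected, which is why a complete graph is recommended.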
Table 1 lists the complexities of the algorithms used in the app and the uniqueness of their outputs. Prim’s algorithm runs on an adjacency list representation of the graph and is implemented with a complexity of O(V^2). Kruskal’s algorithm runs on an adjacency matrix of the graph and has a complexity of O(EV^2(E+V)). The Hamiltonian cycle computation first calculates a spanning tree using Prim’s algorithm with a complexity of O(V^2) and then runs the depth-first search algorithm with a complexity of O(E + V).
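To make the O(V^2) bound for Prim’s algorithm concrete, here is a minimal array-based sketch in Python. It is an assumption-laden illustration rather than the app’s code: it uses a dense weight matrix (with `math.inf` marking missing edges) instead of the adjacency list described above, and the name `prim_mst` and the example weights are hypothetical.

```python
import math


def prim_mst(weights, root=0):
    """Array-based Prim's algorithm on a dense weight matrix, O(V^2) time."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge connecting each node to the tree
    parent = [None] * n
    best[root] = 0.0
    edges = []
    for _ in range(n):
        # Linear scan for the closest node outside the tree: O(V) per round.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u, weights[parent[u]][u]))
        # Relax the edges leaving u: another O(V) per round.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges


INF = math.inf
# Hypothetical symmetric weight matrix for a 4-node graph.
w = [
    [INF, 2, 3, INF],
    [2, INF, 1, 4],
    [3, 1, INF, 5],
    [INF, 4, 5, INF],
]
tree = prim_mst(w, root=0)
```

Each of the V rounds performs two linear scans, which gives the quadratic bound. A binary-heap variant would run in O(E log V), which is preferable for sparse graphs but offers little on fully connected ones.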
The GUI component of CySpanningTree is a tabbed panel in the control panel of Cytoscape. Cytoscape takes care of loading the input network. The CySpanningTree menu (Figure 1) is loaded into the control panel by selecting it from the Apps menu. Currently, the app runs only on connected networks: when the user tries to execute a spanning tree algorithm on a disconnected graph, an error message pops up. For weighted graphs, the user has to select the edge attribute holding the weights from the drop-down list; the default, “None”, treats all edges as having the same weight.
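A connectivity check of the kind described above can be sketched with a breadth-first search. How the app performs this check internally is not stated here, so the function `is_connected` below is a hypothetical illustration only.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if every node is reachable from the first one."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: [] for n in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


# A path A-B-C is connected; two separate edges A-B and C-D are not.
connected = is_connected("ABC", [("A", "B"), ("B", "C")])
disconnected = is_connected("ABCD", [("A", "B"), ("C", "D")])
```

A spanning tree exists only when this check succeeds; on a disconnected graph the best one can obtain is a spanning forest, with one tree per component.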
Prim’s algorithm starts from a root node, so the user is asked for one when the “Prim’s Spanning Tree” button is pressed. If the user enters a node that is not in the network, an error message is shown and the algorithm terminates.
The resultant MST or Hamiltonian cycle network has the same layout as the input network, with nodes positioned at the same locations and edges scaled down. When spanning tree subnetworks are created, the corresponding spanning edges are highlighted in the input network. In Figure 2, the input network is a fully connected graph of the capital cities of countries in the world, containing 203 cities and 20503 connections between them. The resultant networks, “Kruskal’s Spanning Tree”, “Prim’s Spanning Tree” and “Hamiltonian Cycle”, are connected graphs containing all 203 cities and only 202, 202 and 203 edges, respectively. Spanning trees are extracted as separate Cytoscape networks under the same network collection, as shown in Figure 2.
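These counts can be checked directly. A fully connected graph on N nodes has N(N-1)/2 edges, a spanning tree has N-1 edges and a Hamiltonian cycle has N edges. For N = 203:

$$\binom{203}{2} = \frac{203 \cdot 202}{2} = 20503, \qquad N - 1 = 202, \qquad N = 203.$$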
In this section, we present spanning tree results for four use cases: visualizing gene expression data, building a cost-efficient road network when all possible costs are known, finding an approximate solution to the travelling salesman problem, and connecting a 10-home village with phone lines using minimum wiring. In each scenario, the contents of the network are introduced first and the extraction of spanning trees is then demonstrated.
The expression levels of genes exposed to various environmental conditions are recorded at different times with different samples. These data are called gene expression data and are analyzed to extract the similarities between genes. Gene expression data for n genes are multi-dimensional, with each gene $g_i$ described by m expression levels. Here $g_i$ represents the ith gene and $e_{ij}$ represents the jth expression level of this ith gene.
The data have been modeled as a graph, with genes as nodes and the genetic distance between them as edges. Genetic distance measures how similar two genes are: the smaller the distance, the more similar their expression levels.
Euclidean distance between genes $g_a$ and $g_b$ = $\sqrt{\sum_{j=1}^{m} (e_{aj} - e_{bj})^2}$
For each pair of genes, this genetic distance is calculated, which gives a fully connected graph. The data set [7] was taken from the Saccharomyces Genome Database and contains expression levels of budding yeast (S. cerevisiae) for a total of 6149 genes (http://downloads.yeastgenome.org/expression/microarray/Cho_1998_PMID_9702192/). A graph of 6149 nodes in which every node is connected to every other node is difficult to visualize. A spanning tree of the gene expression data makes it possible to visualize such a large network, as shown in Figure 3.
Input network: A fully connected graph of S. cerevisiae expression data
Nodes: Genes of S. cerevisiae
Edges: Euclidean distance between genes calculated using expression levels
Output network (Figure 3): Kruskal’s spanning tree of the input gene expression data
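The construction of the input network can be sketched as below: every pair of genes receives an edge weighted by the Euclidean distance between their expression levels. This Python sketch is illustrative only; the gene names and values are hypothetical and are not taken from the yeast data set, and the function names are assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally long lists of expression levels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_distance_graph(expression):
    """Build all pairwise edges; expression maps gene name -> expression levels."""
    return [
        (g1, g2, euclidean(expression[g1], expression[g2]))
        for g1, g2 in combinations(sorted(expression), 2)
    ]


# Hypothetical expression levels (m = 2) for three made-up genes.
expression = {
    "geneA": [0.0, 0.0],
    "geneB": [3.0, 4.0],
    "geneC": [6.0, 8.0],
}
edges = complete_distance_graph(expression)
```

For n genes this produces n(n-1)/2 edges, which is close to 19 million for 6149 genes and illustrates why the fully connected graph is hard to visualize.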
Although many edges are removed from the network while creating a spanning tree, no essential information is lost [8]. A spanning tree is a better way to visualize large networks than a fully connected graph. We observed that genes with similar functionalities are closely connected in the resultant spanning tree. Many clustering algorithms have been applied to gene expression data [8,9]; we are currently working on clustering using minimum spanning trees for the next release of CySpanningTree.
This dataset [10] consists of nodes representing the capital cities of all countries in the world and edges representing the distances between them in kilometers. These distances are computed from the latitude and longitude coordinates of the cities (http://privatewww.essex.ac.uk/~ksg/data-5.html). When imported into Cytoscape, the dataset results in a fully connected graph, as the distance is calculated for each pair of capital cities. Prim’s algorithm was executed on this dataset to produce an MST network, as shown in Figure 5.
Input network: Fully connected graph of capital cities as shown in Figure 4
Nodes: Capital cities of all countries in the world
Edges: Distance between cities in kilometers
Output minimum spanning tree: Network with minimum cost such that each city is connected. Cities separated by large distances are represented with strong edges as shown in Figure 5
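One common way to turn latitude and longitude into a distance in kilometers is the haversine formula, sketched below in Python. Whether the dataset was built with this exact formula is not stated here, so this is an assumption; the function name, the Earth radius constant and the approximate city coordinates are likewise illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, used only for illustration.
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
paris_london_km = haversine_km(*paris, *london)
```

Applying such a function to every pair of capitals yields the edge weights of the fully connected input network.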
Furthermore, this solution can be used for drawing a Hamiltonian cycle, which approximates a solution to the travelling salesman problem. Drawing a Hamiltonian cycle for a smaller network is discussed in the next subsection.
The TSP is a well-known combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once and returns to the starting city. Though the problem statement looks simple, the TSP is NP-hard and its decision version is NP-complete [11]. Even though the problem is computationally difficult, a large number of heuristic solutions [12] are known, owing to its many applications [13] in planning, logistics, DNA sequencing, predicting protein functions, etc.
Pre-order traversal of a minimum spanning tree is one of the heuristic solutions for the TSP [5,14]. In this subsection, a Hamiltonian cycle is drawn from a spanning tree to show that the resultant cycle is an approximate solution to the TSP. The optimal TSP tour in Figure 9 is about 17% shorter than the Hamiltonian cycle obtained from the spanning tree in Figure 8. On executing the Hamiltonian cycle algorithm on the input network, the software creates both Prim’s spanning tree and the Hamiltonian cycle. Five nodes from the capital city network above are used for the TSP use case.
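For a network of only five nodes, the optimal tour can be found by brute force and compared with a tour obtained from a spanning tree. The Python sketch below does this on five hypothetical points in the plane, not on the five capital cities of the use case; the function names are assumptions, and `heuristic_tour` is one valid pre-order walk of an MST of these points.

```python
import math
from itertools import permutations


def tour_length(tour, dist):
    """Length of the closed tour that returns to its first node."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def optimal_tour(dist):
    """Exhaustive search over all tours starting at node 0 (small inputs only)."""
    n = len(dist)
    best = min(permutations(range(1, n)),
               key=lambda p: tour_length([0, *p], dist))
    return [0, *best]


# Hypothetical points: the corners of a 4 x 3 rectangle plus one point on its edge.
points = [(0, 0), (0, 3), (4, 3), (4, 0), (2, 0)]
dist = [[math.dist(p, q) for q in points] for p in points]

# An MST of these points has edges 0-1, 0-4, 4-3 and 3-2;
# visiting it in pre-order from node 0 can give 0, 1, 4, 3, 2.
heuristic_tour = [0, 1, 4, 3, 2]

optimal_length = tour_length(optimal_tour(dist), dist)
heuristic_length = tour_length(heuristic_tour, dist)
gap = 1 - optimal_length / heuristic_length  # fraction by which optimal is shorter
```

When the edge weights satisfy the triangle inequality, as geographic distances do, a tour built this way is never more than twice as long as the optimal one. Exhaustive search grows factorially with the number of nodes, so it is practical only for very small networks such as this one.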
This dataset consists of houses depicted as nodes, with edges representing the possible ways of wiring one house to another. The edge weights represent the distances between the houses. The task of the telephone company is to wire all houses using the least amount of telephone wiring possible.
Input network: Houses in the village depicted as a graph, as shown in Figure 10
Nodes: Houses H1 to H10
Edges: Distance between the houses
Output MST: Network which connects the houses via wires with least possible wiring. Figure 11 and Figure 12 are the spanning trees obtained using Prim’s (H1 as root node) and Kruskal’s algorithm, respectively.
In this paper, we presented the CySpanningTree app for Cytoscape 3. CySpanningTree fills an important need for many Cytoscape users and researchers in obtaining spanning trees across different types of networks. CySpanningTree makes effective use of the Cytoscape 3 API in extracting the subnetwork and creating it as a separate network. In the near future, we will explore MST-based clustering and further datasets for which spanning tree evaluation is significant.
The CySpanningTree app can be downloaded from the Cytoscape app store.
FS and SB conceived the CySpanningTree app. NG supervised the project. FS contributed to the implementation of Kruskal’s algorithm, Hamiltonian cycle and user interface of the app. SB contributed to the implementation of Prim’s algorithm. FS and SB worked on the use cases. FS and SB wrote the manuscript. NG participated in the design of the app and in the revision of the manuscript.
The authors would like to thank their professor Bharat M. Deshpande for shaping and motivating their interest in discrete mathematics, and Scooter Morris from the Cytoscape open source community for helping with the Cytoscape API to extract the subnetwork in an intuitive way.
Cytoscape session files for use cases. Cytoscape session files (*.cys) for the TSP, world network, and 10-home village use cases.
References
1. Dahlhaus E, Johnson D, Papadimitriou C, Seymour P, et al.: The Complexity of Multiterminal Cuts. SIAM Journal on Computing. 1994; 23 (4): 864-894
Competing Interests: No competing interests were disclosed.