Software Manual

     



To start Machaon CVE, double-click the shortcut icon on your desktop. Make sure that the Java VM is installed and running properly (please use J2SE version 1.4.2).

The software is implemented as a multi-window application, which allows you to work with different datasets, algorithms, and results simultaneously. The Main Window (panel) contains the Menu and indicates the current working dataset.

 


To load a sample dataset, select File->Open from the Menu. Machaon CVE supports tab-delimited text and XML-based files (see the File Format section). Before the dataset is loaded, the Parameter Window appears so that you can set the Row Contain parameter (Objects or Features).

To demonstrate the features of the program, the leukemia dataset from Golub et al. is used. After the dataset is loaded, the Data Set Window is displayed.


The Data Set Window contains the expression Table and the Result Tree, which lists all clustering and validation results obtained so far. The Table shows the gene expression values and, if a clustering has been obtained, indicates the partitioning into Cluster Sets (the rightmost column(s) in the Table).


There are several datasets (ready to use in Machaon) available for download:

* The original data and experimental methods are available at http://www.genome.wi.mit.edu/MPR

** The original data and experimental methods are available at http://genome-www.stanford.edu


Machaon reads tab-delimited text and XML-based files. The tab-delimited text format is described below. Such text files can be created and exported from any standard spreadsheet program, such as Microsoft Excel.

Tab-delimited

All files should have the following format:

Number of rows    Number of columns    S1 or G1    S2 or G2    ...    Si or Gj
S1 or G1          NC1                  V11         V12         ...    V1j         C1
S2 or G2          NC2                  V21         V22         ...    V2j         C2
...               ...                  ...         ...         ...    ...         ...
Si or Gj          NCk                  Vi1         Vi2         ...    Vij         Cn
C1 C2 ... Cn

The "Number of rows" and "Number of columns" entries give the numbers of rows and columns in the expression table. The terms Si, 1 ≤ i ≤ Ns, are the names or descriptions of the experimental samples, conditions, strains, or specimens (the number of samples in the dataset equals Ns); Gj, 1 ≤ j ≤ Ng, are the names or descriptions of the genes (the number of genes in the dataset equals Ng); NCk, 1 ≤ k ≤ Nnc, are the names or descriptions of the natural classes (the number of natural classes in the dataset equals Nnc). The terms Vij represent the data values for the ith sample/experiment and the jth gene. The terms Cn, 1 ≤ n ≤ Nc, are the names of the clusters to which the sample/gene is assigned (the number of clusters in the dataset equals Nc). Bold entries indicate necessary records. The program can read files that already contain cluster numbers (datasets that have already been clustered by other software tools). Thus, the user can apply the validation techniques to data files produced by other systems.

Here are two examples derived from the leukemia data:

Example 1

5            3      U22376    X59417    U05259
sample_12    ALL    551       846       2504      0
sample_25    ALL    1872      3878      5070      1
sample_34    AML    1126      782       711       1
sample_35    AML    880       490       654       0
sample_36    AML    473       1648      -14       1

 

Example 2

3         4    sample_12    sample_25    sample_34    sample_35
U22376    -    551          1872         1126         880          0
X59417    -    846          3878         782          490          1
U05259    -    2504         5070         711          654          0
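The layout above can be read with a few lines of code. The following is a minimal sketch in Python, not Machaon's own reader: the function name `parse_machaon_table` and the returned dictionary keys are hypothetical, and it assumes that every data row carries a natural-class field and that any field after the last value is a cluster label.

```python
def parse_machaon_table(text):
    """Parse the tab-delimited layout described above into column names and rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [f.strip() for f in lines[0].split("\t")]
    n_rows, n_cols = int(header[0]), int(header[1])
    col_names = header[2:2 + n_cols]
    rows = []
    for ln in lines[1:1 + n_rows]:
        fields = [f.strip() for f in ln.split("\t")]
        values = [float(v) for v in fields[2:2 + n_cols]]
        # Assumption: a field after the last value, if present, is the cluster label.
        cluster = fields[2 + n_cols] if len(fields) > 2 + n_cols else None
        rows.append({"name": fields[0], "natural_class": fields[1],
                     "values": values, "cluster": cluster})
    return col_names, rows


# Example 1 from above, written with explicit tab separators.
EXAMPLE_1 = "\n".join([
    "5\t3\tU22376\tX59417\tU05259",
    "sample_12\tALL\t551\t846\t2504\t0",
    "sample_25\tALL\t1872\t3878\t5070\t1",
    "sample_34\tAML\t1126\t782\t711\t1",
    "sample_35\tAML\t880\t490\t654\t0",
    "sample_36\tAML\t473\t1648\t-14\t1",
])

col_names, rows = parse_machaon_table(EXAMPLE_1)
print(col_names)
print(rows[0])
```

Checking the declared row and column counts against the actual content in this way is a quick test that a file exported from a spreadsheet matches the expected layout before it is opened in Machaon.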

 

XML-based

 

The description of the XML-based format may be found here.


Data Transformation

Currently, two types of data transformation are available: Log Normalization and Row Normalization (which normalizes the intensities in a given table to mean zero and variance 1 across all genes). The transformations are offered as a convenience to the user.

To apply a data transformation to the current dataset, simply select the Menu item Transformation -> Row Normalization or Transformation -> Log Normalization. A Data Set Window with the transformed dataset will appear.
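To make the two transformations concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Machaon's code: the function names are hypothetical, the row normalization uses the population variance, and the log transformation uses base 2 (the manual does not state which variance estimate or logarithm base the program uses).

```python
import math


def row_normalize(table):
    """Scale every row to mean 0 and variance 1 (population variance assumed)."""
    result = []
    for row in table:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has zero spread; map it to zeros instead of dividing by zero.
        result.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return result


def log_transform(table):
    """Take the base-2 logarithm of every value (base 2 is an assumption)."""
    # Only defined for positive values; a value such as -14 raises ValueError.
    return [[math.log2(v) for v in row] for row in table]


normalized = row_normalize([[551.0, 846.0, 2504.0], [1872.0, 3878.0, 5070.0]])
print(normalized[0])
print(log_transform([[8.0, 16.0]]))
```

Note that expression tables may contain zero or negative intensities, for which a logarithm is undefined, so such values need to be handled before a log transformation is applied.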


Machaon can apply the clustering algorithms to both the rows and the columns of the table.

Hierarchical Clustering

To start the Hierarchical Clustering calculation, simply select the Menu item Clustering -> Hierarchical. The Parameter Window will appear, where you can set parameters such as:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Intercluster distance (preset is Complete linkage); see Types of Intercluster distances

  • Type of Data transformation (preset is None, i.e. no transformation);

  • Number of sought clusters (N) (preset is two clusters);

  • What to Cluster (Objects or Features).

 

To start the calculation process simply click Next. As soon as the calculation has been completed, a new entry is added to the Results Tree. The result of clustering is also indicated in the expression Table.

To see the Hierarchical Dendrogram, select the desired hierarchical clustering in the Result Tree (left mouse button) and choose the menu item View->Dendrogram.
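For readers unfamiliar with the preset options, the following Python sketch shows what agglomerative clustering with Euclidean distance and complete linkage does when asked for N clusters. It is a naive illustration, not Machaon's implementation; the function name `complete_linkage` is hypothetical.

```python
import math


def complete_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the largest Euclidean distance
    between any pair of their members (complete linkage).
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
print(complete_linkage(points, 2))  # [[0, 1], [2, 3, 4]]
```

The order in which clusters are merged is what the dendrogram displays; cutting the merge sequence at N clusters gives the partition shown in the expression Table.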

K-Means Clustering

To start the K-Means Clustering calculation, simply select the Menu item Clustering -> K-Means. The Parameter Window will appear, where you can set parameters such as:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Number of sought clusters (K) (preset is two clusters);

  • Type of Data transformation (preset is None, i.e. no transformation);

  • Type of Initialization of K (preset is First K elements);

  • What to Cluster (Objects or Features).

To start the calculation process simply click Next. As soon as the calculation has been completed, a new entry is added to the Results Tree. The result of clustering is also indicated in the expression Table.
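The effect of the preset options (Euclidean distance, initialization from the first K elements) can be illustrated with a short Python sketch of the standard K-Means iteration. This is an assumption about how those options behave, not Machaon's code; the function name `kmeans_first_k` and the `max_iter` parameter are hypothetical.

```python
import math


def kmeans_first_k(points, k, max_iter=100):
    """Standard K-Means with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans_first_k([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
print(labels)     # [0, 1, 0, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```

Because the result depends on the initial centroids, the order of the rows in the dataset can influence the outcome when the First K elements initialization is used.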

K-Medoids Clustering

To start the K-Medoids Clustering calculation, simply select the Menu item Clustering -> K-Medoids. The Parameter Window will appear, where you can set parameters such as:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Number of sought clusters (K) (preset is two clusters);

  • Type of Data transformation (preset is None, i.e. no transformation);

  • What to Cluster (Objects or Features).

To start the calculation process simply click Next. As soon as the calculation has been completed, a new entry is added to the Results Tree. The result of clustering is also indicated in the expression Table.

Weak Clustering

To start the Weak Clustering calculation, simply select the Menu item Clustering -> Weak Clustering. The Parameter Window will appear, where you can set parameters such as:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Number of sought clusters (K) (preset is two clusters);

  • Type of Data transformation (preset is None, i.e. no transformation);

  • What to Cluster (Objects or Features).

To start the calculation process simply click Next. As soon as the calculation has been completed, a new entry is added to the Results Tree. The result of clustering is also indicated in the expression Table.


Machaon contains support for ensemble clustering, which involves combining a collection of multiple "base" clusterings to produce an improved partition of a data set. The ensemble clustering process involves two stages:

  1. Generation - The creation of the collection of "base" clustering.
  2. Integration - The aggregation of the collection of clusterings, using a "meta-clustering" procedure, to produce the output solution.

To start the Ensemble Clustering procedure, simply select the Menu item Clustering -> Ensemble Clustering. The first Parameter Window will appear, where you can set the basic ensemble parameters:

  • Generation Method:  The choice of method used to produce a diverse set of base clusterings.

  • Meta Algorithm:  The choice of "meta-clustering" algorithm used to aggregate the base clusterings.

  • Number of sought clusters (N) for the output solution.

 

After you select the basic parameters, a second Parameter Window will appear for the parameters of the generation procedure. The list of available parameters depends on the generation method chosen previously.

To start the clustering process simply click Next. As soon as the calculation has been completed, a new entry is added to the Results Tree. The result of clustering is also indicated in the expression Table.
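The integration stage can be pictured with one widely used scheme: build a co-association matrix (the fraction of base clusterings in which two objects share a cluster) and merge the most similar groups until N clusters remain. The Python sketch below is only an illustration of that idea; the manual does not say which meta-algorithms Machaon offers, and the function names are hypothetical.

```python
def co_association(base_labelings):
    """Fraction of base clusterings in which objects i and j share a cluster."""
    n = len(base_labelings[0])
    m = len(base_labelings)
    return [
        [sum(lab[i] == lab[j] for lab in base_labelings) / m for j in range(n)]
        for i in range(n)
    ]


def integrate(base_labelings, n_clusters):
    """Average-linkage merging on the co-association matrix (assumed scheme)."""
    co = co_association(base_labelings)
    clusters = [[i] for i in range(len(co))]

    def similarity(a, b):
        return sum(co[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of groups with the highest average co-association.
        _, a, b = min(
            (-similarity(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Three hypothetical base clusterings of four objects.
base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
print(integrate(base, 2))  # [[0, 1], [2, 3]]
```

Here the third base clustering disagrees with the other two about object 1, but the aggregated solution follows the majority.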


Validation

To apply any validation technique, it is necessary to select the Cluster Set first and then choose the validation method from the Menu.

C-index

To start the C-index calculation for the current Cluster Set, simply select the Menu item Validation -> C-index. The Parameter Window will appear, where you can set the C-index parameters:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Type of Data transformation (preset is None, i.e. no transformation).

 

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. Because of its high computational complexity, the calculation of the C-index for large datasets can be very time-consuming. See How to interpret the results.
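The cost mentioned above comes from the fact that the C-index needs the distances between all pairs of elements. The Python sketch below follows the usual textbook definition with Euclidean distance; it is not taken from Machaon's source, and the function name `c_index` is hypothetical.

```python
import math
from itertools import combinations


def c_index(points, labels):
    """C-index: (S - S_min) / (S_max - S_min); values near 0 indicate compact clusters."""
    all_dists, within = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        all_dists.append(d)
        if labels[i] == labels[j]:
            within.append(d)
    n_within = len(within)
    if n_within == 0:
        raise ValueError("no pair of elements shares a cluster")
    all_dists.sort()
    s = sum(within)                       # sum of within-cluster distances
    s_min = sum(all_dists[:n_within])     # the same number of smallest distances
    s_max = sum(all_dists[-n_within:])    # the same number of largest distances
    if s_max == s_min:
        raise ValueError("all distances are equal; the index is undefined")
    return (s - s_min) / (s_max - s_min)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(c_index(points, [0, 0, 1, 1]))  # 0.0 for the natural grouping
print(c_index(points, [0, 1, 0, 1]))
```

For n elements there are n(n-1)/2 pairwise distances to compute and sort, which is why large datasets take noticeably longer.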

Davies-Bouldin index

To start the Davies-Bouldin index calculation for the current Cluster Set, simply select the Menu item Validation -> Davis-Bouldin index. The Parameter Window will appear, where you can set the Davies-Bouldin index parameters.

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
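As background, the Davies-Bouldin index compares the scatter inside clusters with the separation between cluster centres; lower values indicate a better partition. The Python sketch below uses the common definition with Euclidean distance and centroid-based scatter. It is an illustration only, and the function name `davies_bouldin` is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    ids = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in ids:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Scatter: average distance of the members to their centroid.
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    total = 0.0
    for i in ids:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in ids if j != i
        )
    return total / len(ids)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # 0.2
```

The sketch requires at least two clusters with distinct centroids.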

Dunn's index

To start the Dunn's index calculation for the current Cluster Set, simply select the Menu item Validation -> Dunn index. The Parameter Window will appear, where you can set the Dunn's index parameters.

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
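As background, Dunn's index in its classical form divides the smallest distance between elements of different clusters by the largest cluster diameter; higher values indicate compact, well-separated clusters. The following Python sketch assumes Euclidean distance and that classical form (variants with other inter-cluster distances exist); the function name `dunn_index` is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest within-cluster distance."""
    min_between = math.inf
    max_within = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    if max_within == 0.0:
        raise ValueError("every cluster has zero diameter; the index is undefined")
    return min_between / max_within


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(dunn_index(points, [0, 0, 1, 1]))  # 5.0
```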

Goodman-Kruskal index

To start the Goodman-Kruskal index calculation for the current Cluster Set, simply select the Menu item Validation -> Goodman-Kruskal index. The Parameter Window will appear, where you can set the Goodman-Kruskal index parameters:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Type of Data transformation (preset is None, i.e. no transformation).

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. Because of its high computational complexity, the calculation of the Goodman-Kruskal index for large datasets can be very time-consuming. See How to interpret the results.
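The high cost arises because the index compares every within-cluster distance with every between-cluster distance. A minimal Python sketch of the usual definition, gamma = (S+ - S-) / (S+ + S-), is shown below; it assumes Euclidean distance, is not Machaon's implementation, and uses the hypothetical function name `goodman_kruskal`.

```python
import math
from itertools import combinations


def goodman_kruskal(points, labels):
    """Gamma index: +1 when every within-cluster distance is below every between-cluster one."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    # Concordant: a within-cluster distance smaller than a between-cluster distance.
    concordant = sum(1 for w in within for b in between if w < b)
    discordant = sum(1 for w in within for b in between if w > b)
    if concordant + discordant == 0:
        raise ValueError("no comparable distance pairs; the index is undefined")
    return (concordant - discordant) / (concordant + discordant)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(goodman_kruskal(points, [0, 0, 1, 1]))  # 1.0
```

The number of comparisons grows with the product of the within-cluster and between-cluster pair counts, which is roughly the fourth power of the number of elements.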

Silhouette index

To start the Silhouette index calculation for the current Cluster Set, simply select the Menu item Validation -> Silhouette. The Parameter Window will appear, where you can set the Silhouette index parameters:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • Type of Data transformation (preset is None, i.e. no transformation).

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
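As background, the silhouette of an element compares its average distance to its own cluster (a) with its average distance to the nearest other cluster (b), giving (b - a) / max(a, b). The Python sketch below averages this value over all elements, assuming Euclidean distance and a score of 0 for single-element clusters; the function name `silhouette` is hypothetical and the code is not Machaon's.

```python
import math


def silhouette(points, labels):
    """Average silhouette width over all elements (requires at least two clusters)."""
    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(points):

        def mean_dist(cluster):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == cluster and j != i]
            return sum(ds) / len(ds) if ds else None

        a = mean_dist(labels[i])
        if a is None:  # single-element cluster: silhouette is taken as 0
            scores.append(0.0)
            continue
        b = min(mean_dist(c) for c in cluster_ids if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(silhouette(points, [0, 0, 1, 1]))
print(silhouette(points, [0, 1, 0, 1]))
```

Values close to 1 indicate that elements sit well inside their clusters, while negative values suggest that elements are closer to another cluster than to their own.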

Isolation index

To start the Isolation index calculation for the current Cluster Set, simply select the Menu item Validation -> Isolation. The Parameter Window will appear, where you can set the Isolation index parameters:

  • Metrics (preset is Euclidean distance); see Types of Metrics

  • NeighbourhoodSize: The proportion of the total size of the dataset to use as the number of nearest neighbours (e.g. a value of 0.1 on a dataset containing 100 elements will use 10 nearest neighbours).

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
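The NeighbourhoodSize arithmetic can be written out as follows. This Python sketch is only an illustration: the function name `neighbourhood_count` is hypothetical, and rounding to the nearest integer with a minimum of one neighbour is an assumption, since the manual does not state how fractional results are handled.

```python
def neighbourhood_count(proportion, n_elements):
    """Number of nearest neighbours for a given NeighbourhoodSize proportion."""
    # Assumption: round to the nearest integer and use at least one neighbour.
    return max(1, round(proportion * n_elements))


print(neighbourhood_count(0.1, 100))  # 10, as in the example above
```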

Jaccard index

To start the Jaccard index calculation for the current Cluster Set, simply select the Menu item Validation -> Jaccard index. The Parameter Window will appear to indicate that no parameters are required for this procedure.

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
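As background, the Jaccard index is a pair-counting measure of agreement between two partitions. The Python sketch below compares cluster labels with natural class labels, which is an assumption about what Machaon compares; the function name `jaccard_index` is hypothetical. The input is taken from Example 1 in the File Format section.

```python
from itertools import combinations


def jaccard_index(classes, clusters):
    """Pairs together in both partitions / pairs together in at least one."""
    both = only_one = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            both += 1
        elif same_class or same_cluster:
            only_one += 1
    if both + only_one == 0:
        raise ValueError("no pair is grouped together in either partition")
    return both / (both + only_one)


# Natural classes and cluster labels of the five samples in Example 1.
classes = ["ALL", "ALL", "AML", "AML", "AML"]
clusters = [0, 1, 1, 0, 1]
print(jaccard_index(classes, clusters))  # 1/7, about 0.143
```

A value of 1 means that the clustering reproduces the natural classes exactly; pairs that are separated in both partitions do not count towards the index.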

Rand index

To start the Rand index calculation for the current Cluster Set, simply select the Menu item Validation -> Rand index. The Parameter Window will appear to indicate that no parameters are required for this procedure.

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.
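As background, the Rand index is the fraction of element pairs on which two partitions agree, counting both pairs grouped together in each and pairs separated in each. The Python sketch below compares cluster labels with natural class labels, which is an assumption about what Machaon compares; the function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(classes, clusters):
    """Fraction of pairs treated the same way (together or apart) by both partitions."""
    agree = total = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        agree += same_class == same_cluster
        total += 1
    return agree / total


# Natural classes and cluster labels of the five samples in Example 1.
classes = ["ALL", "ALL", "AML", "AML", "AML"]
clusters = [0, 1, 1, 0, 1]
print(rand_index(classes, clusters))  # 0.4
```

The index ranges from 0 to 1, and the actual label values do not matter, only which elements are grouped together.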

Class Accuracy

To start the Class Accuracy calculation for the current Cluster Set, simply select the Menu item Validation -> Class Accuracy. The Parameter Window will appear to indicate that no parameters are required for this procedure.

To calculate the index, click Validate. As soon as the calculation has been completed, a new entry is added to the Results Tree. The validation result is attached to the clustering result in the tree. See How to interpret the results.