Content:
An XML Schema is proposed for representing multiple clustering and validation results for a DNA microarray dataset. cluML, a free, open, XML-based format, is a new markup language for microarray data clustering and cluster validity assessment.
A dataset consists of a number of named samples/conditions (objects), each represented by the expression values of some set of genes (features). Thus, in broader terms, the format represents a single dataset consisting of named objects. Each object is represented by an element containing a number of child nodes representing named features. Each feature has a numerical value.
For example:
<objects type="microarray-samples">
. . .
<object name="sample_31">
<feature name="U22376" value="408" />
<feature name="X59417" value="1784" />
. . .
</object>
<object name="sample_32">
<feature name="U22376" value="1047" />
<feature name="X59417" value="1214" />
. . .
</object>
. . .
</objects>
The only object type currently specified by the format is “microarray-samples”; other values of the type attribute may be used for different applications.
The format supports multiple partitionings of both objects and features, as well as multiple sets of biclusters attached to the dataset. Each partitioning element contains a set of corresponding (bi)clusters, and each (bi)cluster in turn contains a set of references to objects, to features, or (in the case of a bicluster) to both. For example:
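As an illustration of how the objects element might be consumed, the following minimal sketch reads it into a nested dictionary using Python's standard xml.etree.ElementTree module. The function name read_objects and the two-sample fragment are assumptions made for this example only; they are not part of the format.

```python
import xml.etree.ElementTree as ET

# Two-sample fragment; the values are taken from the example above.
CLUML = """
<dataset name="demo">
  <objects type="microarray-samples">
    <object name="sample_31">
      <feature name="U22376" value="408" />
      <feature name="X59417" value="1784" />
    </object>
    <object name="sample_32">
      <feature name="U22376" value="1047" />
      <feature name="X59417" value="1214" />
    </object>
  </objects>
</dataset>
"""


def read_objects(xml_text):
    """Return {object name: {feature name: value as float}} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    table = {}
    for obj in root.findall("objects/object"):
        table[obj.get("name")] = {
            feature.get("name"): float(feature.get("value"))
            for feature in obj.findall("feature")
        }
    return table


table = read_objects(CLUML)
print(table["sample_32"]["U22376"])  # prints 1047.0
```

A nested dictionary keeps the object and feature names as keys, which makes it straightforward to resolve the name references used by clusters.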
<partitioning name="K-means clustering results" method="K-means">
<object-clusters>
<cluster name="0">
<object name="sample_1" />
<object name="sample_4" />
</cluster>
<cluster name="1">
. . .
</cluster>
</object-clusters>
. . .
or
<partitioning name="biclustering">
<biclusters>
<cluster name="1">
<object name="sample_31" />
<object name="sample_33" />
<feature name="U05259" />
<feature name="M92287" />
</cluster>
</biclusters>
. . .
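The cluster membership described above could be read back along the following lines. This is only a sketch: the helper name read_partitionings, the cluster assignments and the method value "hypothetical" are placeholders invented for the example, and the document is reduced to the minimum needed to run.

```python
import xml.etree.ElementTree as ET

# Reduced document; assignments and the "hypothetical" method are illustrative.
CLUML = """
<dataset name="demo">
  <objects type="microarray-samples">
    <object name="sample_1"><feature name="U05259" value="9326" /></object>
    <object name="sample_4"><feature name="U05259" value="5314" /></object>
    <object name="sample_31"><feature name="U05259" value="272" /></object>
  </objects>
  <partitioning name="K-means clustering results" method="K-means">
    <object-clusters>
      <cluster name="0">
        <object name="sample_1" />
        <object name="sample_4" />
      </cluster>
      <cluster name="1">
        <object name="sample_31" />
      </cluster>
    </object-clusters>
  </partitioning>
  <partitioning name="biclustering" method="hypothetical">
    <biclusters>
      <cluster name="1">
        <object name="sample_31" />
        <feature name="U05259" />
      </cluster>
    </biclusters>
  </partitioning>
</dataset>
"""

CLUSTER_SETS = ("object-clusters", "feature-clusters", "biclusters")


def read_partitionings(root):
    """Return {partitioning name: {cluster name: {"objects": [...], "features": [...]}}}."""
    result = {}
    for part in root.findall("partitioning"):
        clusters = {}
        for set_tag in CLUSTER_SETS:
            for cluster in part.findall(set_tag + "/cluster"):
                clusters[cluster.get("name")] = {
                    "objects": [o.get("name") for o in cluster.findall("object")],
                    "features": [f.get("name") for f in cluster.findall("feature")],
                }
        result[part.get("name")] = clusters
    return result


partitionings = read_partitionings(ET.fromstring(CLUML))
print(partitionings["biclustering"]["1"])
```

Object clusters and feature clusters come back with one of the two lists empty, while a bicluster fills both.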
Each partitioning may also contain the set of parameters used by the particular clustering algorithm implementation that produced it. For example:
<cluster-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="K" value="2" />
<parameter name="Transformation" value="No transformation" />
<parameter name="Initialization" value="First K elements" />
</cluster-parameters>
The current version of the format does not specify any particular conventions for parameter names and values. Validation results associated with each partitioning are also recorded in the format. Each validation contains a method identifier, a set of validation parameters (if any) and the results themselves, for example:
<validation method="DB">
<validation-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="Intercluster Distance" value="Complete linkage" />
<parameter name="Intracluster Distance" value="Complete diameter" />
<parameter name="Transformation" value="No transformation" />
</validation-parameters>
<validation-results>
<result name="Davies-Bouldin Index" value="1.4849593243324117" />
</validation-results>
</validation>
Both parameters and results are simply name-value pairs. For many algorithms the validation result is a single value, but some, such as Silhouettes, may produce multiple values. Each partitioning may have any number of validation results associated with it.
To summarise, the format contains:
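Because parameters and results are plain name-value pairs, a reader can collect them generically. The sketch below assumes a helper named read_validations (not defined by the format) and parses a single partitioning element; the Silhouette values are those of the leukemia example in this document, and the cluster content is abbreviated.

```python
import xml.etree.ElementTree as ET

# A single partitioning element, abbreviated to one cluster and one parameter.
PARTITIONING = """
<partitioning name="kmeans clustering" method="K-means">
  <feature-clusters>
    <cluster name="0"><feature name="U05259" /></cluster>
  </feature-clusters>
  <validation method="Silhouettes">
    <validation-parameters>
      <parameter name="Metrics" value="Euclidean" />
    </validation-parameters>
    <validation-results>
      <result name="Average Silhouette" value="0.247" />
      <result name="Cluster 0 Silhouette" value="-0.184" />
      <result name="Cluster 1 Silhouette" value="0.534" />
    </validation-results>
  </validation>
</partitioning>
"""


def read_validations(partitioning):
    """Return {validation method: (parameters dict, results dict)}."""
    validations = {}
    for validation in partitioning.findall("validation"):
        # Parameter values stay strings; result values are numeric.
        params = {
            p.get("name"): p.get("value")
            for p in validation.findall("validation-parameters/parameter")
        }
        results = {
            r.get("name"): float(r.get("value"))
            for r in validation.findall("validation-results/result")
        }
        validations[validation.get("method")] = (params, results)
    return validations


validations = read_validations(ET.fromstring(PARTITIONING))
params, results = validations["Silhouettes"]
print(len(results))  # prints 3
```

Keying by method name assumes that each method appears at most once per partitioning; a list would be needed otherwise.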
a dataset;
multiple partitionings of its objects and/or features;
multiple sets of biclusters;
multiple validation results for each partitioning;
names of clustering and validation methods used;
parameters for each clustering and validation method used.
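The structure summarised above can also be produced programmatically. The following sketch builds a small document with xml.etree.ElementTree; the function build_cluml and its arguments are hypothetical, it covers only one object partitioning with its parameters, and it performs no schema validation.

```python
import xml.etree.ElementTree as ET


def build_cluml(name, samples, part_name, method, clusters, parameters):
    """Build a dataset element with one object partitioning (hypothetical helper)."""
    root = ET.Element("dataset", name=name)
    objects = ET.SubElement(root, "objects", type="microarray-samples")
    for sample, features in samples.items():
        obj = ET.SubElement(objects, "object", name=sample)
        for gene, value in features.items():
            ET.SubElement(obj, "feature", name=gene, value=str(value))
    part = ET.SubElement(root, "partitioning", name=part_name, method=method)
    cluster_set = ET.SubElement(part, "object-clusters")
    for cluster_name, members in clusters.items():
        cluster = ET.SubElement(cluster_set, "cluster", name=cluster_name)
        for member in members:
            ET.SubElement(cluster, "object", name=member)
    # cluster-parameters follows the cluster set, as in the examples above.
    params = ET.SubElement(part, "cluster-parameters")
    for key, value in parameters.items():
        ET.SubElement(params, "parameter", name=key, value=str(value))
    return root


doc = build_cluml(
    "demo",
    {"sample_1": {"U22376": 3105}, "sample_31": {"U22376": 408}},
    "K-means clustering results",
    "K-means",
    {"0": ["sample_1"], "1": ["sample_31"]},
    {"Metrics": "Euclidean", "K": 2},
)
print(ET.tostring(doc, encoding="unicode"))
```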
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
<xsd:element name="dataset" type="Dataset" />
<xsd:complexType name="Dataset">
<xsd:sequence>
<xsd:element name="objects" type="Objects" />
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element name="note" type="Note" />
<xsd:element name="partitioning" type="Partitioning" />
</xsd:choice>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Objects">
<xsd:sequence>
<xsd:element name="object" type="Object" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="type" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Object">
<xsd:sequence>
<xsd:element name="feature" type="Feature" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Feature">
<xsd:attribute name="name" type="xsd:string" use="required" />
<xsd:attribute name="value" type="xsd:float" use="required" />
</xsd:complexType>
<xsd:complexType name="Note">
<xsd:simpleContent>
<xsd:extension base="xsd:string">
<xsd:attribute name="type" type="xsd:string" use="required" />
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
<xsd:complexType name="Partitioning">
<xsd:sequence>
<xsd:choice>
<xsd:element name="object-clusters" type="ObjectClusterSet" />
<xsd:element name="feature-clusters" type="FeatureClusterSet" />
<xsd:element name="biclusters" type="BiclusterSet" />
</xsd:choice>
<xsd:element name="cluster-parameters" type="Parameters" minOccurs="0" maxOccurs="1" />
<xsd:element name="validation" type="Validation" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required"/>
<xsd:attribute name="method" type="xsd:string" use="optional"/>
</xsd:complexType>
<xsd:complexType name="ObjectClusterSet">
<xsd:sequence>
<xsd:element name="cluster" type="ClusterOfObjects" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="FeatureClusterSet">
<xsd:sequence>
<xsd:element name="cluster" type="ClusterOfFeatures" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="BiclusterSet">
<xsd:sequence>
<xsd:element name="cluster" type="Bicluster" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="ClusterOfObjects">
<xsd:sequence>
<xsd:element name="object" type="Reference" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="ClusterOfFeatures">
<xsd:sequence>
<xsd:element name="feature" type="Reference" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" />
</xsd:complexType>
<xsd:complexType name="Bicluster">
<xsd:sequence>
<xsd:element name="object" type="Reference" minOccurs="1" maxOccurs="unbounded" />
<xsd:element name="feature" type="Reference" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Reference">
<xsd:attribute name="name" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Parameters">
<xsd:sequence>
<xsd:element name="parameter" type="Parameter" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Parameter">
<xsd:attribute name="name" type="xsd:string" use="required" />
<xsd:attribute name="value" type="xsd:string" use="required" />
</xsd:complexType>
<xsd:complexType name="Validation">
<xsd:sequence>
<xsd:element name="validation-parameters" type="Parameters" />
<xsd:element name="validation-results" type="Results" />
</xsd:sequence>
<xsd:attribute name="method" type="xsd:string" use="required"/>
</xsd:complexType>
<xsd:complexType name="Results">
<xsd:sequence>
<xsd:element name="result" type="Result" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Result">
<xsd:attribute name="name" type="xsd:string" use="required" />
<xsd:attribute name="value" type="xsd:float" use="required" />
</xsd:complexType>
</xsd:schema>
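In the schema, the Reference type is only a required name string, so the schema itself does not check that a cluster member names an object or feature that actually exists in the dataset. A consumer may want to verify this separately. The sketch below shows one possible check; the function dangling_references, the sample document and the name sample_99 are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Reduced document with one deliberately broken reference (sample_99).
CLUML = """
<dataset name="demo">
  <objects type="microarray-samples">
    <object name="sample_1"><feature name="U05259" value="9326" /></object>
    <object name="sample_4"><feature name="U05259" value="5314" /></object>
  </objects>
  <partitioning name="demo partitioning" method="hypothetical">
    <object-clusters>
      <cluster name="0">
        <object name="sample_1" />
        <object name="sample_99" />
      </cluster>
    </object-clusters>
  </partitioning>
</dataset>
"""


def dangling_references(root):
    """List (partitioning, cluster, kind, name) for references with no target."""
    known = {
        "object": {o.get("name") for o in root.findall("objects/object")},
        "feature": {f.get("name") for f in root.findall("objects/object/feature")},
    }
    problems = []
    for part in root.findall("partitioning"):
        for cluster in part.iter("cluster"):
            for ref in cluster:
                if ref.tag in known and ref.get("name") not in known[ref.tag]:
                    problems.append(
                        (part.get("name"), cluster.get("name"), ref.tag, ref.get("name"))
                    )
    return problems


problems = dangling_references(ET.fromstring(CLUML))
print(problems)
```

An empty list means every reference resolves; here the single entry points at sample_99.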
The data comprised seven leukaemia samples (3 acute myeloid leukaemia (AML) and 4 acute lymphoblastic leukaemia (ALL)) described by the expression levels of 5 genes with suspected roles in this type of cancer. These data were obtained from a study published by Golub and co-workers (Golub et al., 1999). The original data and experimental methods are available at http://www.genome.wi.mit.edu/MPR. Tab-delimited format is described here.
First, two clustering algorithms were applied to the dataset:
1) Hierarchical clustering of samples with the following parameters:
- N = 2;
- Euclidean metrics;
- Complete linkage;
- No transformation.
2) K-means clustering of genes with the following parameters:
- K = 2;
- Euclidean metrics;
- No transformation;
- Random all element initialization.
Sample     Class  U22376  X59417  U05259  M92287  M31211  Cluster
sample_31  AML    408     1784    272     1032    192     1
sample_32  AML    1047    1214    306     1024    339     1
sample_33  AML    335     1583    19      1827    59      1
sample_1   ALL    3105    3016    9326    4778    601     0
sample_2   ALL    1118    3424    895     2700    435     1
sample_3   ALL    4543    7724    628     4926    547     1
sample_4   ALL    5467    3821    5314    5403    472     0
Cluster           1       1       0       1       0
Second, two validation algorithms were applied to the cluster sets:
1) For the cluster set of samples (which includes 2 clusters): Dunn's index with the following parameters:
- Euclidean metrics;
- Complete linkage;
- Complete diameter;
- No transformation.
2) For the cluster set of genes (which includes 2 clusters): Silhouette index with the following parameters:
- Euclidean metrics;
- No transformation.
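To make the first validation setting concrete, the sketch below computes a Dunn-type index for the two sample clusters. It assumes the common definition, the smallest inter-cluster distance divided by the largest intra-cluster diameter, with complete linkage taken as the maximum pairwise Euclidean distance between two clusters and complete diameter as the maximum pairwise distance within a cluster. The function names are hypothetical and this is not necessarily how the original software computed the index; under these assumptions the result rounds to 1.227, the Dunn's index value recorded in the leukemia example.

```python
from itertools import combinations, product
from math import dist

# Expression values in the gene order U22376, X59417, U05259, M92287, M31211.
SAMPLES = {
    "sample_31": (408, 1784, 272, 1032, 192),
    "sample_32": (1047, 1214, 306, 1024, 339),
    "sample_33": (335, 1583, 19, 1827, 59),
    "sample_1": (3105, 3016, 9326, 4778, 601),
    "sample_2": (1118, 3424, 895, 2700, 435),
    "sample_3": (4543, 7724, 628, 4926, 547),
    "sample_4": (5467, 3821, 5314, 5403, 472),
}

# Hierarchical clustering result for the samples (N = 2).
CLUSTERS = {
    "0": ["sample_1", "sample_4"],
    "1": ["sample_31", "sample_32", "sample_33", "sample_2", "sample_3"],
}


def complete_diameter(members):
    """Largest Euclidean distance between two members of one cluster."""
    return max(
        (dist(SAMPLES[a], SAMPLES[b]) for a, b in combinations(members, 2)),
        default=0.0,  # a single-member cluster has zero diameter
    )


def complete_linkage(first, second):
    """Largest Euclidean distance between members of two different clusters."""
    return max(dist(SAMPLES[a], SAMPLES[b]) for a, b in product(first, second))


def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest diameter."""
    groups = list(clusters.values())
    smallest_between = min(complete_linkage(a, b) for a, b in combinations(groups, 2))
    largest_within = max(complete_diameter(g) for g in groups)
    return smallest_between / largest_within


dunn = dunn_index(CLUSTERS)
print(round(dunn, 3))  # prints 1.227
```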
Leukemia Example:
<dataset name="leukemia" xmlns:ds="urn:dataset">
<objects type="microarray-samples">
<object name="sample_31">
<feature name="U22376" value="408" />
<feature name="X59417" value="1784" />
<feature name="U05259" value="272" />
<feature name="M92287" value="1032" />
<feature name="M31211" value="192" />
</object>
<object name="sample_32">
<feature name="U22376" value="1047" />
<feature name="X59417" value="1214" />
<feature name="U05259" value="306" />
<feature name="M92287" value="1024" />
<feature name="M31211" value="339" />
</object>
<object name="sample_33">
<feature name="U22376" value="335" />
<feature name="X59417" value="1583" />
<feature name="U05259" value="19" />
<feature name="M92287" value="1827" />
<feature name="M31211" value="59" />
</object>
<object name="sample_1">
<feature name="U22376" value="3105" />
<feature name="X59417" value="3016" />
<feature name="U05259" value="9326" />
<feature name="M92287" value="4778" />
<feature name="M31211" value="601" />
</object>
<object name="sample_2">
<feature name="U22376" value="1118" />
<feature name="X59417" value="3424" />
<feature name="U05259" value="895" />
<feature name="M92287" value="2700" />
<feature name="M31211" value="435" />
</object>
<object name="sample_3">
<feature name="U22376" value="4543" />
<feature name="X59417" value="7724" />
<feature name="U05259" value="628" />
<feature name="M92287" value="4926" />
<feature name="M31211" value="547" />
</object>
<object name="sample_4">
<feature name="U22376" value="5467" />
<feature name="X59417" value="3821" />
<feature name="U05259" value="5314" />
<feature name="M92287" value="5403" />
<feature name="M31211" value="472" />
</object>
</objects>
<partitioning name="hierarchical clustering">
<object-clusters>
<cluster name="0">
<object name="sample_1" />
<object name="sample_4" />
</cluster>
<cluster name="1">
<object name="sample_31" />
<object name="sample_32" />
<object name="sample_33" />
<object name="sample_2" />
<object name="sample_3" />
</cluster>
</object-clusters>
<cluster-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="Intercluster metrics" value="Complete linkage" />
<parameter name="Transformation" value="No transformation" />
<parameter name="N" value="2" />
</cluster-parameters>
<validation method="Dunn index">
<validation-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="Intercluster metrics" value="Complete linkage" />
<parameter name="Intracluster metrics" value="Complete diameter" />
<parameter name="Transformation" value="No transformation" />
</validation-parameters>
<validation-results>
<result name="Dunn's Index" value="1.227" />
</validation-results>
</validation>
</partitioning>
<partitioning name="kmeans clustering" method="K-means">
<feature-clusters>
<cluster name="0">
<feature name="U05259" />
<feature name="M31211" />
</cluster>
<cluster name="1">
<feature name="U22376" />
<feature name="X59417" />
<feature name="M92287" />
</cluster>
</feature-clusters>
<cluster-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="K" value="2" />
<parameter name="Transformation" value="No transformation" />
<parameter name="Initialization" value="Random all element initialization" />
</cluster-parameters>
<validation method="Silhouettes">
<validation-parameters>
<parameter name="Metrics" value="Euclidean" />
<parameter name="Transformation" value="No transformation" />
</validation-parameters>
<validation-results>
<result name="Average Silhouette" value="0.247" />
<result name="Cluster 0 Silhouette" value="-0.184" />
<result name="Cluster 1 Silhouette" value="0.534" />
</validation-results>
</validation>
</partitioning>
<partitioning name="biclustering">
<biclusters>
<cluster name="bicluster_1">
<object name="sample_31" />
<object name="sample_32" />
<feature name="U05259" />
<feature name="M92287" />
</cluster>
</biclusters>
</partitioning>
<note type="description">
This is a leukemia sample from Golub's research.
</note>
<partitioning name="natural classes">
<object-clusters>
<cluster name="AML">
<object name="sample_31" />
<object name="sample_32" />
<object name="sample_33" />
</cluster>
<cluster name="ALL">
<object name="sample_1" />
<object name="sample_2" />
<object name="sample_3" />
<object name="sample_4" />
</cluster>
</object-clusters>
</partitioning>
</dataset>