Biomarkers
The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for biomarker discovery using microarray data often provide results with limited overlap. It has been suggested that one reason for these inconsistencies may be that in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are associated with the outcomes. Thus, a possible approach to improve list stability is to integrate biological information from genomic databases in the learning process. The following software is able to include biological information, in the form of similarity matrices, in the learning process.
A Brief Description
Let X be an n x m matrix, where n is the number of examples, each described by m gene expression values. Moreover, let S be an m x m similarity matrix between genes. A kernel function incorporating S can then be built as K = X P P^T X^T, where P = D^-1 (I + a(S - I)), I is the identity matrix, and D is a diagonal matrix whose entries are the row/column sums of (I + a(S - I)).
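The construction above can be sketched in a few lines of NumPy. This is only an illustration, not the code shipped in the archive: the function name build_kernel is hypothetical, D is assumed to hold the row sums (which equal the column sums when S is symmetric), and those sums are assumed to be non-zero.

```python
import numpy as np

def build_kernel(X, S, a):
    """Illustrative sketch of K = X P P^T X^T with P = D^-1 (I + a (S - I)).

    X: n x m expression matrix, S: m x m gene similarity matrix, a: scalar.
    Assumes the row sums of (I + a (S - I)) are non-zero.
    """
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    m = X.shape[1]
    if S.shape != (m, m):
        raise ValueError("S must be m x m, where m is the number of genes")
    identity = np.eye(m)
    M = identity + a * (S - identity)
    row_sums = M.sum(axis=1)
    # D is diagonal, so D^-1 M simply divides each row of M by its row sum.
    P = M / row_sums[:, None]
    XP = X @ P
    return XP @ XP.T  # n x n kernel matrix between examples
```

With a = 0 the matrix P reduces to the identity and K becomes the plain linear kernel X X^T, so a controls how strongly the similarity information in S is mixed in.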
Download
The software can be downloaded here (use tar -xzvf biomarkers.tgz to uncompress the file). Type
make biom.perc.std
to compile the source code. The compressed archive includes a toy dataset (files trial-dataset.txt and trial-dataset-labels.txt) as well as a simple similarity matrix (file similarity_matrix).
Usage
The executable accepts the following arguments:
- -E string -> the name of the dataset file (for example, trial-dataset.txt).
- -L string -> the name of the file with target labels.
- -S string -> the name of the file with the similarity matrix S.
- -W string -> the name of the output file with the weights of the genes.
- -a float -> parameter a (see discussion above).
- -n [0..3] -> preprocessing applied to the data:
  0 no preprocessing of the data is performed.
  1 data is modified in order to have zero mean.
  2 data is modified in order to have unit variance.
  3 data is modified in order to have zero mean and unit variance.
For example:
./biom.perc.std -E trial-dataset.txt -L trial-dataset-labels.txt -S similarity_matrix -W weights -a 1 -n 3
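To make the four values of -n concrete, here is a minimal NumPy sketch of the corresponding transformations. It is an assumption that the statistics are computed per gene (per column) and that the population variance is used; the page does not say how the executable does this, and the function name preprocess is hypothetical.

```python
import numpy as np

def preprocess(X, mode):
    """Illustrative equivalent of the -n option (assumed per-gene statistics)."""
    if mode not in (0, 1, 2, 3):
        raise ValueError("mode must be 0, 1, 2 or 3")
    X = np.asarray(X, dtype=float)
    if mode in (1, 3):
        # Zero mean: subtract the mean of each gene (column).
        X = X - X.mean(axis=0)
    if mode in (2, 3):
        # Unit variance: divide each gene by its standard deviation.
        std = X.std(axis=0)
        std[std == 0] = 1.0  # leave constant genes unchanged
        X = X / std
    return X
```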
The dataset file describes each example (one per row) by a list of real values (gene expressions). The first line of the file lists all gene identifiers, and the first column contains the identifier of each example.
The target label file is a two-column file. The first column contains the identifier of the example, and the second contains its target (0 or 1).
The similarity matrix file contains similarity values between genes (the maximum value being 1). The first column lists the gene identifiers.
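As a rough illustration of the dataset and label layouts described above, the following Python sketch formats both files from in-memory data. Several details are assumptions, not documented behaviour: the tab separator, the placeholder "id" cell at the start of the header line, and the function names format_dataset and format_labels. Compare the result with the bundled trial-dataset.txt and trial-dataset-labels.txt before relying on it.

```python
def format_dataset(example_ids, gene_ids, values, sep="\t"):
    """Header line with gene identifiers, then one row per example.

    The leading "id" header cell and the separator are assumptions.
    """
    lines = [sep.join(["id"] + list(gene_ids))]
    for example_id, row in zip(example_ids, values):
        if len(row) != len(gene_ids):
            raise ValueError("each example needs one value per gene")
        lines.append(sep.join([example_id] + [repr(float(v)) for v in row]))
    return "\n".join(lines) + "\n"

def format_labels(example_ids, labels, sep="\t"):
    """Two columns: example identifier and target (0 or 1)."""
    lines = []
    for example_id, label in zip(example_ids, labels):
        if label not in (0, 1):
            raise ValueError("targets must be 0 or 1")
        lines.append(f"{example_id}{sep}{label}")
    return "\n".join(lines) + "\n"
```

The returned strings can be written to disk with the usual open(path, "w") call and passed to the executable through -E and -L.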
References