"Never doubt that a small group of thoughtful, committed citizens can change the world. Indeed, it is the only thing that ever has."

Margaret Mead
Research Article

Constructing Endophenotypes of Complex Diseases Using Non-Negative Matrix Factorization and Adjusted Rand Index


Complex diseases are typically caused by combinations of molecular disturbances that vary widely among different patients. Endophenotypes, a combination of genetic factors associated with a disease, offer a simplified approach to dissect complex trait by reducing genetic heterogeneity. Because molecular dissimilarities often exist between patients with indistinguishable disease symptoms, these unique molecular features may reflect pathogenic heterogeneity. To detect molecular dissimilarities among patients and reduce the complexity of high-dimension data, we have explored an endophenotype-identification analytical procedure that combines non-negative matrix factorization (NMF) and adjusted rand index (ARI), a measure of the similarity of two clusterings of a data set. To evaluate this procedure, we compared it with a commonly used method, principal component analysis with k-means clustering (PCA-K). A simulation study with gene expression dataset and genotype information was conducted to examine the performance of our procedure and PCA-K. The results showed that NMF mostly outperformed PCA-K. Additionally, we applied our endophenotype-identification analytical procedure to a publicly available dataset containing data derived from patients with late-onset Alzheimer’s disease (LOAD). NMF distilled information associated with 1,116 transcripts into three metagenes and three molecular subtypes (MS) for patients in the LOAD dataset: MS1 (), MS2 (), and MS3 (). ARI was then used to determine the most representative transcripts for each metagene; 123, 89, and 71 metagene-specific transcripts were identified for MS1, MS2, and MS3, respectively. These metagene-specific transcripts were identified as the endophenotypes. Our results showed that 14, 38, 0, and 28 candidate susceptibility genes listed in AlzGene database were found by all patients, MS1, MS2, and MS3, respectively. Moreover, we found that MS2 might be a normal-like subtype. Our proposed procedure provides an alternative approach to investigate the pathogenic mechanism of disease and better understand the relationship between phenotype and genotype.