HEMO 2025 / III Simpósio Brasileiro de Citometria de Fluxo
Human hematopoiesis is a dynamic process initiated during embryonic development in distinct anatomical niches, culminating in the formation of hematopoietic stem cells (HSCs), which are responsible for lifelong blood cell production. With aging, HSCs undergo functional decline, contributing to hematological disorders such as acute myeloid leukemia (AML). In these conditions, altered stem cells may drive disease progression and resist treatment. Advances in single-cell transcriptomics (scRNA-seq) have enabled high-resolution analysis of cellular heterogeneity in both healthy and diseased tissues. To manage the complexity of such data, machine learning (ML) algorithms have been increasingly employed to uncover gene expression patterns, differentiation states, and pathological features. These models outperform traditional marker-based methods by classifying cell types, estimating dedifferentiation levels, and detecting resistant subpopulations. In this context, algorithms such as One-Class Logistic Regression (OCLR) and decision tree-based methods have shown promise in identifying molecular stem cell signatures across biological settings. Once trained, these models can be applied to new datasets to recognize cells with similar profiles, aiding in the functional annotation of both normal and malignant samples.
Objectives
The aim of this study was to train machine learning models on public bone marrow scRNA-seq datasets to identify cells with a stemness profile, apply these models to transcriptomic data to classify samples, and evaluate their clinical impact.
Material and methods
Classifiers based on OCLR, Random Forest, and a linear-kernel Support Vector Machine (SVM) were trained on public bone marrow scRNA-seq datasets. The models were then applied, using Spearman correlation, to raw counts that had been normalized and scaled, taken from transcriptomic data of the TCGA AML cohort (n = 151) and two public cohorts with healthy samples (n = 101). For each sample, a z-score was calculated relative to the healthy samples: z = (sample score − mean(healthy)) / SD(healthy). Scores above 1.96 were considered indicative of high stemness. Hazard ratios were estimated using Cox proportional hazards models.
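The scoring step described above can be illustrated with a minimal sketch. It assumes that each trained model yields one weight per gene and that a sample's stemness score is the Spearman correlation between those weights and the sample's expression values; the abstract does not detail this step, so the function names and the choice of the sample standard deviation are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def stemness_z(sample_score, healthy_scores):
    """z = (sample score - mean(healthy)) / SD(healthy).

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    return (sample_score - mean(healthy_scores)) / stdev(healthy_scores)


def is_high_stemness(z, threshold=1.96):
    """Apply the cut-off stated in the methods: z above 1.96 is 'high'."""
    return z > threshold


# Hypothetical toy data: per-gene model weights and one sample's expression.
gene_weights = [0.8, -0.2, 0.5, 0.1]
sample_expression = [9.1, 2.3, 7.4, 4.0]
sample_score = spearman(gene_weights, sample_expression)

# Hypothetical stemness scores of healthy reference samples.
healthy_scores = [0.10, 0.25, 0.15, 0.30, 0.20]
z = stemness_z(sample_score, healthy_scores)
print(sample_score, z, is_high_stemness(z))
```

The 1.96 cut-off corresponds to the upper 2.5% tail of a standard normal distribution, so a sample is flagged only when its score lies well above the healthy reference distribution.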
Results
All models achieved comparable performance in metrics such as AUC and accuracy. Random Forest showed a higher area under the precision-recall curve (AUPRC) in external validation, significantly outperforming SVM (p = 0.0380, Nemenyi post-test). In survival analysis, only the Random Forest stemness classification was significantly associated with overall survival: patients in the high-stemness group (z-score > 1.96) had a hazard ratio (HR) of 1.73 (95% CI: 1.03–2.89, log-rank p = 0.0344) compared to the low-stemness group. Median survival was 0.75 years in the high-stemness group and 1.59 years in the low-stemness group. In contrast, no significant association was observed for the OCLR (HR = 1.02, 95% CI: 0.59–1.76, p = 0.922) or SVM (HR = 1.15, 95% CI: 0.66–2.02, p = 0.600) models.
Discussion and conclusion
Although all models demonstrated similar discriminative performance, the Random Forest approach not only achieved superior AUPRC in external validation but was also the only one to show a significant prognostic association with overall survival. These findings suggest that, among the tested stemness scoring methods, the Random Forest-derived HSCsi model may provide greater clinical utility by combining predictive accuracy with prognostic relevance.
Funding
This work was supported by the National Council for Scientific and Technological Development (CNPq), Brazil.




