Multi-Omics Integration: RA-ILD Diagnostic Biomarkers

Jun 5

Roughly one in ten people with rheumatoid arthritis develops interstitial lung disease, and once it shows up the median survival sits near 2.6 years. Diagnosis still hinges on a radiologist reading high-resolution CT, a process that can miss early fibrosis and whose results vary from one reader to the next. A team working across four Chinese hospitals asked whether multi-omics integration could sharpen that call. From 278 rheumatoid arthritis patients they pulled blood transcriptomics, proteomics, and metabolomics, added CT radiomics and routine labs, then let machine learning choose the features that tell RA-ILD apart from RA without lung involvement.

What Multi-Omics Integration Adds to RA-ILD Diagnosis

The design split the work into two cohorts. Cohort 1, 109 patients from one tertiary hospital, trained and tuned the models; cohort 2, 169 patients pooled from three other centers, served as a blind external test. RNA-seq, 4D DIA mass spectrometry, and LC-MS metabolomics each went through their own feature selection using LASSO, support vector machines, and random forests. Of the four single-layer readouts, the metabolomics model carried the most diagnostic weight. A CatBoost classifier built on five serum metabolites reached an AUC of 0.982 in validation, beating the transcriptomic and radiomic models by a wide margin. That ordering fits a plain idea: metabolites sit furthest downstream, so they register the active disease state more directly than transcripts that regulatory feedback can buffer.

Key Findings

Metabolomics carried the strongest single-omics signal: a CatBoost model trained on five serum metabolites reached an AUC of 0.982 and 0.929 accuracy, outperforming the best transcriptomic (AUC 0.829) and radiomic (AUC 0.885) models.
Combining layers beat any single one: the integrated model, built from the top feature of each omics layer plus radiomics, held AUCs of 0.968 to 0.984 with positive predictive value at or above 0.885.
Four features did most of the lifting: the radiomic texture measure Kurtosis, the metabolite 6-Keto-PGF1alpha, the protein RDH11, and the transcript SYS1-DBNDD2, each of which separated the two groups and tracked disease biology.
The clinical-radiomics nomogram travelled well: pairing a Rad-score with six routine labs gave an internal AUC of 0.963 and held 0.913 on the external multicenter cohort, where single-feature models slipped.
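
For readers less familiar with the metrics quoted above, here is a minimal sketch of how AUC and positive predictive value are computed from model scores. The scores, the 0.5 cutoff, and the function names are invented for illustration; they are not data or code from the study.

```python
def auc(scores_pos, scores_neg):
    """AUC as the chance that a random positive case outscores a random
    negative case, with ties counted as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def ppv(y_true, y_pred):
    """Positive predictive value: true positives / all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else float("nan")


# Hypothetical model scores (not study data): higher means "more likely RA-ILD".
ild_scores = [0.9, 0.8, 0.7, 0.4]
non_ild_scores = [0.6, 0.3, 0.2, 0.1]

toy_auc = auc(ild_scores, non_ild_scores)

# Turn scores into yes/no calls with an assumed cutoff of 0.5.
y_true = [1] * len(ild_scores) + [0] * len(non_ild_scores)
y_pred = [1 if s >= 0.5 else 0 for s in ild_scores + non_ild_scores]
toy_ppv = ppv(y_true, y_pred)

print(toy_auc, toy_ppv)
```

AUC does not depend on any cutoff, while PPV does, and PPV also shifts with how common the disease is in the tested group. That is one reason a model's PPV in a study cohort may not carry over unchanged to a screening population.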


Figure 1. Study workflow. The flowchart traces the analysis from enrollment of 278 rheumatoid arthritis patients across an internal cohort (n=109) and an external validation cohort (n=169) through multi-omics and radiomics data collection, feature selection with random forest, SVM, LASSO, and a Transformer, and construction of single-omics, multi-omics, and clinical-radiomics models. Endpoints include the best single-omics model (metabolomics CatBoost, AUC 0.982), the integrated model (AUC 0.968 to 0.984), biomarker identification, and a clinical risk nomogram. Adapted from Wu et al. (2026), Clinics.

How Machine Learning Ranked the Omics Layers

Before any model was fit, partial least squares discriminant analysis showed how cleanly each layer split the groups. Radiomics separated best on its own; transcriptomics separated worst. Four algorithms, LASSO, random forest, LightGBM, and CatBoost, were then trained per layer and judged on a held-out set. The MS-based proteomics models stayed stable, every algorithm clearing an AUC of 0.945, while the metabolomics CatBoost model edged ahead of the field. The authors leaned on SHAP values to read each feature's contribution, which is what let them carry just the single strongest variable from each layer into the combined model rather than dumping in hundreds of correlated features.
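
The "one strongest variable per layer" step can be sketched as follows. This assumes features are ranked by mean absolute SHAP value, which is a common convention but not something the summary above confirms as the authors' exact rule. The layer contents, feature names, and numbers are placeholders, not study data.

```python
def mean_abs(values):
    """Average magnitude of per-patient SHAP values for one feature."""
    return sum(abs(v) for v in values) / len(values)


def top_feature_per_layer(shap_by_layer):
    """Return the feature with the largest mean |SHAP| in each layer.

    shap_by_layer maps layer name -> {feature name -> list of SHAP values,
    one per patient}.
    """
    selected = {}
    for layer, features in shap_by_layer.items():
        selected[layer] = max(features, key=lambda name: mean_abs(features[name]))
    return selected


# Hypothetical SHAP values for three patients per feature (illustration only).
toy_shap = {
    "metabolomics": {
        "metabolite_a": [0.30, -0.25, 0.40],
        "metabolite_b": [0.05, -0.02, 0.01],
    },
    "proteomics": {
        "protein_a": [0.10, -0.10, 0.05],
        "protein_b": [-0.20, 0.35, -0.15],
    },
}

selected = top_feature_per_layer(toy_shap)
print(selected)
```

Taking the absolute value matters: a feature that pushes predictions strongly in both directions would otherwise average out near zero and look unimportant.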

Biomarkers That Track Inflammation and Lung Function

The four surviving features are not just statistically convenient; they map onto known biology. 6-Keto-PGF1alpha, a stable breakdown product of prostacyclin, rose in RA-ILD and climbed alongside CRP, neutrophil count, and anti-CCP titres, marking residual inflammation that standard anti-inflammatory drugs had not switched off. The proteomic marker RDH11, an enzyme in retinol metabolism, fell as gas-exchange efficiency worsened. The radiomic Kurtosis feature tracked fibrosis severity and, on pathway analysis, lined up with TLR and JAK-STAT signaling. Earlier multi-omics factor work from Argelaguet and colleagues (2018) made the same general case, that a few well-chosen latent features can outperform a kitchen-sink model, and this integrative omics study is a clean clinical example of it.

Running Three Omics From One Blood Draw

The result that the integrated model beat every single-layer model is the whole argument for measuring more than one omic at once. Proteins and metabolites report on different parts of the same disease, and when they come off separate runs the batch effects and split sample volumes blur the comparison. Pulling proteomics and metabolomics from a single injection, which the Omni-MS workflow at Dalton is built to do, keeps those layers on the same analytical footing so the integration step isn't fighting technical noise. For a cohort this small, that footing probably matters more than the choice of classifier.

Frequently Asked Questions

What is multi-omics integration?

Multi-omics integration combines measurements from different molecular layers, such as transcripts, proteins, and metabolites, into one analysis instead of studying each in isolation. The aim is to catch disease signals that any single layer would miss. In this study it also folded in CT radiomics and routine clinical labs.

How accurate was the multi-omics integration model for RA-ILD?

The integrated model held an AUC between 0.968 and 0.984 on the validation set, beating every single-omics model. Its positive predictive value stayed at or above 0.885. A separate clinical-radiomics model reached an AUC of 0.963 internally and 0.913 on an external multicenter cohort.

What biological samples does this kind of omics workup need?

The team used peripheral blood for transcriptomics, proteomics, and metabolomics, plus existing high-resolution CT scans and standard lab values. No lung biopsy was required, which makes the workflow far less invasive than tissue-based diagnosis. It does still depend on access to mass spectrometry and sequencing.

Conclusion

On this evidence, blood metabolites paired with CT texture features can flag RA-ILD with high accuracy in validation, and the external-cohort numbers suggest the signal is not a single-center artifact. What is not settled is causation: the design is cross-sectional, the sample is small for high-dimensional data, and mixed drug regimens could shape the metabolite profiles. For groups building RA-ILD screening tools, the sensible next step is a prospective, treatment-stratified cohort before any of these markers go near the clinic.

Citation

Wu, D., Chen, J., Liang, H., Chen, C., Liang, M., Liao, C., He, X., Zhai, J., Dai, M., Lu, X., Zeng, F., & Zou, Q. (2026). Construction of Rheumatoid Arthritis-Associated Interstitial Lung Disease diagnostic model and identification of biomarkers based on a multi-omics integration strategy of machine learning. Clinics, 81, 100933. https://doi.org/10.1016/j.clinsp.2026.100933

Note

This blog post summarizes findings from the above-cited research. Figures are adapted from the original publication. For full details, please refer to the source article.

By Seungjun Yeo, Co-Founder and CEO at Dalton Bioanalytics. Specializing in multi-omics mass spectrometry for drug discovery and biomarker research.
