Document Type : Research Paper
Authors
1 Department of Soil Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
2 Soil Reclamation and Sustainable Land Management Research Department, Soil and Water Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran
3 Department of Computer, Jihad Daneshgahi of Khorasan Razavi, Mashhad, Iran
Abstract
The spread of antibiotic resistance genes (ARGs) in agricultural soils has emerged as a critical public health threat, with soils serving as both reservoirs and transmission pathways for resistance determinants. Accurate identification of ARGs within complex soil metagenomic data is essential for monitoring resistance dissemination, yet conventional alignment-based methods remain limited to detecting known resistance genes and fail to identify novel or divergent variants. This study applies four machine learning algorithms, Random Forest, Support Vector Machine (SVM), XGBoost, and Multilayer Perceptron (MLP), to classify ARGs based on biologically informative sequence features, including GC content, codon usage, and amino acid composition. By systematically evaluating model performance under imbalanced data conditions and limited training samples, this work provides a comparative assessment of machine learning approaches for resistance gene detection and demonstrates the utility of interpretable features in distinguishing resistant from non-resistant sequences in agricultural soil microbiomes.
The specific objectives of this study were to: (1) develop and train four machine learning models (Random Forest, SVM, XGBoost, and MLP) for binary classification of antibiotic resistance genes (ARGs) versus non-ARGs using sequence-derived features (GC content, codon frequency, and amino acid composition); (2) compare model performance using precision, recall, F1-score, and cross-validated AUC-ROC and AUC-PRC metrics under varying training/test splits (80/20, 70/30, 60/40, and 10/90); (3) identify the most influential biological features contributing to ARG classification through feature importance analysis; and (4) evaluate model robustness and generalization capability under class-imbalanced conditions representative of real-world metagenomic datasets.
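The evaluation protocol in objective (2) can be illustrated with a minimal sketch of a class-preserving (stratified) split at the four stated ratios. Stratification is an assumption here, since the text does not say how the splits were drawn, and the function name `stratified_split` is hypothetical. The class sizes (50 ARG, 150 non-ARG) are those given in the methods.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices into train/test while keeping class proportions.

    Assumption: the study's splits are not described in detail; this is one
    plausible way to draw them, not the authors' procedure.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)  # per-class training size
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)

# 1 = ARG, 0 = non-ARG (class sizes taken from the methods description)
labels = [1] * 50 + [0] * 150

for fraction in (0.8, 0.7, 0.6, 0.1):
    train_idx, test_idx = stratified_split(labels, fraction)
    arg_in_train = sum(labels[i] for i in train_idx)
    print(f"train={len(train_idx):3d} test={len(test_idx):3d} "
          f"ARG in train={arg_in_train}")
```

Under the 10/90 ratio this leaves only 5 ARG sequences for training, which is why that split is a demanding test of how well a model generalizes from very few positive examples.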
This study used metagenomic sequence data from agricultural soil microbiomes obtained from the NCBI database and processed through the ARGs-OAP pipeline against the SARG database. Approximately 400,000 sequences were initially retrieved, with quality control performed using FastQC and Cutadapt. Biological features, including GC content, amino acid composition (21 amino acids), and codon usage frequency (64 codons), were extracted from each sequence. Because the dataset was highly imbalanced (50 ARGs vs. hundreds of thousands of non-ARGs), a more balanced subset was constructed by retaining all 50 ARG sequences and selecting 150 non-ARG sequences with statistically significant feature differences (Mann-Whitney U test, p < 0.05) and GC content restricted to 10-30%. Four machine learning algorithms, Random Forest, Support Vector Machine (SVM), XGBoost, and Multilayer Perceptron (MLP), were applied to the final dataset of 200 samples. Model performance was evaluated using precision, recall, and F1-score, with cross-validation performed under multiple training/test split ratios (80/20, 70/30, 60/40, and 10/90). Feature importance analysis was conducted to identify the most discriminative biological predictors.
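The three feature families described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes in-frame coding DNA, uses the standard genetic code, and treats the stop signal `*` as a 21st symbol alongside the 20 standard amino acids (one possible reading of "21 amino acids", which the text does not explain). All function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]   # 64 codons
CODON_TABLE = dict(zip(CODONS, AMINO))
SYMBOLS = sorted(set(AMINO))   # 20 amino acids + '*' (assumed 21st symbol)

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def split_codons(seq):
    """In-frame triplets; incomplete or ambiguous codons are skipped."""
    seq = seq.upper()
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return [c for c in triplets if c in CODON_TABLE]

def codon_frequencies(seq):
    """Relative frequency of each of the 64 codons."""
    codons = split_codons(seq)
    counts = Counter(codons)
    return {c: (counts[c] / len(codons) if codons else 0.0) for c in CODONS}

def amino_acid_composition(seq):
    """Relative frequency of each translated symbol."""
    residues = [CODON_TABLE[c] for c in split_codons(seq)]
    counts = Counter(residues)
    return {s: (counts[s] / len(residues) if residues else 0.0) for s in SYMBOLS}

def extract_features(seq):
    """Concatenate GC content, codon usage, and amino acid composition."""
    codon = codon_frequencies(seq)
    amino = amino_acid_composition(seq)
    return [gc_content(seq)] + [codon[c] for c in CODONS] + [amino[s] for s in SYMBOLS]

example = "ATGCTGGCGCGCTAA"   # toy sequence: Met-Leu-Ala-Arg-stop
features = extract_features(example)
print(len(features), round(gc_content(example), 2))
```

Each sequence becomes a fixed-length vector of 86 values (1 + 64 + 21), so sequences of different lengths can be passed to the same classifier.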
All four machine learning models demonstrated strong performance in ARG classification. Random Forest achieved the highest overall accuracy (0.9877) with perfect recall for the non-resistant class (1.000) and near-perfect recall for resistant genes (0.9524). MLP showed the highest cross-validated mean F1-score (0.9581 ± 0.0656) and matched Random Forest's recall for resistant genes (0.9524) with minimal false positives. SVM performed reliably with 0.9753 accuracy but showed lower recall for resistant genes (0.9048). XGBoost exhibited the weakest performance among the four models, with the lowest recall for resistant genes (0.8571) and the lowest cross-validated F1-score (0.9379 ± 0.0791). AUC-ROC and AUC-PRC analyses confirmed Random Forest as the top performer (AUC-ROC = 0.987 ± 0.005; AUC-PRC = 0.984 ± 0.006), followed closely by MLP (AUC-ROC = 0.981 ± 0.006; AUC-PRC = 0.976 ± 0.007). Confusion matrices revealed that Random Forest and MLP misclassified only one sample each, while XGBoost misclassified three resistant genes. Feature importance analysis consistently identified specific codons, particularly CTG (Leucine), GCG (Alanine), and CGC (Arginine), along with GC content, as the most influential predictors across all models.
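The reported accuracy and recall values can be sanity-checked with simple confusion-matrix arithmetic. The sketch below assumes a held-out set of 81 sequences (21 resistant, 60 non-resistant). These counts are not stated in the text; they are inferred only because they reproduce the reported figures after rounding. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 for the positive (resistant) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Inferred counts (assumption): 21 resistant and 60 non-resistant test sequences.
# Random Forest: one resistant gene missed, no false positives.
rf = binary_metrics(tp=20, fn=1, fp=0, tn=60)
# SVM: two resistant genes missed, no false positives.
svm = binary_metrics(tp=19, fn=2, fp=0, tn=60)

print(round(rf["accuracy"], 4), round(rf["recall"], 4))    # 0.9877 0.9524
print(round(svm["accuracy"], 4), round(svm["recall"], 4))  # 0.9753 0.9048
print(round(18 / 21, 4))  # XGBoost recall with three resistant genes missed: 0.8571
```

Under these assumed counts, a recall of 0.9524 corresponds to 20 of 21 resistant genes detected, which matches the statement that Random Forest and MLP each misclassified a single sample.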
This study demonstrates that machine learning models, particularly Random Forest and Multilayer Perceptron, can effectively identify antibiotic resistance genes in agricultural soil microbiomes using biologically derived sequence features, even under class-imbalanced conditions and limited sample sizes. Key biological predictors, including specific codon usage patterns (CTG, GCG, CGC), amino acid composition (Leucine, Valine), and GC content, were identified as robust discriminators between resistant and non-resistant sequences. These findings confirm that sequence-intrinsic features alone can provide sufficient signal for ARG detection independent of homology-based methods. The superior performance of ensemble and neural network approaches suggests their potential for integration into environmental monitoring pipelines. Future research should focus on validating these models on larger, geographically diverse datasets and incorporating additional feature types (e.g., genomic context, mobile genetic elements) to enhance generalizability and enable real-world deployment for antibiotic resistance surveillance in agricultural ecosystems.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
KeshikNevisRazavi, S.R. contributed to the conceptualization, methodology, software development, investigation, data curation, writing of the original draft, and manuscript review and editing. Farahani, E. was involved in validation, writing of the original draft, review and editing of the manuscript, supervision of the research process, and overall project administration. Emami, H. contributed to the validation, formal analysis, investigation, data curation, and writing of the original draft. Abedinzadeh, N. participated in software development, data curation, and writing of the original draft. Abdolahi, M. contributed to the methodology, software implementation, validation, formal analysis, investigation, data curation, writing of the original draft, review and editing, and visualization of results. All authors have read and approved the final version of the manuscript.
The authors declare that no AI or AI-assisted technologies were used in this article.
The data and materials used in the study are available upon reasonable request.
The authors would like to thank the contributors and maintainers of the NCBI database and the developers of the ARGs-OAP pipeline for providing open-access resources that made this study possible.
No human or animal subjects were involved, and therefore no ethical approval was required.
The authors declare that there is no conflict of interest regarding the publication of this paper.