Comparing Machine Learning Algorithms for Identifying Antibiotic Resistance Genes (ARGs) in the Agricultural Soil Microbiome; Case Study in East Asia

Document Type: Research Paper

Authors

1 Department of Soil Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran

2 Soil Reclamation and Sustainable Land Management Research Department, Soil and Water Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran

3 Department of Computer, Jihad Daneshgahi of Khorasan Razavi, Mashhad, Iran

Abstract

With the escalating threat of antibiotic resistance, the accurate and comprehensive identification of antibiotic resistance genes (ARGs) in natural environments, particularly agricultural soils, has become a major concern in public and environmental health. In recent years, the application of machine learning algorithms has gained attention as a novel approach for analyzing complex metagenomic data and improving ARG detection. In this study, four machine learning algorithms—Random Forest, Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Multilayer Perceptron (MLP)—were compared for their ability to identify resistance genes in the agricultural soil microbiome in India and China. Metagenomic data were obtained from the NCBI database and processed using the ARGs-OAP tool. A set of biological features, including GC content, amino acid frequency, and codon usage, was extracted. Statistical differences between resistant and non-resistant genes were assessed using the Mann–Whitney U test, and only significant features were selected for model training. The results demonstrated that the models, particularly Random Forest (with 98% accuracy), were capable of identifying resistance genes with high performance, even under conditions of imbalanced data and limited training sample size. These findings highlight the effectiveness of the selected biological features and machine learning algorithms in detecting ARGs in the agricultural soil microbiome in East Asia. This approach could serve as an efficient tool for environmental monitoring and policy-making aimed at controlling the spread of antibiotic resistance.

Introduction

The spread of antibiotic resistance genes (ARGs) in agricultural soils has emerged as a critical public health threat, with soils serving as both reservoirs and transmission pathways for resistance determinants. Accurate identification of ARGs within complex soil metagenomic data is essential for monitoring resistance dissemination, yet conventional alignment-based methods remain limited to detecting known resistance genes and fail to identify novel or divergent variants. This study applies four machine learning algorithms, Random Forest, Support Vector Machine (SVM), XGBoost, and Multilayer Perceptron (MLP), to classify ARGs based on biologically informative sequence features, including GC content, codon usage, and amino acid composition. By systematically evaluating model performance under imbalanced data conditions and limited training samples, this work provides a comparative assessment of machine learning approaches for resistance gene detection and demonstrates the utility of interpretable features in distinguishing resistant from non-resistant sequences in agricultural soil microbiomes.

Objective(s)

The specific objectives of this study were to: (1) develop and train four machine learning models (Random Forest, SVM, XGBoost, and MLP) for binary classification of antibiotic resistance genes (ARGs) versus non-ARGs using sequence-derived features (GC content, codon frequency, and amino acid composition); (2) compare model performance using precision, recall, F1-score, and cross-validated AUC-ROC and AUC-PRC metrics under varying training/test splits (80/20, 70/30, 60/40, and 10/90); (3) identify the most influential biological features contributing to ARG classification through feature importance analysis; and (4) evaluate model robustness and generalization capability under class-imbalanced conditions representative of real-world metagenomic datasets.
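The split ratios in objective (2) can be illustrated with a stratified split, which keeps the ARG to non-ARG ratio the same in the training and test parts. The following is a minimal sketch only, not the authors' implementation: the function name `stratified_split`, the fixed seed, and the label vector are assumptions, with the class counts (50 ARG, 150 non-ARG) taken from the Methods.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so that
    both parts keep roughly the same class ratio as the full dataset."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label vector: 1 = ARG, 0 = non-ARG (counts from the Methods).
labels = [1] * 50 + [0] * 150

# Test fractions corresponding to the 80/20, 70/30, 60/40, and 10/90 splits.
for test_fraction in (0.2, 0.3, 0.4, 0.9):
    train_idx, test_idx = stratified_split(labels, test_fraction)
    n_arg_test = sum(labels[i] for i in test_idx)
    print(
        f"test={test_fraction:.0%}: {len(train_idx)} train / "
        f"{len(test_idx)} test, {n_arg_test} ARGs in test"
    )
```

Under the 10/90 split only 20 sequences (5 of them ARGs) remain for training, which is why that ratio is a demanding test of performance with limited training data.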

Methods

This study used metagenomic sequence data from agricultural soil microbiomes obtained from the NCBI database and processed through the ARGs-OAP pipeline against the SARG database. Approximately 400,000 sequences were initially retrieved, with quality control performed using FastQC and Cutadapt. Biological features, including GC content, amino acid composition (21 amino acids), and codon usage frequency (64 codons), were extracted from each sequence. Because the dataset was highly imbalanced (50 ARGs vs. hundreds of thousands of non-ARGs), a reduced subset with a 1:3 class ratio was constructed by retaining all 50 ARG sequences and selecting 150 non-ARG sequences with statistically significant feature differences (Mann–Whitney U test, p < 0.05) and GC content restricted to 10–30%. Four machine learning algorithms, Random Forest, Support Vector Machine (SVM), XGBoost, and Multilayer Perceptron (MLP), were applied to the final dataset of 200 samples. Model performance was evaluated using precision, recall, and F1-score, with cross-validation performed under multiple training/test split ratios (80/20, 70/30, 60/40, and 10/90). Feature importance analysis was conducted to identify the most discriminative biological predictors.
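The feature extraction and significance screening described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: each sequence is assumed to be an in-frame coding sequence, the standard genetic code is used, the stop signal is counted as a 21st symbol (one possible reading of "21 amino acids"), and the Mann–Whitney U p-value uses a simple normal approximation without tie or continuity correction. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(CODONS, AMINO))


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def codon_usage(seq):
    """Relative frequency of each of the 64 codons (frame 0 assumed)."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in CODON_TABLE]  # skip ambiguous bases
    counts = Counter(codons)
    total = len(codons)
    return {c: (counts[c] / total if total else 0.0) for c in CODONS}


def amino_acid_composition(seq):
    """Relative frequency of the 20 amino acids plus the stop symbol."""
    comp = {aa: 0.0 for aa in sorted(set(AMINO))}
    for codon, freq in codon_usage(seq).items():
        comp[CODON_TABLE[codon]] += freq
    return comp


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U, approximate p-value).

    Ties receive average ranks; the p-value comes from a normal
    approximation without tie or continuity correction.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs((u - mu) / sigma) / math.sqrt(2))
    return u, p


# Toy example with made-up GC values for two groups (not data from the study).
arg_gc = [0.61, 0.64, 0.66, 0.68, 0.70]
non_arg_gc = [0.41, 0.44, 0.47, 0.50, 0.52]
u_stat, p_value = mann_whitney_u(arg_gc, non_arg_gc)
print(f"U = {u_stat}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would normally replace the hand-written test; the pure-Python version is shown only to keep the sketch dependency-free.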

Results

All four machine learning models demonstrated strong performance in ARG classification. Random Forest achieved the highest overall accuracy (0.9877), with perfect recall for the non-resistant class (1.000) and near-perfect recall for resistant genes (0.9524). MLP showed the highest cross-validated mean F1-score (0.9581 ± 0.0656) and matched Random Forest's near-perfect recall for resistant genes (0.9524) with minimal false positives. SVM performed reliably with 0.9753 accuracy but showed lower recall for resistant genes (0.9048). XGBoost exhibited the weakest performance among the four models, with the lowest recall for resistant genes (0.8571) and the lowest cross-validated F1-score (0.9379 ± 0.0791). AUC-ROC and AUC-PRC analyses confirmed Random Forest as the top performer (AUC-ROC = 0.987 ± 0.005; AUC-PRC = 0.984 ± 0.006), followed closely by MLP (AUC-ROC = 0.981 ± 0.006; AUC-PRC = 0.976 ± 0.007). Confusion matrices revealed that Random Forest and MLP misclassified only one sample each, while XGBoost misclassified three resistant genes. Feature importance analysis consistently identified specific codons, particularly CTG (Leucine), GCG (Alanine), and CGC (Arginine), along with GC content, as the most influential predictors across all models.
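For readers who want to see how these metrics relate to a confusion matrix, the sketch below computes them from raw counts. The counts are hypothetical: the abstract does not report the confusion matrices, and the values used here (20 of 21 resistant genes and all 60 non-resistant genes classified correctly) are chosen only because they reproduce the reported Random Forest recall (0.9524) and accuracy (0.9877) with a single misclassified sample.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics, treating 'resistant' as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # recall for resistant genes
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall for non-resistant genes
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts consistent with the reported Random Forest figures.
rf = classification_metrics(tp=20, fp=0, fn=1, tn=60)
for name, value in rf.items():
    print(f"{name}: {value:.4f}")
```

Because resistant genes are the minority class, recall for that class is more informative than overall accuracy: a single missed ARG lowers recall by about five percentage points but accuracy by only about one.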

Conclusions

This study demonstrates that machine learning models, particularly Random Forest and Multilayer Perceptron, can effectively identify antibiotic resistance genes in agricultural soil microbiomes using biologically derived sequence features, even under class-imbalanced conditions and limited sample sizes. Key biological predictors, including specific codon usage patterns (CTG, GCG, CGC), amino acid composition (Leucine, Valine), and GC content, were identified as robust discriminators between resistant and non-resistant sequences. These findings confirm that sequence-intrinsic features alone can provide sufficient signal for ARG detection independent of homology-based methods. The superior performance of ensemble and neural network approaches suggests their potential for integration into environmental monitoring pipelines. Future research should focus on validating these models on larger, geographically diverse datasets and incorporating additional feature types (e.g., genomic context, mobile genetic elements) to enhance generalizability and enable real-world deployment for antibiotic resistance surveillance in agricultural ecosystems.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Authorship contribution

KeshikNevisRazavi, S.R. contributed to the conceptualization, methodology, software development, investigation, data curation, writing of the original draft, and manuscript review and editing. Farahani, E. was involved in validation, writing of the original draft, review and editing of the manuscript, supervision of the research process, and overall project administration. Emami, H. contributed to the validation, formal analysis, investigation, data curation, and writing of the original draft. Abedinzadeh, N. participated in software development, data curation, and writing of the original draft. Abdolahi, M. contributed to the methodology, software implementation, validation, formal analysis, investigation, data curation, writing of the original draft, review and editing, and visualization of results. All authors have read and approved the final version of the manuscript.

Declaration of Generative AI and AI-assisted technologies in the writing process

The authors declare that no AI or AI-assisted technologies were used in the writing of this article.

Data availability statement

The data and materials used in this study are available upon reasonable request.

Acknowledgements

The authors would like to thank the contributors and maintainers of the NCBI database and the developers of the ARGs-OAP pipeline for providing open-access resources that made this study possible.

Ethical considerations

No human or animal subjects were involved, and therefore no ethical approval was required.

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.
