Paper 13407-111
Vision transformer for efficient chest x-ray and gastrointestinal image classification
19 February 2025 • 5:30 PM - 7:00 PM PST | Golden State Ballroom
Abstract
Medical image analysis has emerged as a critical research domain because of its usefulness in clinical applications such as early disease diagnosis and treatment. Convolutional neural networks (CNNs) have become standard in medical image analysis due to their superior ability to interpret complex features, often outperforming humans. In addition to CNNs, transformer architectures have also gained popularity for medical image analysis tasks. However, despite progress in the field, there is still room for improvement. This study evaluates and compares CNN-based and transformer-based methods, employing diverse data augmentation techniques, on three medical image datasets. For the Chest X-ray dataset, the vision transformer (ViT) model achieved the highest F1 score of 0.9532 and a Matthews correlation coefficient (MCC) of 0.9259. Similarly, for the Kvasir dataset, we achieved an F1 score of 0.9436 and an MCC of 0.9360. For the Kvasir-Capsule dataset, the ViT model achieved an F1 score of 0.7156 and an MCC of 0.3705. We found that the transformer-based models outperformed various CNN models in classifying different anatomical structures, findings, and abnormalities in medic