This project aims to build a multi-modal diagnostic system that detects COVID-19, pneumonia, and other respiratory illnesses by combining medical imaging data (chest X-rays and CT scans) with clinical records (symptoms, lab values, and demographic details).
Data preprocessing is the first step toward a high-performing model and the main focus of this project: the data must be clean, normalized, well-aligned, and suitable for training a robust diagnostic model.

Preprocessing of the clinical records addresses the following problems:

Problem | Solution |
---|---|
Missing values | Filled using statistical imputation (mean, median) or kNN imputation |
Inconsistent entries | Unified using a mapping (e.g., "Fever", "fever" → "fever") |
Mixed data types | Encoded categorical data as numeric codes (e.g., gender: 0 for male, 1 for female) |
Different scales | Normalized numerical values (e.g., CRP, oxygen saturation) to a common scale |
Redundant features | Removed irrelevant or highly correlated columns |
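
The clinical-data steps in the table above can be sketched with pandas. This is a minimal illustration rather than the project's actual pipeline: the column names (`symptom`, `gender`, `crp`, `spo2`), the choice of median imputation and min-max scaling, and the 0.95 correlation threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess_clinical(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a clinical table. Column names are hypothetical."""
    out = df.copy()

    # Inconsistent entries: trim and lower-case so "Fever" and "fever" match.
    out["symptom"] = out["symptom"].str.strip().str.lower()

    # Mixed data types: encode gender as 0 (male) / 1 (female).
    gender = out["gender"].str.strip().str.lower()
    out["gender"] = gender.map({"male": 0, "female": 1})

    numeric_cols = ["crp", "spo2"]

    # Missing values: fill each numeric column with its median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Different scales: min-max scale each numeric column to [0, 1].
    col_min = out[numeric_cols].min()
    col_range = (out[numeric_cols].max() - col_min).replace(0, 1)
    out[numeric_cols] = (out[numeric_cols] - col_min) / col_range
    return out


def drop_highly_correlated(df, cols, threshold=0.95):
    """Redundant features: drop one column of each highly correlated pair."""
    corr = df[cols].corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


# Tiny synthetic table, not real patient data.
records = pd.DataFrame({
    "symptom": ["Fever", "fever ", "Cough", "cough"],
    "gender": ["Male", "female", "Female", "male"],
    "crp": [10.0, np.nan, 30.0, 50.0],
    "spo2": [98.0, 92.0, np.nan, 88.0],
})
clean = preprocess_clinical(records)
```

In a real training setup, the medians and min/max values would normally be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.
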

Preprocessing of the imaging data covers the following tasks:

Task | Technique |
---|---|
Image format | Converted DICOM to PNG or JPG |
Resize | Standardized to a fixed size (e.g., 224×224 or 256×256) |
Normalize | Pixel intensities scaled to [0, 1] or standardized (zero mean, unit variance) |
Data augmentation | Flip, rotate, zoom, shift (to reduce overfitting) |
Noise removal and contrast enhancement | Denoising filters; histogram equalization for contrast |
Lung segmentation (optional) | Segmented lungs using a pre-trained U-Net model |
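
Several of the imaging steps above (resize, normalization, histogram equalization, and simple augmentation) can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the project's implementation: the function names are hypothetical, nearest-neighbour resizing stands in for whatever interpolation the real pipeline uses, and the input is a synthetic array rather than a decoded DICOM file. DICOM conversion and U-Net segmentation are left out because they depend on external libraries and model weights.

```python
import numpy as np


def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize of a 2-D image (kept dependency-free)."""
    rows = np.arange(size[0]) * image.shape[0] // size[0]
    cols = np.arange(size[1]) * image.shape[1] // size[1]
    return image[rows[:, None], cols]


def min_max_normalize(image):
    """Scale pixel intensities to [0, 1]."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Constant image: avoid division by zero.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def equalize_histogram(image, bins=256):
    """Histogram equalization for an image already scaled to [0, 1]."""
    hist, edges = np.histogram(image.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(np.float32)
    cdf /= cdf[-1]
    # Map each pixel through the cumulative distribution function.
    return np.interp(image.ravel(), edges[:-1], cdf).reshape(image.shape)


def augment(image, rng, max_shift=10):
    """Random horizontal flip and small shift, for training images only.

    Note: flipping changes left/right anatomy, and np.roll wraps pixels
    around the border; a real pipeline might pad instead.
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))


rng = np.random.default_rng(0)
# Synthetic stand-in for a 12-bit scan, not real imaging data.
scan = rng.integers(0, 4096, size=(512, 512)).astype(np.uint16)
prepared = equalize_histogram(min_max_normalize(resize_nearest(scan)))
augmented = augment(prepared, rng)
```

Augmentation is typically applied on the fly to training images only, while resizing and normalization are applied identically to training, validation, and test images.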