The goal of this project is to create a multi-modal diagnostic system that can detect COVID-19, pneumonia, and other respiratory illnesses by combining medical imaging data (like chest X-rays and CT scans) with clinical records (such as symptoms, lab values, and demographic details).
The first and most crucial step toward building a high-performing AI model is data preprocessing. This project focuses on that phase, making sure the data is clean, normalized, well-aligned, and suitable for training a robust diagnostic model.

Clinical records preprocessing:

| Problem | Solution |
|---|---|
| Missing values | Filled using statistical imputation (mean, median) or kNN imputation |
| Inconsistent entries | Unified using mapping (e.g., "Fever", "fever" → "fever") |
| Mixed data types | Encoded categorical data (e.g., gender: 0 for Male, 1 for Female) |
| Different scales | Normalized numerical values (like CRP, oxygen saturation) |
| Redundant features | Removed irrelevant or highly correlated columns |
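
The clinical-record steps in the table above could be sketched roughly as follows. This is a minimal illustration using pandas, not the project's actual pipeline: the column names (`symptom`, `gender`, `crp`, `spo2`), the helper names, and the 0.95 correlation threshold are all assumptions made for the example. It uses median imputation only; kNN imputation (for instance scikit-learn's `KNNImputer`) could replace that step.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; adjust to the real clinical schema.
NUMERIC_COLS = ["crp", "spo2"]


def preprocess_clinical(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a small clinical table (assumed columns: symptom, gender, crp, spo2)."""
    out = df.copy()

    # Inconsistent entries: trim whitespace and lowercase ("Fever" -> "fever").
    out["symptom"] = out["symptom"].str.strip().str.lower()

    # Mixed data types: encode gender as 0 (male) or 1 (female).
    gender = out["gender"].str.strip().str.lower()
    out["gender"] = gender.map({"male": 0, "female": 1})

    for col in NUMERIC_COLS:
        # Missing values: fill with the column median.
        out[col] = out[col].fillna(out[col].median())

        # Different scales: min-max normalize to [0, 1].
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0

    return out


def drop_correlated(df: pd.DataFrame, cols, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of any pair whose absolute correlation exceeds threshold."""
    corr = df[cols].corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


# Tiny synthetic example (not real patient data).
raw = pd.DataFrame({
    "symptom": ["Fever", "fever ", "Cough", "cough"],
    "gender": ["Male", "Female", "female", "male"],
    "crp": [10.0, None, 30.0, 50.0],
    "spo2": [98.0, 92.0, None, 88.0],
})
clean = preprocess_clinical(raw)
```

In a real pipeline the imputation and scaling statistics would normally be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.
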

Imaging data preprocessing:

| Task | Technique |
|---|---|
| Image format | Converted DICOM to PNG or JPG |
| Resize | Standardized to fixed size (e.g., 224×224 or 256×256) |
| Normalize | Pixel intensities scaled to [0, 1] or standardized (zero mean, unit variance) |
| Data augmentation | Flip, rotate, zoom, shift (to prevent overfitting) |
| Noise removal and contrast enhancement | Denoising filters; histogram equalization for contrast |
| Optional step | Lung segmentation using pre-trained U-Net model |
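
Several of the imaging steps in the table above can be illustrated with NumPy alone. The sketch below is an assumption-laden example rather than the project's implementation: the function names are hypothetical, the resize uses plain nearest-neighbor sampling as a stand-in for a library resize, and augmentation is limited to a random horizontal flip and a zero-filled horizontal shift. DICOM conversion and U-Net lung segmentation are not covered here.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Nearest-neighbor resize of a 2-D grayscale image."""
    h, w = img.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows[:, None], cols]


def scale_to_unit(img: np.ndarray) -> np.ndarray:
    """Scale pixel intensities to [0, 1]."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def equalize_histogram(img: np.ndarray, bins: int = 256) -> np.ndarray:
    """Histogram equalization for an image whose values lie in [0, 1]."""
    hist, edges = np.histogram(img.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(np.float32)
    cdf /= cdf[-1]  # normalize the cumulative distribution to end at 1
    out = np.interp(img.ravel(), edges[:-1], cdf)
    return out.reshape(img.shape).astype(np.float32)


def augment(img: np.ndarray, rng: np.random.Generator, max_shift: int = 10) -> np.ndarray:
    """Random horizontal flip followed by a zero-filled horizontal shift."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(img)
    if shift > 0:
        shifted[:, shift:] = img[:, :-shift]
    elif shift < 0:
        shifted[:, :shift] = img[:, -shift:]
    else:
        shifted = img.copy()
    return shifted


# Synthetic 12-bit image standing in for a decoded X-ray (not real data).
rng = np.random.default_rng(0)
raw_image = rng.integers(0, 4096, size=(512, 400)).astype(np.uint16)

image = scale_to_unit(resize_nearest(raw_image))
image = equalize_histogram(image)
augmented = augment(image, rng)
```

Augmentation is usually applied only to training images, while resizing and normalization are applied identically to every split.
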