Project Title: Data Preprocessing for a Multi-Modal Diagnosis Model for COVID-19 and Other Respiratory Diseases

Project Summary

The goal of this project is to create a multi-modal diagnostic system that can detect COVID-19, pneumonia, and other respiratory illnesses by combining medical imaging data (like chest X-rays and CT scans) with clinical records (such as symptoms, lab values, and demographic details).

The first and most important step toward building a high-performing AI model is data preprocessing. This project focuses on that phase: making sure the data is clean, normalized, aligned across modalities, and suitable for training a robust diagnostic model.

Objective

Types of Input Data

1. Clinical Features

2. Medical Imaging

Preprocessing Steps

A. Clinical Data Preprocessing

Problem                 Solution
--------------------    ------------------------------------------------------------
Missing values          Filled using statistical imputation (mean, median) or kNN imputation
Inconsistent entries    Unified using mapping (e.g., "Fever", "fever" → "fever")
Mixed data types        Encoded categorical data (e.g., gender: 0 for Male, 1 for Female)
Different scales        Normalized numerical values (like CRP, oxygen saturation)
Redundant features      Removed irrelevant or highly correlated columns
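
The first four rows of the table can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual pipeline: the column names (symptom, gender, crp, spo2), the toy values, and the helper preprocess_clinical are all hypothetical, and median imputation with min-max scaling is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
raw = pd.DataFrame({
    "symptom": ["Fever", "fever ", "Cough", "cough"],
    "gender": ["Male", "Female", "Female", "Male"],
    "crp": [12.0, np.nan, 48.0, 30.0],    # C-reactive protein
    "spo2": [97.0, 91.0, np.nan, 94.0],   # oxygen saturation (%)
})

NUMERIC_COLS = ["crp", "spo2"]


def preprocess_clinical(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a clinical table (assumed schema, see the toy data above)."""
    out = df.copy()

    # Inconsistent entries: trim whitespace and lowercase free-text labels.
    out["symptom"] = out["symptom"].str.strip().str.lower()

    # Mixed data types: encode the categorical column as integers.
    out["gender"] = out["gender"].map({"Male": 0, "Female": 1})

    # Missing values: fill each numeric column with its own median.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].median())

    # Different scales: min-max scale each numeric column to [0, 1].
    col_min = out[NUMERIC_COLS].min()
    col_range = (out[NUMERIC_COLS].max() - col_min).replace(0, 1)  # avoid /0
    out[NUMERIC_COLS] = (out[NUMERIC_COLS] - col_min) / col_range

    return out


clean = preprocess_clinical(raw)
print(clean)
```

In a real training setup the median, minimum, and range would be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.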

B. Imaging Data Preprocessing

Task                                      Technique
--------------------------------------    ------------------------------------------------------------
Image format                              Converted DICOM to PNG or JPG
Resize                                    Standardized to a fixed size (e.g., 224×224 or 256×256)
Normalize                                 Pixel intensity scaled to [0, 1] or standardized
Data augmentation                         Flip, rotate, zoom, shift (to reduce overfitting)
Noise removal and contrast enhancement    Denoising filters; histogram equalization
Lung segmentation (optional)              Pre-trained U-Net model
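
The resize, normalize, equalize, and augment rows can be illustrated with plain NumPy. The sketch below is hedged: it assumes the scan has already been decoded into a 2-D array (the DICOM conversion and U-Net segmentation steps are not shown), the synthetic 512×512 array stands in for a real image, and all function names are hypothetical. Nearest-neighbour resizing and a wrap-around shift are used only to keep the example dependency-free; an imaging library would normally interpolate and pad instead.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D image to (size, size)."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]


def scale_01(img: np.ndarray) -> np.ndarray:
    """Min-max scale pixel intensities to the range [0, 1]."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:                       # constant image: avoid division by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def equalize_hist(img01: np.ndarray, bins: int = 256) -> np.ndarray:
    """Global histogram equalization for an image already in [0, 1]."""
    hist, edges = np.histogram(img01, bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum() / hist.sum()
    # Map every pixel through the cumulative distribution function.
    return np.interp(img01.ravel(), edges[:-1], cdf).reshape(img01.shape)


def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random horizontal flip plus a small horizontal shift."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    shift = int(rng.integers(-10, 11))
    # np.roll wraps pixels around the edge; a real pipeline would pad instead.
    return np.roll(img, shift, axis=1)


rng = np.random.default_rng(0)
# Synthetic stand-in for a decoded 12-bit scan (not real patient data).
scan = rng.integers(0, 4096, size=(512, 512)).astype(np.uint16)

resized = resize_nearest(scan, 224)
normalized = scale_01(resized)
equalized = equalize_hist(normalized)
augmented = augment(equalized, rng)
print(resized.shape, normalized.min(), normalized.max(), augmented.shape)
```

Augmentation is applied at random each time an image is drawn, so it belongs in the training loader only; validation and test images would go through the deterministic steps alone.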

C. Multi-Modal Alignment

Output Dataset Features

Tools & Technologies

Advantages of Multi-Modal Diagnosis

Outcome of Preprocessing

Real-World Applications

What I Learned