Alzheimer’s Diagnosis Prediction System
Author: Mihai Nan
A medical institute aims to develop an automated system to assist in
identifying patients at risk of Alzheimer’s disease, using clinical,
demographic, and behavioral data.
You are given a dataset of clinically evaluated patients, and your goal is to
build a binary classification model that can predict the diagnosis for new patients.
The following files are provided:
- train.csv – historical patient data, including the diagnosis
- test.csv – new patient data, without diagnosis
The main objective: predict the probability that a patient is diagnosed with Alzheimer’s disease
(a value between 0 and 1, where 0 = healthy patient, 1 = patient with Alzheimer’s).
📊 Dataset
Each row represents a patient and contains the following attributes:
🔑 Identification
- PatientID – unique patient identifier
👤 Demographic data
- Age – patient age
- Gender – sex (0 = female, 1 = male)
- Ethnicity – patient ethnicity (numerically encoded)
- EducationLevel – education level (numerically encoded)
🧬 Lifestyle and health
- BMI – body mass index
- Smoking – smoker (1) / non-smoker (0)
- AlcoholConsumption – alcohol consumption
- PhysicalActivity – level of physical activity
- DietQuality – diet quality
- SleepQuality – sleep quality
🏥 Medical history
- FamilyHistoryAlzheimers – family history of Alzheimer’s
- CardiovascularDisease – cardiovascular disease
- Diabetes – diabetes
- Depression – depression
- HeadInjury – head injury
- Hypertension – hypertension
- SystolicBP, DiastolicBP – blood pressure
- CholesterolTotal, CholesterolLDL, CholesterolHDL, CholesterolTriglycerides
🧠 Cognitive and functional assessments
- MMSE – Mini-Mental State Examination score
- FunctionalAssessment – functional assessment
- ADL – activities of daily living
⚠️ Cognitive and behavioral symptoms
- MemoryComplaints
- BehavioralProblems
- Confusion
- Disorientation
- PersonalityChanges
- DifficultyCompletingTasks
- Forgetfulness
🎯 Target label
- Diagnosis – only in train.csv
  - 1 = Alzheimer’s diagnosis
  - 0 = no Alzheimer’s diagnosis
The final goal: predict Diagnosis for the rows in test.csv.
📝 Tasks
The first two subtasks assess exploratory data analysis.
The final subtask evaluates the performance of the classification model.
Subtask 1 (10 points)
For each patient in test.csv, calculate how many patients in the training set (train.csv)
have the same age as the patient from the test set.
Output a natural number in the submission file (according to the format below).
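The counting step for Subtask 1 can be sketched as follows. This is a minimal illustration using only the standard library; the function name same_age_counts is hypothetical, and it assumes the Age columns of train.csv and test.csv have already been read and converted to integers.

```python
from collections import Counter


def same_age_counts(train_ages, test_ages):
    """For each test age, count the training patients with exactly that age."""
    # Build the age histogram of the training set once.
    counts = Counter(train_ages)
    # Counter returns 0 for ages that never occur in the training set.
    return [counts[age] for age in test_ages]


# Tiny in-memory example (made-up ages, not taken from the real dataset).
example = same_age_counts([60, 60, 72], [60, 72, 80])
```

Building the histogram once keeps the work linear in the size of both files, instead of rescanning the training set for every test patient.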
Subtask 2 (15 points)
For each patient in test.csv, determine the percentage of smokers
(Smoking = 1) among the patients in the training set (train.csv) who have
the same age as the respective patient.
Formula for a patient with age v:
(number of smokers in train with Age = v) / (total number of patients in train with Age = v) * 100
Output a real number between 0 and 100,
with at most 2 decimal places, for each patient.
If there are no patients with that age in the training set,
the output value should be 0.
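The formula and the fallback rule above can be sketched like this. It is only an illustration under stated assumptions: the function name smoker_percentage_by_age is hypothetical, and the training data is assumed to be available as (Age, Smoking) pairs of integers.

```python
from collections import defaultdict


def smoker_percentage_by_age(train_rows, test_ages):
    """train_rows: iterable of (age, smoking) pairs from the training set.

    Returns, for each test age, the percentage of smokers among the training
    patients of that age, rounded to 2 decimals (0.0 if the age is absent).
    """
    total = defaultdict(int)
    smokers = defaultdict(int)
    for age, smoking in train_rows:
        total[age] += 1
        if smoking == 1:
            smokers[age] += 1

    result = []
    for age in test_ages:
        n = total.get(age, 0)
        # No training patient with this age: the statement asks for 0.
        pct = 100.0 * smokers.get(age, 0) / n if n else 0.0
        result.append(round(pct, 2))
    return result


# Made-up example: three patients aged 60 (one smoker), one smoker aged 72.
example = smoker_percentage_by_age([(60, 1), (60, 0), (60, 0), (72, 1)], [60, 72, 80])
```

Rounding to two decimals is one way to satisfy the "at most 2 decimal places" requirement; whether the checker expects rounding or truncation is not stated, so treat that choice as an assumption.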
Subtask 3 (75 points)
Build a classification model that predicts the probability of an Alzheimer’s diagnosis
(p ∈ [0,1]) for each patient in test.csv.
Evaluation is performed using the ROC Curve and AUC (Area Under the ROC Curve).
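One possible baseline for this subtask is sketched below. The statement does not prescribe a model, so the choice of scikit-learn's GradientBoostingClassifier is an assumption, as is the hypothetical helper name fit_and_predict. The feature matrices are assumed to hold the numeric columns of train.csv and test.csv, excluding PatientID and Diagnosis.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def fit_and_predict(X_train, y_train, X_test, seed=0):
    """Fit a baseline classifier and return (test probabilities, CV AUC).

    X_train / X_test: 2D numeric feature matrices (lists or arrays).
    y_train: 0/1 Diagnosis labels for the training rows.
    """
    model = GradientBoostingClassifier(random_state=seed)
    # Estimate ROC AUC with 5-fold cross-validation before the final fit,
    # so the model can be compared against the thresholds in the statement.
    cv_auc = cross_val_score(
        model, X_train, y_train, cv=5, scoring="roc_auc"
    ).mean()
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of class 1 (Alzheimer's)
    # when the labels are 0 and 1.
    proba = model.predict_proba(X_test)[:, 1]
    return proba, cv_auc
```

The cross-validated AUC is only an estimate of the score on the hidden test labels; the returned probabilities are the values that would go into the Subtask 3 rows of the submission.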
🧮 Evaluation
- AUC ≥ 0.90 → 75 points
- AUC ≤ 0.70 → 0 points
- Values in between receive proportional scores
Subtasks 1–2 are evaluated exactly (by direct comparison).
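The scoring rule for Subtask 3 can be illustrated with a short sketch. It assumes that "proportional" means linear interpolation between the two thresholds, which the statement does not spell out; the function name subtask3_points is hypothetical.

```python
def subtask3_points(auc, low=0.70, high=0.90, max_points=75.0):
    """Map a ROC AUC value to points, assuming linear interpolation."""
    if auc >= high:
        return max_points
    if auc <= low:
        return 0.0
    # Assumed rule: score grows linearly from 0 at `low` to max_points at `high`.
    return max_points * (auc - low) / (high - low)
```

Under this assumption, an AUC of 0.80 sits halfway between the thresholds and would earn about half of the 75 points.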
📄 Submission file format
The file submission.csv must contain 3 rows for each patient in the test set,
corresponding to the 3 subtasks.
Structure:
subtaskID, datapointID, answer
where:
- subtaskID – 1, 2, or 3
- datapointID – the value of PatientID
- answer – depends on the subtask:
  - Subtask 1: number of patients with the same age in the training set (natural number)
  - Subtask 2: percentage of smokers with the same age as the patient (real number)
  - Subtask 3: probability of Alzheimer’s diagnosis (real number between 0 and 1)
Example for PatientID = 4751:
subtaskID,datapointID,answer
1,4751,23
2,4751,31.8
3,4751,0.873
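Assembling the file in this layout could look like the sketch below. The helper names build_rows and format_submission are hypothetical, and the three answer lists are assumed to be aligned with the list of PatientID values. Rows are grouped per patient to mirror the example; the statement does not say whether row order matters.

```python
import csv
import io


def build_rows(patient_ids, counts, percentages, probabilities):
    """Produce three (subtaskID, datapointID, answer) rows per patient."""
    rows = []
    for pid, count, pct, proba in zip(patient_ids, counts, percentages, probabilities):
        rows.append((1, pid, count))
        rows.append((2, pid, pct))
        rows.append((3, pid, proba))
    return rows


def format_submission(rows):
    """Render the rows as CSV text with the header from the statement."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subtaskID", "datapointID", "answer"])
    writer.writerows(rows)
    return buffer.getvalue()


# Reproduce the example for PatientID = 4751.
text = format_submission(build_rows([4751], [23], [31.8], [0.873]))
```

Writing the returned text to submission.csv (for instance with open("submission.csv", "w")) would complete the step.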
📊 Evaluation metric: ROC AUC 📈
For Subtask 3, evaluation is performed using ROC AUC (Area Under the ROC Curve),
a standard metric for binary classification problems.
Score interpretation:
- ROC AUC = 1.0 🏆 → perfect model
- ROC AUC = 0.5 🎲 → random classification
- ROC AUC > 0.5 📈 → higher values indicate better class separation
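To make the interpretation above concrete, ROC AUC can be computed directly as the fraction of (positive, negative) patient pairs in which the positive patient receives the higher predicted probability, with ties counted as half. The sketch below is a plain-Python illustration of that definition, not the grader's actual implementation; the function name roc_auc is hypothetical.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Four patients: three of the four positive/negative pairs are ordered correctly.
example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The double loop is quadratic, so it suits small sanity checks; for full datasets a library routine such as scikit-learn's roc_auc_score is the usual choice. Because only the ordering of the scores matters, the predicted probabilities do not need to be calibrated to reach a high AUC.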