Alzheimer’s Diagnosis Prediction System

Author: Mihai Nan

Difficulty: Easy

Problem Description

A medical institute aims to develop an automated system to assist in
identifying patients at risk of Alzheimer’s disease, using clinical,
demographic, and behavioral data.

You are given a dataset of clinically evaluated patients, and your goal is to
build a binary classification model that can predict the diagnosis for new patients.

The following files are provided:

train.csv – historical patient data, including the diagnosis
test.csv – new patient data, without diagnosis

The main objective is to predict, for each patient, the probability of an Alzheimer’s diagnosis:
a value between 0 and 1, where values near 0 indicate a healthy patient and values near 1 a patient with Alzheimer’s.

📊 Dataset

Each row represents a patient and contains the following attributes:

🔑 Identification

PatientID – unique patient identifier

👤 Demographic data

Age – patient age
Gender – sex (0 = female, 1 = male)
Ethnicity – patient ethnicity (numerically encoded)
EducationLevel – education level (numerically encoded)

🧬 Lifestyle and health

BMI – body mass index
Smoking – smoker (1) / non-smoker (0)
AlcoholConsumption – alcohol consumption
PhysicalActivity – level of physical activity
DietQuality – diet quality
SleepQuality – sleep quality

🏥 Medical history

FamilyHistoryAlzheimers – family history of Alzheimer’s
CardiovascularDisease – cardiovascular disease
Diabetes – diabetes
Depression – depression
HeadInjury – head injury
Hypertension – hypertension
SystolicBP, DiastolicBP – blood pressure
CholesterolTotal, CholesterolLDL, CholesterolHDL, CholesterolTriglycerides – cholesterol and triglyceride levels

🧠 Cognitive and functional assessments

MMSE – Mini-Mental State Examination score
FunctionalAssessment – functional assessment
ADL – activities of daily living

⚠️ Cognitive and behavioral symptoms

MemoryComplaints
BehavioralProblems
Confusion
Disorientation
PersonalityChanges
DifficultyCompletingTasks
Forgetfulness

🎯 Target label

Diagnosis – only in train.csv
- 1 = Alzheimer’s diagnosis
- 0 = no Alzheimer’s diagnosis

The final goal: predict the probability that Diagnosis = 1 for each row in test.csv.

📝 Tasks

The first two subtasks assess exploratory data analysis.
The final subtask evaluates the performance of the classification model.

Subtask 1 (10 points)

For each patient in test.csv, calculate how many patients in the training set (train.csv)
have the same age as the patient from the test set.

Output a natural number in the submission file (according to the format below).
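
A minimal sketch of this count, assuming the Age columns of train.csv and test.csv have already been read into two lists with the same type (the function name and the sample ages are hypothetical, not taken from the dataset):

```python
from collections import Counter

def same_age_counts(train_ages, test_ages):
    """For each test age, count the training patients with exactly that age."""
    counts = Counter(train_ages)  # age -> number of training patients
    # A Counter returns 0 for ages that never occur in the training set.
    return [counts[age] for age in test_ages]

# Made-up ages, for illustration only.
example_counts = same_age_counts([70, 70, 81, 65], [70, 65, 90])
print(example_counts)  # [2, 1, 0]
```

Building the counter once keeps the work linear in the size of both files, instead of rescanning the training set for every test patient.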

Subtask 2 (15 points)

For each patient in test.csv, determine the percentage of smokers
(Smoking = 1) among the patients in the training set (train.csv)
who have the same age as that patient.

Formula for a patient with age v:

(number of smokers in train with Age = v) / (total number of patients in train with Age = v) * 100

Output a real number between 0 and 100,
with at most 2 decimal places, for each patient.

If there are no patients with that age in the training set,
the output value should be 0.
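
The formula and the zero fallback can be sketched as follows. This assumes the training rows are available as (Age, Smoking) pairs with Smoking in {0, 1}; the function name and sample values are hypothetical. Rounding with `round(..., 2)` is one reading of "at most 2 decimal places", since the statement does not say which rounding rule the checker expects.

```python
from collections import Counter

def smoker_percentages(train_rows, test_ages):
    """train_rows: iterable of (age, smoking) pairs, smoking being 0 or 1."""
    total = Counter()    # age -> number of training patients
    smokers = Counter()  # age -> number of training smokers
    for age, smoking in train_rows:
        total[age] += 1
        smokers[age] += smoking
    result = []
    for age in test_ages:
        if total[age] == 0:
            result.append(0.0)  # no training patient has this age
        else:
            result.append(round(100 * smokers[age] / total[age], 2))
    return result

# Made-up rows, for illustration only.
example_pct = smoker_percentages([(70, 1), (70, 0), (70, 0), (65, 1)], [70, 65, 90])
print(example_pct)  # [33.33, 100.0, 0.0]
```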

Subtask 3 (75 points)

Build a classification model that predicts the probability of an Alzheimer’s diagnosis
(p ∈ [0,1]) for each patient in test.csv.

Evaluation is performed using the ROC Curve and AUC (Area Under the ROC Curve).
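
One possible baseline, shown only as a sketch: a scaled logistic regression from scikit-learn, with 5-fold cross-validated AUC as a local estimate before submitting. The statement does not prescribe any model, so the model choice and the function name `fit_and_predict` are assumptions, and the data below is synthetic. With the real files, the features would be the columns of train.csv other than Diagnosis; PatientID is an identifier and is probably best left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_and_predict(X_train, y_train, X_test):
    """Return (mean cross-validated AUC on train, P(Diagnosis = 1) for X_test)."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Local estimate of the competition metric, using training data only.
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of the positive class (label 1).
    proba = model.predict_proba(X_test)[:, 1]
    return cv_auc, proba

# Synthetic stand-in data (NOT the competition dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
cv_auc, proba = fit_and_predict(X[:150], y[:150], X[150:])
print(round(cv_auc, 3), proba[:3])
```

Whether such a baseline reaches the AUC needed for full points on the real data is not known from the statement; a tree-based model may be worth comparing using the same cross-validation.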

🧮 Evaluation

AUC ≥ 0.90 → 75 points
AUC ≤ 0.70 → 0 points
Values in between receive proportional scores

Subtasks 1–2 are evaluated exactly (by direct comparison).
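
The Subtask 3 thresholds above can be turned into a small scoring helper. This sketch assumes "proportional" means linear interpolation between AUC 0.70 and 0.90, which the statement does not spell out; the function name is hypothetical.

```python
def subtask3_points(auc, low=0.70, high=0.90, max_points=75.0):
    """Map an AUC to Subtask 3 points, assuming linear scaling between the thresholds."""
    if auc <= low:
        return 0.0
    if auc >= high:
        return max_points
    return max_points * (auc - low) / (high - low)

for auc in (0.65, 0.80, 0.95):
    print(auc, round(subtask3_points(auc), 2))
```

Under this assumption, an AUC of 0.80 sits halfway between the thresholds and would earn about 37.5 points.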

📄 Submission file format

The file submission.csv must contain 3 rows for each patient in the test set,
corresponding to the 3 subtasks.

Structure:

subtaskID,datapointID,answer

where:

subtaskID – 1, 2, or 3
datapointID – the value of PatientID
answer – depends on the subtask:
- Subtask 1: number of patients with the same age in the training set (natural number)
- Subtask 2: percentage of smokers among training-set patients with the same age as the patient (real number between 0 and 100)
- Subtask 3: probability of Alzheimer’s diagnosis (real number between 0 and 1)

Example for `PatientID = 4751`:

subtaskID,datapointID,answer
1,4751,23
2,4751,31.8
3,4751,0.873
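
A sketch of writing this format with Python's csv module. The function name `write_submission` is hypothetical, and the sketch assumes that row order does not matter to the checker, so rows are grouped by subtask rather than by patient.

```python
import csv
import io

def write_submission(out, patient_ids, counts, percentages, probabilities):
    """Write one row per (subtask, patient) in subtaskID,datapointID,answer format."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["subtaskID", "datapointID", "answer"])
    # Subtask 1 answers first, then Subtask 2, then Subtask 3.
    for subtask_id, answers in enumerate((counts, percentages, probabilities), start=1):
        for patient_id, answer in zip(patient_ids, answers):
            writer.writerow([subtask_id, patient_id, answer])

# Reproduce the example for PatientID = 4751 in memory.
buffer = io.StringIO()
write_submission(buffer, [4751], [23], [31.8], [0.873])
print(buffer.getvalue())
```

To produce the real file, the same function can be called with `open("submission.csv", "w", newline="")` in place of the in-memory buffer.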

📊 Evaluation metric: ROC AUC 📈

For Subtask 3, evaluation is performed using ROC AUC (Area Under the ROC Curve),
a standard metric for binary classification problems.

Score interpretation:

ROC AUC = 1.0 🏆 → perfect model
ROC AUC = 0.5 🎲 → random classification
ROC AUC > 0.5 📈 → higher values indicate better class separation
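
To make the interpretation concrete: ROC AUC equals the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustrative pairwise implementation with made-up labels and scores, not the grader's code, and it is quadratic in the number of patients, so a library routine is preferable for large files.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the AUC unchanged.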
