Alzheimer’s Diagnosis Prediction System

Author: Mihai Nan

Easy
Problem Description

🧠 Alzheimer’s Diagnosis Prediction System 🩺

A medical institute aims to develop an automated system to assist in
identifying patients at risk of Alzheimer’s disease, using clinical,
demographic, and behavioral data.

You are given a dataset of clinically evaluated patients, and your goal is to
build a binary classification model that can predict the diagnosis for new patients.

The following files are provided:

  • train.csv – historical patient data, including the diagnosis
  • test.csv – new patient data, without diagnosis

The main objective: predict the probability that a patient is diagnosed with Alzheimer’s disease
(a value between 0 and 1, where 0 = healthy patient, 1 = patient with Alzheimer’s).


📊 Dataset

Each row represents a patient and contains the following attributes:

🔑 Identification

  • PatientID – unique patient identifier

👤 Demographic data

  • Age – patient age
  • Gender – sex (0 = female, 1 = male)
  • Ethnicity – patient ethnicity (numerically encoded)
  • EducationLevel – education level (numerically encoded)

🧬 Lifestyle and health

  • BMI – body mass index
  • Smoking – smoker (1) / non-smoker (0)
  • AlcoholConsumption – alcohol consumption
  • PhysicalActivity – level of physical activity
  • DietQuality – diet quality
  • SleepQuality – sleep quality

🏥 Medical history

  • FamilyHistoryAlzheimers – family history of Alzheimer’s
  • CardiovascularDisease – cardiovascular disease
  • Diabetes – diabetes
  • Depression – depression
  • HeadInjury – head injury
  • Hypertension – hypertension
  • SystolicBP, DiastolicBP – blood pressure
  • CholesterolTotal, CholesterolLDL, CholesterolHDL, CholesterolTriglycerides – cholesterol and triglyceride levels

🧠 Cognitive and functional assessments

  • MMSE – Mini-Mental State Examination score
  • FunctionalAssessment – functional assessment
  • ADL – activities of daily living

⚠️ Cognitive and behavioral symptoms

  • MemoryComplaints
  • BehavioralProblems
  • Confusion
  • Disorientation
  • PersonalityChanges
  • DifficultyCompletingTasks
  • Forgetfulness

🎯 Target label

  • Diagnosis – only in train.csv
    • 1 = Alzheimer’s diagnosis
    • 0 = no Alzheimer’s diagnosis

The final goal: predict Diagnosis for the rows in test.csv.


📝 Tasks

The first two subtasks assess exploratory data analysis.
The final subtask evaluates the performance of the classification model.


Subtask 1 (10 points)

For each patient in test.csv, calculate how many patients in the training set (train.csv)
have the same age as the patient from the test set.

Output a natural number (0 if no training patient has that age) in the submission file (according to the format below).
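
The counting step can be sketched with the standard library alone. This is a minimal illustration, not a reference solution: the function name `same_age_counts` and the toy age lists are assumptions, and in practice the ages would be read from the Age column of train.csv and test.csv.

```python
from collections import Counter


def same_age_counts(train_ages, test_ages):
    """For each test age, count the training patients with that exact age."""
    counts = Counter(train_ages)
    # A Counter returns 0 for keys it has never seen, which covers
    # test ages that do not occur in the training set.
    return [counts[age] for age in test_ages]


# Toy data (hypothetical), standing in for the Age columns.
train_ages = [70, 70, 82, 65]
test_ages = [70, 65, 90]
print(same_age_counts(train_ages, test_ages))  # [2, 1, 0]
```

Building the frequency table once keeps the work linear in the size of both files, instead of rescanning the training set for every test patient.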


Subtask 2 (15 points)

For each patient in test.csv, determine the percentage of smokers
(Smoking = 1) among the patients in the training set (train.csv)
who have the same age as that patient.

Formula for a patient with age v:

(number of smokers in train with Age = v) / (total number of patients in train with Age = v) * 100

Output a real number between 0 and 100,
with at most 2 decimal places, for each patient.

If there are no patients with that age in the training set,
the output value should be 0.
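
The formula and the zero fallback can be sketched as follows. This is an illustrative sketch under stated assumptions: `smoker_percentages` is a hypothetical helper, the training rows are given as (Age, Smoking) pairs, and rounding to 2 decimals is one way to satisfy the "at most 2 decimal places" requirement.

```python
from collections import defaultdict


def smoker_percentages(train_rows, test_ages):
    """train_rows: iterable of (age, smoking) pairs taken from train.csv.

    Returns, for each test age, the percentage of smokers among training
    patients of that age, or 0.0 when the age is absent from training.
    """
    total = defaultdict(int)
    smokers = defaultdict(int)
    for age, smoking in train_rows:
        total[age] += 1
        if smoking == 1:
            smokers[age] += 1

    result = []
    for age in test_ages:
        n = total.get(age, 0)
        if n == 0:
            result.append(0.0)  # no training patient with this age
        else:
            result.append(round(smokers.get(age, 0) / n * 100, 2))
    return result


# Toy data (hypothetical): three patients aged 70 (one smoker), one aged 82.
train_rows = [(70, 1), (70, 0), (70, 0), (82, 1)]
print(smoker_percentages(train_rows, [70, 82, 90]))  # [33.33, 100.0, 0.0]
```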


Subtask 3 (75 points)

Build a classification model that predicts the probability of an Alzheimer’s diagnosis
(p ∈ [0, 1]) for each patient in test.csv.

Evaluation is performed using the ROC Curve and AUC (Area Under the ROC Curve).
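
One possible shape for such a model is sketched below, assuming scikit-learn and NumPy are available. The choice of gradient boosting is an assumption, not something the problem prescribes, and the data here is synthetic so the sketch runs on its own. With the real files, the feature matrix would come from train.csv with PatientID and Diagnosis dropped.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels (assumption):
# 300 "patients", 5 numeric features, label driven by the first two.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
noise = rng.normal(scale=0.5, size=300)
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + noise > 0).astype(int)
X_test = rng.normal(size=(50, 5))

model = GradientBoostingClassifier(random_state=0)

# Estimate ROC AUC on held-out folds before trusting the model.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("mean CV AUC:", cv_auc.mean())

# Fit on all training data, then take the probability of class 1.
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
```

Since the metric is AUC, the submission should carry the probabilities from `predict_proba` rather than hard 0/1 labels from `predict`, because thresholded labels discard the ranking information that AUC measures.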


🧮 Evaluation

  • AUC ≥ 0.90 → 75 points
  • AUC ≤ 0.70 → 0 points
  • Values in between receive proportional scores
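
The scoring rule above can be sketched as a small function. The statement only says that intermediate values score "proportionally", so the linear interpolation between 0.70 and 0.90 used here is an assumption, and `subtask3_points` is a hypothetical name.

```python
def subtask3_points(auc, low=0.70, high=0.90, max_points=75.0):
    """Map an AUC to points, assuming linear interpolation between thresholds."""
    if auc >= high:
        return max_points
    if auc <= low:
        return 0.0
    return max_points * (auc - low) / (high - low)


for auc in (0.65, 0.80, 0.93):
    print(auc, round(subtask3_points(auc), 2))
```

Under this assumption, an AUC of 0.80 sits halfway between the thresholds and would earn about half of the 75 points.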

Subtasks 1–2 are evaluated exactly (by direct comparison).


📄 Submission file format

The file submission.csv must contain 3 rows for each patient in the test set,
corresponding to the 3 subtasks.

Structure:

subtaskID,datapointID,answer

where:

  • subtaskID – 1, 2, or 3
  • datapointID – the value of PatientID
  • answer – depends on the subtask:
    • Subtask 1: number of patients with the same age in the training set (natural number)
    • Subtask 2: percentage of smokers with the same age as the patient (real number)
    • Subtask 3: probability of Alzheimer’s diagnosis (real number between 0 and 1)

Example for PatientID = 4751:

subtaskID,datapointID,answer
1,4751,23
2,4751,31.8
3,4751,0.873
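
Assembling the file can be sketched with the standard `csv` module. The helper names are hypothetical, and the row order (all Subtask 1 rows, then Subtask 2, then Subtask 3) is an assumption, since the statement does not say whether order matters. The sketch writes to an in-memory buffer; an open file handle for submission.csv would work the same way.

```python
import csv
import io


def build_submission_rows(patient_ids, counts, percentages, probabilities):
    """Return (subtaskID, datapointID, answer) tuples, three per patient."""
    rows = []
    per_subtask = (counts, percentages, probabilities)
    for subtask_id, answers in enumerate(per_subtask, start=1):
        for patient_id, answer in zip(patient_ids, answers):
            rows.append((subtask_id, patient_id, answer))
    return rows


def write_submission(rows, file_obj):
    """Write the header and all rows in the required CSV layout."""
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(["subtaskID", "datapointID", "answer"])
    writer.writerows(rows)


# Reproduce the example for PatientID = 4751.
rows = build_submission_rows([4751], [23], [31.8], [0.873])
buffer = io.StringIO()
write_submission(rows, buffer)
print(buffer.getvalue())
```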

📊 Evaluation metric: ROC AUC 📈

For Subtask 3, evaluation is performed using ROC AUC (Area Under the ROC Curve),
a standard metric for binary classification problems.

Score interpretation:

  • ROC AUC = 1.0 🏆 → perfect model
  • ROC AUC = 0.5 🎲 → random classification
  • ROC AUC > 0.5 📈 → higher values indicate better class separation
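
To make the interpretation concrete, ROC AUC can be computed directly as the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one, with ties counted as half. The function below is an illustrative sketch of that definition (quadratic in the number of patients), not the platform's evaluator. Library routines such as scikit-learn's `roc_auc_score` compute the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc([0, 1], [0.5, 0.5]))                   # 0.5
```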