Author: Mihai Nan
A medical institute aims to develop an automated system to assist in
identifying patients at risk of Alzheimer’s disease, using clinical,
demographic, and behavioral data.
You are given a dataset of clinically evaluated patients, and your goal is to
build a binary classification model that can predict the diagnosis for new patients.
The following files are provided:
The main objective: predict the probability that a patient is diagnosed with Alzheimer’s disease
(a value between 0 and 1, where 0 = healthy patient, 1 = patient with Alzheimer’s).
Each row represents a patient and contains the following attributes:
PatientID – unique patient identifier
Age – patient age
Gender – sex (0 = female, 1 = male)
Ethnicity – patient ethnicity (numerically encoded)
EducationLevel – education level (numerically encoded)
BMI – body mass index
Smoking – smoker (1) / non-smoker (0)
AlcoholConsumption – alcohol consumption
PhysicalActivity – level of physical activity
DietQuality – diet quality
SleepQuality – sleep quality
FamilyHistoryAlzheimers – family history of Alzheimer’s
CardiovascularDisease – cardiovascular disease
Diabetes – diabetes
Depression – depression
HeadInjury – head injury
Hypertension – hypertension
SystolicBP, DiastolicBP – blood pressure
CholesterolTotal, CholesterolLDL, CholesterolHDL, CholesterolTriglycerides
MMSE – Mini-Mental State Examination score
FunctionalAssessment – functional assessment
ADL – activities of daily living
MemoryComplaints
BehavioralProblems
Confusion
Disorientation
PersonalityChanges
DifficultyCompletingTasks
Forgetfulness
Diagnosis – only in train.csv (1 = Alzheimer’s diagnosis, 0 = no Alzheimer’s diagnosis)
The final goal: predict Diagnosis for the rows in test.csv.
The first two subtasks assess exploratory data analysis.
The final subtask evaluates the performance of the classification model.
Subtask 1. For each patient in test.csv, count the patients in the training set (train.csv)
whose Age equals the age of that test patient.
Output this count as a natural number in the submission file (in the format described below).
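The counting step can be sketched with a frequency table built once over the training ages. This is a minimal illustration, not a reference solution: the function name same_age_counts and the toy ages are made up here, and in practice the two lists would be read from the Age column of train.csv and test.csv.

```python
from collections import Counter

def same_age_counts(train_ages, test_ages):
    """For each test age, count the training patients with exactly that age."""
    counts = Counter(train_ages)
    # A Counter returns 0 for keys it has never seen, so ages that do not
    # occur in the training set naturally get a count of 0.
    return [counts[age] for age in test_ages]

# Toy ages, invented for illustration (not taken from the real dataset).
train_ages = [70, 70, 82, 65, 70]
test_ages = [70, 65, 90]
print(same_age_counts(train_ages, test_ages))  # [3, 1, 0]
```

Building the table once keeps the work linear in the size of the two files, instead of rescanning the training set for every test patient.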
Subtask 2. For each patient in test.csv, determine the percentage of smokers
(Smoking = 1) among the patients in the training set (train.csv) who have the same age
as that patient.
Formula for a patient with age v:
(number of smokers in train with Age = v) / (total number of patients in train with Age = v) * 100
Output a real number between 0 and 100,
with at most 2 decimal places, for each patient.
If there are no patients with that age in the training set,
the output value should be 0.
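The formula and the zero fallback can be sketched as follows. The representation of a training row as an (age, smoking) pair and the function name smoker_percentages are assumptions made for this example; the toy values are invented.

```python
from collections import defaultdict

def smoker_percentages(train_rows, test_ages):
    """train_rows: iterable of (age, smoking) pairs, smoking being 0 or 1."""
    totals = defaultdict(int)   # patients per age in train
    smokers = defaultdict(int)  # smokers per age in train
    for age, smoking in train_rows:
        totals[age] += 1
        if smoking == 1:
            smokers[age] += 1

    result = []
    for age in test_ages:
        total = totals.get(age, 0)
        if total == 0:
            # No training patient has this age: the statement asks for 0.
            result.append(0.0)
        else:
            result.append(round(100 * smokers.get(age, 0) / total, 2))
    return result

# Toy rows, invented for illustration (not taken from the real dataset).
train_rows = [(70, 1), (70, 0), (70, 0), (65, 1), (82, 0)]
print(smoker_percentages(train_rows, [70, 65, 82, 90]))  # [33.33, 100.0, 0.0, 0.0]
```

Rounding to 2 decimals happens only at the end, so the division itself is done at full precision.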
Subtask 3. Build a classification model that predicts the probability of an Alzheimer’s diagnosis
(p ∈ [0,1]) for each patient in test.csv.
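The statement does not prescribe a model. As one possible baseline, the sketch below fits a plain logistic regression by batch gradient descent using only the standard library; the two toy features, the learning rate, and the number of epochs are all assumptions chosen for illustration. A real solution would more likely use an established machine learning library and the actual columns of train.csv.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function; always returns a value in [0, 1].
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(X, w, b):
    """Predicted probability of the positive class for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

# Toy, invented features already scaled to roughly [-1, 1].
X_train = [[-1.0, 0.8], [-0.6, 0.5], [-0.8, 0.9],
           [0.7, -0.5], [0.9, -0.7], [0.5, -0.9]]
y_train = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X_train, y_train)
probs = predict_proba([[-0.9, 0.7], [0.8, -0.6]], w, b)
print([round(p, 3) for p in probs])
```

Because the metric is ROC AUC, only the ranking of the predicted probabilities matters, so submitting probabilities rather than hard 0/1 labels is the safer choice.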
Evaluation is performed using the ROC Curve and AUC (Area Under the ROC Curve).
Subtasks 1–2 are evaluated exactly (by direct comparison).
The file submission.csv must contain 3 rows for each patient in the test set,
corresponding to the 3 subtasks.
Structure:
subtaskID, datapointID, answer
where datapointID is the PatientID of the test patient.
For PatientID = 4751:
subtaskID,datapointID,answer
1,4751,23
2,4751,31.8
3,4751,0.873
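Assembling the file can be sketched with the csv module. The helper names submission_rows and write_submission are hypothetical, and grouping the three rows per patient mirrors the example above; whether row order matters to the grader is not stated, so that is an assumption. The sketch writes to an in-memory buffer; passing an open file handle for submission.csv would work the same way.

```python
import csv
import io

def submission_rows(patient_ids, counts, percentages, probabilities):
    """Produce three rows per patient, one for each subtask."""
    rows = []
    for pid, count, pct, prob in zip(patient_ids, counts, percentages, probabilities):
        rows.append((1, pid, count))  # Subtask 1: same-age count
        rows.append((2, pid, pct))    # Subtask 2: smoker percentage
        rows.append((3, pid, prob))   # Subtask 3: predicted probability
    return rows

def write_submission(rows, handle):
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["subtaskID", "datapointID", "answer"])
    writer.writerows(rows)

# Reproduce the example for PatientID = 4751 from the statement.
buffer = io.StringIO()
write_submission(submission_rows([4751], [23], [31.8], [0.873]), buffer)
print(buffer.getvalue())
```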
For Subtask 3, evaluation is performed using ROC AUC (Area Under the ROC Curve),
a standard metric for binary classification problems.
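To check a model locally before submitting, ROC AUC can be computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive patient receives the higher score, with ties counted as half. The sketch below is a simple quadratic-time illustration with invented labels and scores; the grader's exact implementation is not given in the statement.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.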