🏥 Problem: Patient Status Analysis and Prediction
In this problem, you need to implement a machine learning model that predicts the value of the Status field from a dataset of patient records. The dataset is provided as a CSV file containing the features listed below, and the model is evaluated on precision for the Dead class.
📊 Dataset
The dataset contains the following fields:
- Age: Patient's age.
- Race: Patient's race (e.g., White, Other).
- Marital Status: Patient's marital status.
- T Stage: Tumor stage (T1, T2, T3, T4).
- N Stage: Degree of lymph node involvement.
- 6th Stage: Cancer stage according to the 6th edition TNM classification.
- differentiate: Degree of tumor differentiation.
- Grade: Histological grade of the tumor (1, 2, 3 etc.).
- A Stage: Disease stage classification.
- Tumor Size: Tumor size.
- BMI: Body mass index.
- Heart Rate: Heart rate.
- Serum Creatinine: Serum creatinine level.
- Uric Acid: Uric acid level.
- Hemoglobin: Hemoglobin concentration.
- GFR (Glomerular Filtration Rate): Glomerular filtration rate.
- Serum Sodium: Serum sodium concentration.
- Serum Potassium: Serum potassium concentration.
- Serum Albumin: Serum albumin level.
- Lactate: Lactate concentration.
- Status: Patient status ("Dead" or "Alive") – target field.
📝 Tasks
For Subtasks 1-4, you will need to load the dataset and perform the statistical analyses described below on train.csv.
Subtask 1 (10 points)
Classify kidney function for each patient based on the GFR value:
- Normal if GFR >= 90
- Mildly Decreased if 60 <= GFR < 90
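A minimal sketch of the thresholds above, assuming the column is named `GFR` and the data is loaded with pandas. The statement lists no label for GFR below 60 or for missing values, so this sketch returns `None` in both cases; that choice is an assumption, not part of the task. The toy `test` frame and the `kidney_function` column name are hypothetical.

```python
import pandas as pd

def classify_gfr(gfr):
    """Map a GFR value to the two labels listed in Subtask 1."""
    if pd.isna(gfr):
        return None  # assumption: a missing GFR gets no label
    if gfr >= 90:
        return "Normal"
    if gfr >= 60:
        return "Mildly Decreased"
    return None  # the statement gives no label for GFR < 60

# Toy stand-in for the real data; "GFR" as a column name is an assumption.
test = pd.DataFrame({"GFR": [95.0, 90.0, 72.5, 60.0, 45.0]})
test["kidney_function"] = test["GFR"].map(classify_gfr)
```

Both boundaries are inclusive on the lower side, so a GFR of exactly 90 is Normal and exactly 60 is Mildly Decreased.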
Subtask 2 (10 points)
Calculate the quartiles (Q1, Q2, Q3) of the Serum Creatinine column and classify each patient in test:
- Very Low if Serum Creatinine <= Q1
- Low if Q1 < Serum Creatinine <= Q2
- High if Q2 < Serum Creatinine <= Q3
- Very High if Serum Creatinine > Q3
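The quartile rule can be sketched as follows. This assumes the quartiles are computed on train (consistent with the note that the statistics come from train.csv) using pandas' default linear interpolation; the statement does not say which quantile method the checker uses. The toy values and the `creatinine_class` column name are hypothetical.

```python
import pandas as pd

def classify_creatinine(value, q1, q2, q3):
    """Apply the four-way quartile rule from Subtask 2."""
    if pd.isna(value):
        return None  # assumption: missing values get no label
    if value <= q1:
        return "Very Low"
    if value <= q2:
        return "Low"
    if value <= q3:
        return "High"
    return "Very High"

# Toy stand-ins for train.csv and the test set.
train = pd.DataFrame({"Serum Creatinine": [0.5, 1.0, 1.5, 2.0, 2.5]})
test = pd.DataFrame({"Serum Creatinine": [0.4, 1.0, 1.2, 1.5, 2.0, 3.0]})

# Quartiles come from train only (assumption), then are applied to test.
q1, q2, q3 = train["Serum Creatinine"].quantile([0.25, 0.5, 0.75])
test["creatinine_class"] = test["Serum Creatinine"].apply(
    classify_creatinine, args=(q1, q2, q3)
)
```

A plain function is used instead of `pd.cut` because `pd.cut` rejects duplicate bin edges, which can occur when two quartiles coincide.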
Subtask 3 (10 points)
Encode the BMI of each patient as a binary value:
- 1 if BMI > the median BMI from train
- 0 if BMI <= the median BMI from train
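One way to compute this flag, sketched with pandas on toy data. The `bmi_flag` column name is hypothetical, and the handling of a missing BMI (it compares as not greater than the median, so it becomes 0 here) is a side effect of the comparison rather than something the statement specifies.

```python
import pandas as pd

# Toy stand-ins for train.csv and the test set.
train = pd.DataFrame({"BMI": [20.0, 22.0, 25.0, 30.0, 35.0]})
test = pd.DataFrame({"BMI": [19.0, 25.0, 25.1, 40.0]})

# The median is taken from train only, then compared against each test value.
bmi_median = train["BMI"].median()

# Strictly greater than the median -> 1, otherwise 0 (NaN also ends up as 0).
test["bmi_flag"] = (test["BMI"] > bmi_median).astype(int)
```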
Subtask 4 (10 points)
For each patient in test, report the number of patients in train who have the same T Stage.
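A short sketch of this lookup, assuming the column is named `T Stage` exactly as in the field list. Reporting 0 for a stage that never appears in train is an assumption; the `same_stage_count` column name is hypothetical.

```python
import pandas as pd

# Toy stand-ins for train.csv and the test set.
train = pd.DataFrame({"T Stage": ["T1", "T2", "T1", "T3", "T2", "T1"]})
test = pd.DataFrame({"T Stage": ["T1", "T3", "T4"]})

# Count train patients per stage once, then look each test patient up.
stage_counts = train["T Stage"].value_counts()

# Stages absent from train map to NaN; report them as 0 (assumption).
test["same_stage_count"] = (
    test["T Stage"].map(stage_counts).fillna(0).astype(int)
)
```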
Subtask 5 (60 points)
Build an ML model to predict Status (Dead or Alive) based on features. Evaluation is based on precision for the Dead class.
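As a starting point only, here is a baseline sketch using scikit-learn. Everything in it is an assumption rather than the expected solution: the data is synthetic, only three of the listed features are used, and logistic regression with median/most-frequent imputation is just one reasonable first choice. It shows how to keep preprocessing inside a pipeline and measure precision for the Dead class on a held-out split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv; the real file has many more columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(30, 80, n).astype(float),
    "Tumor Size": rng.integers(5, 100, n).astype(float),
    "T Stage": rng.choice(["T1", "T2", "T3", "T4"], n),
})
# Synthetic target tied to tumor size purely so the example has some signal.
df["Status"] = np.where(df["Tumor Size"] + rng.normal(0, 15, n) > 60, "Dead", "Alive")
df.loc[::25, "Age"] = np.nan  # inject missing values to exercise the imputer

numeric = ["Age", "Tumor Size"]
categorical = ["T Stage"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_val, y_train, y_val = train_test_split(
    df[numeric + categorical], df["Status"],
    test_size=0.25, stratify=df["Status"], random_state=0,
)
model.fit(X_train, y_train)
pred = model.predict(X_val)

# Precision for the Dead class on the held-out split.
dead_precision = precision_score(y_val, pred, pos_label="Dead", zero_division=0)
```

Because the metric counts only false positives for Dead, raising the decision threshold on `model.predict_proba` is one lever worth trying, at the cost of predicting Dead less often.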
🧮 Evaluation
The evaluation metric is precision for the Dead class:

Precision = TP / (TP + FP)

where:
- TP = number of patients correctly predicted as Dead
- FP = number of patients incorrectly predicted as Dead
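For local checking, the metric can be computed by hand as TP / (TP + FP). The sketch below is a plain-Python version; returning 0.0 when nothing is predicted as Dead is an assumption, since the statement does not define that case. The function name and the toy labels are hypothetical.

```python
def dead_precision(y_true, y_pred, positive="Dead"):
    """Precision for the positive class: TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # assumption: no Dead predictions scores 0
    return tp / (tp + fp)

# Toy labels: three Dead predictions, two of them correct (TP = 2, FP = 1).
y_true = ["Dead", "Alive", "Dead", "Alive", "Dead"]
y_pred = ["Dead", "Dead", "Alive", "Alive", "Dead"]
score = dead_precision(y_true, y_pred)
```

Note that the missed Dead patient (a false negative) does not affect the score; only predictions of Dead are counted.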
Submission file format
The submission file must be in CSV (Comma-Separated Values) format, with the following columns:
- subtaskID - the task index for which we provide the answer
- datapointID - unique identifier of the patient from the dataset
- answer - the answer associated with the patient from the dataset for the indicated task
Example:
subtaskID,datapointID,answer
1,3220,Normal
2,3220,Very High
3,3220,0
4,3220,1281
5,3220,1
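A sketch of assembling a file in this layout with pandas, reusing the example row above. The `answers` dictionary is a hypothetical container mapping each subtask ID to a Series of answers indexed by datapointID; the output file name is not given in the statement, so the sketch only builds the CSV text.

```python
import pandas as pd

# Hypothetical per-subtask answers, indexed by datapointID.
answers = {
    1: pd.Series({3220: "Normal"}),
    2: pd.Series({3220: "Very High"}),
    3: pd.Series({3220: 0}),
    4: pd.Series({3220: 1281}),
    5: pd.Series({3220: 1}),
}

# One long-format frame per subtask, stacked in subtask order.
frames = [
    pd.DataFrame({
        "subtaskID": subtask_id,
        "datapointID": series.index,
        "answer": series.astype(str).values,  # answers mix text and numbers
    })
    for subtask_id, series in answers.items()
]
submission = pd.concat(frames, ignore_index=True)

# Use submission.to_csv(path, index=False) to write the file to disk.
csv_text = submission.to_csv(index=False)
```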
📌 Notes
- The target field is Status.
- Some fields may contain missing values (NaN) and must be handled.
- It is recommended to analyze the correlation between variables and the target variable.
- Detect and remove outliers if they affect model performance.
- Submitting a sample_output for local testing provides 5 points.
🗂️ Useful resources