Patient Status Prediction - OJIA 2 Simulation (v1)
Author: Mihai Nan
🏥 Problem: Patient Status Analysis and Prediction
For this problem, you need to implement a machine learning model to predict the Status field using a provided dataset containing patient information. The dataset is organized in a CSV file with various features, and the model will be evaluated based on precision for the Dead class.
📊 Dataset
The dataset contains the following fields:
- Age: Patient's age.
- Race: Patient's race (e.g., White, Other).
- Marital Status: Patient's marital status.
- T Stage: Tumor stage (T1, T2, T3, T4).
- N Stage: Lymph node involvement.
- 6th Stage: Cancer stage according to the 6th edition of the TNM classification.
- differentiate: Tumor differentiation grade.
- Grade: Histological grade of the tumor (1, 2, 3, etc.).
- A Stage: Disease stage classification.
- Tumor Size: Tumor size.
- BMI: Body Mass Index.
- Heart Rate: Heart rate.
- Serum Creatinine: Serum creatinine level.
- Uric Acid: Uric acid level.
- Hemoglobin: Hemoglobin concentration.
- GFR (Glomerular Filtration Rate): Glomerular filtration rate.
- Serum Sodium: Serum sodium concentration.
- Serum Potassium: Serum potassium concentration.
- Serum Albumin: Serum albumin level.
- Lactate: Lactate concentration.
- Status: Patient status (Dead or Alive) – target field.
📝 Tasks
For the first four subtasks, you will need to load the dataset and compute a value for each patient in test.csv; some of these subtasks also rely on statistics computed on the training set.
Subtask 1 (10 points)
Classify kidney function for each patient in the test set based on GFR:
- Normal if GFR >= 90
- Mildly Decreased if 60 <= GFR < 90
- Decreased if GFR < 60
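The thresholds above can be sketched as a small function. This is a minimal illustration in plain Python, not a reference solution: the name `classify_gfr` and the sample values are hypothetical, and the Decreased label is taken to cover GFR < 60, consistent with the answer labels listed in the submission format. Missing GFR values are not handled here.

```python
def classify_gfr(gfr: float) -> str:
    """Map a GFR value to the kidney-function label used in Subtask 1."""
    if gfr >= 90:
        return "Normal"
    if gfr >= 60:
        return "Mildly Decreased"
    # Assumption: everything below 60 falls into the remaining label.
    return "Decreased"


# Hypothetical GFR values standing in for the GFR column of the test set.
test_gfr = [95.0, 90.0, 72.5, 60.0, 41.3]
labels = [classify_gfr(value) for value in test_gfr]
```

Both boundaries (90 and 60) belong to the upper category, which is why the checks use `>=`.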
Subtask 2 (10 points)
Compute the quartiles (Q1, Q2, Q3) of the Serum Creatinine column in the training set and classify patients in the test set as follows:
- Very Low if Serum Creatinine <= Q1
- Low if Q1 < Serum Creatinine <= Q2
- High if Q2 < Serum Creatinine <= Q3
- Very High if Serum Creatinine > Q3
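One way to sketch this subtask with the standard library is shown below. The quartile definition used by the grader is not stated, so the `method="inclusive"` choice (linear interpolation between sorted values, with the minimum and maximum as the extreme percentiles) is an assumption; the function name and sample values are hypothetical.

```python
import statistics


def make_creatinine_classifier(train_values):
    """Return a classifier built from quartiles of the training column."""
    # Assumption: "inclusive" quartiles; the problem does not name a method.
    q1, q2, q3 = statistics.quantiles(train_values, n=4, method="inclusive")

    def classify(value):
        if value <= q1:
            return "Very Low"
        if value <= q2:
            return "Low"
        if value <= q3:
            return "High"
        return "Very High"

    return classify, (q1, q2, q3)


# Hypothetical training values; real data would come from the training CSV.
train_creatinine = [0.6, 0.8, 1.0, 1.2, 1.4]
classify_creatinine, quartiles = make_creatinine_classifier(train_creatinine)
test_labels = [classify_creatinine(v) for v in [0.8, 0.9, 1.1, 1.3]]
```

The quartiles come from the training set only and are then applied unchanged to the test set, as the subtask requires.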
Subtask 3 (10 points)
Flag each patient's BMI relative to the training-set median:
- Compute the median of the BMI column in the training set.
- For each patient in the test set, assign:
  - 1 if BMI > training median
  - 0 if BMI <= training median
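The two steps above might look like the following sketch. The variable names and BMI values are hypothetical, and the sketch assumes missing values have already been removed or imputed.

```python
import statistics

# Hypothetical BMI values; real data would come from the train and test CSVs.
train_bmi = [22.0, 25.0, 31.0]
test_bmi = [25.0, 26.5, 19.0]

# The median is computed on the training set only.
bmi_median = statistics.median(train_bmi)

# 1 if strictly above the training median, otherwise 0 (ties map to 0).
bmi_flags = [1 if value > bmi_median else 0 for value in test_bmi]
```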
Subtask 4 (10 points)
For each patient in the test set, count the patients in the training set who have the same T Stage.
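A frequency table over the training set makes this a single lookup per test patient. The sketch below uses `collections.Counter`; the variable names and stage values are hypothetical, and returning 0 for a stage that never occurs in training is an assumption about the expected answer.

```python
from collections import Counter

# Hypothetical T Stage values; real data would come from the train and test CSVs.
train_t_stage = ["T1", "T2", "T1", "T3"]
test_t_stage = ["T1", "T4", "T2"]

# Count each stage once in the training set.
stage_counts = Counter(train_t_stage)

# Counter returns 0 for stages that never appear in training.
t_stage_answers = [stage_counts[stage] for stage in test_t_stage]
```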
Subtask 5 (60 points)
Build an ML model to predict Status (Dead or Alive) based on features. Evaluation is based on precision for the Dead class.
🧮 Evaluation
The evaluation metric is precision for the Dead class:
Precision = TP / (TP + FP)
where:
- TP = number of patients correctly predicted as Dead
- FP = number of patients incorrectly predicted as Dead
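For local checks, precision for the Dead class can be computed directly as TP / (TP + FP). The sketch below encodes Dead as 1 and Alive as 0, matching the Subtask 5 answer format; the function name is hypothetical, and returning 0.0 when no patient is predicted Dead is an assumption, since the problem does not define that case.

```python
def precision_dead(y_true, y_pred):
    """Precision for the Dead class, with Dead encoded as 1 and Alive as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        # Assumption: no Dead predictions yields a score of 0.0.
        return 0.0
    return tp / (tp + fp)


# Hypothetical labels: two correct Dead predictions and one false alarm.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
score = precision_dead(y_true, y_pred)
```

Because only predictions of Dead enter the formula, a model that predicts Dead rarely but accurately scores well, while missed Dead cases (false negatives) do not lower the score.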
📌 Notes
- The target field is Status.
- Some fields may contain missing values (NaN) and need to be handled.
- Correlation analysis between features and the target variable is recommended.
- Detect and remove outliers if they affect model performance.
- Submitting a sample_output for local testing gives 5 points.
📄 Submission File Format
The submission.csv file must contain results for all 5 subtasks for each row in the test set.
Each test row generates 5 lines in the file, one for each subtask.
File structure:
subtaskID, datapointID, answer
Column meanings:
- subtaskID – subtask number (1–5)
- datapointID – patient ID from the test set (if no ID, use the row index)
- answer – subtask result:
  - Subtask 1: Normal / Mildly Decreased / Decreased
  - Subtask 2: Very Low / Low / High / Very High
  - Subtask 3: 0 or 1
  - Subtask 4: integer (count)
  - Subtask 5: 1 if the model predicts Dead, 0 if the model predicts Alive
Example for a single patient with ID 3220:
subtaskID datapointID answer
1 3220 Normal
2 3220 Low
3 3220 0
4 3220 5
5 3220 1
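The sketch below assembles rows in this layout with the standard `csv` module. The helper names are hypothetical, and the exact header spelling (no spaces after commas) and the row order (grouped by patient, then by subtask) are assumptions, since the statement only shows a single patient.

```python
import csv
import io


def build_rows(datapoint_ids, answers):
    """answers maps each subtask number (1-5) to a list aligned with datapoint_ids."""
    rows = []
    for index, datapoint_id in enumerate(datapoint_ids):
        for subtask in range(1, 6):
            rows.append((subtask, datapoint_id, answers[subtask][index]))
    return rows


def to_csv_text(rows):
    """Serialize rows with the three-column header used by submission.csv."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subtaskID", "datapointID", "answer"])
    writer.writerows(rows)
    return buffer.getvalue()


# Reproduces the single-patient example for ID 3220.
answers = {1: ["Normal"], 2: ["Low"], 3: [0], 4: [5], 5: [1]}
csv_text = to_csv_text(build_rows([3220], answers))
```

Writing `csv_text` to a file named submission.csv would produce five data lines per test patient, one per subtask.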
🗂️ Useful Resources
- Complete Starter Kit – contains a skeleton to start solving the problem