Author: Mihai Nan
The goal is to build a classification model that predicts whether a patient has diabetes based on blood tests and demographic data.
Each patient is characterized by 8 numerical attributes obtained from analyses and clinical measurements, and the label (target) indicates the presence of diabetes (1 for positive, 0 for negative).
This type of problem belongs to the binary classification category.
- pregnancies – number of pregnancies
- glucose – blood glucose level
- blood_pressure – blood pressure
- skin_thickness – skin fold thickness
- insulin – insulin level
- bmi – body mass index
- diabetes_pedigree_function – genetic risk score
- age – patient age

train.csv
Contains all 8 feature columns plus the column:
- target – indicates the presence of diabetes (0 or 1)

Example:
| SampleID | pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age | target |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
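The training layout above can be read into feature vectors and labels with the standard library alone. The following is a minimal sketch, assuming train.csv is comma-separated with a header row that uses the column names shown in the table (the raw file format is not specified here). The helper name `load_training_rows` and the inline `SAMPLE` text are illustrative only.

```python
import csv
import io

# Feature columns, in the order listed in the task description.
FEATURES = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree_function", "age",
]

def load_training_rows(handle):
    """Split CSV rows into feature vectors X and integer labels y."""
    X, y = [], []
    for row in csv.DictReader(handle):
        X.append([float(row[name]) for name in FEATURES])
        y.append(int(row["target"]))
    return X, y

# The two example rows from the table, written as comma-separated text
# (assumed format). In practice, pass open("train.csv", newline="") instead.
SAMPLE = """SampleID,pregnancies,glucose,blood_pressure,skin_thickness,insulin,bmi,diabetes_pedigree_function,age,target
1,6,148,72,35,0,33.6,0.627,50,1
2,1,85,66,29,0,26.6,0.351,31,0
"""

X, y = load_training_rows(io.StringIO(SAMPLE))
print(X[0])
print(y)
```

SampleID is deliberately left out of the feature vectors, since it identifies a row rather than describing the patient.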
test.csv
Contains the same columns without target, but includes SampleID.
Example:
| SampleID | pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| 2 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 |
The output file (submission.csv) must contain exactly two columns:
- SampleID
- label – the label predicted by the model (0 or 1)

Example:
| SampleID | label |
|---|---|
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
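One possible way to produce a file in this shape is sketched below, assuming a comma-separated file with a header row is what the grader expects. The function name `write_submission` is hypothetical, and the predictions are the three example rows from the table rather than real model output.

```python
import csv
import io

def write_submission(handle, sample_ids, labels):
    """Write SampleID,label rows; reject anything other than 0/1 labels."""
    if len(sample_ids) != len(labels):
        raise ValueError("sample_ids and labels must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SampleID", "label"])
    for sample_id, label in zip(sample_ids, labels):
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([sample_id, label])

# Written to an in-memory buffer here; in practice, pass
# open("submission.csv", "w", newline="") as the handle.
buffer = io.StringIO()
write_submission(buffer, [1, 2, 3], [1, 0, 0])
submission_text = buffer.getvalue()
print(submission_text)
```

Validating the labels at write time catches a common slip, such as submitting probabilities instead of 0/1 predictions, before the file is uploaded.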
Model evaluation will be performed using the following metric:
This metric is suitable for binary classification because it gives equal importance to prediction accuracy for both classes.
General formula:
where:
The final score is expressed as a percentage (0–100), rounded to two decimal places.
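The conversion to a percentage can be sketched as follows, assuming the raw metric value is a fraction in [0, 1]. The helper name `to_percentage` is illustrative, and the exact rounding rule used by the grader is not stated, so Python's built-in `round` is used as a stand-in.

```python
def to_percentage(raw_score):
    """Convert a metric value in [0, 1] to a percentage with two decimals."""
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("raw_score must lie in [0, 1]")
    return round(raw_score * 100, 2)

# Example: two correct predictions out of three.
example = to_percentage(2 / 3)
print(example)
```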
The dataset comes from the original collection:
Pima Indians Diabetes Database – Kaggle