Prediction of academic performance
Author: Mihai Nan
📝 Student Performance Prediction
📘 Problem Description
The goal is to build a regression model to predict the final exam score (Exam_Score) based on students' academic, social, and personal factors.
The model receives a set of features and must estimate a continuous numeric value.
🔹 Features
Each instance contains multiple variables, such as:
- StudyHours
- Attendance
- ParentalInvolvement
- HealthStatus
- ... (other columns present in the dataset)
Target label:
- Exam_Score — numeric exam score
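Because the features mix categorical and numeric columns, they usually need to be turned into numbers before a regression model can use them. The sketch below shows one possible approach, one-hot encoding, in plain Python. It is an illustration and not the method prescribed by this task: the helper names `is_number` and `encode_rows` are hypothetical, and the sample rows only use columns named above.

```python
def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def encode_rows(rows, skip=("SampleID",)):
    """Cast numeric columns to float and one-hot encode the remaining ones.

    Assumption: a column is numeric when every value in it parses as a float.
    """
    columns = [c for c in rows[0] if c not in skip]
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    categorical = [c for c in columns if c not in numeric]
    # Sorted category levels give a stable column order.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    names = numeric + [f"{c}={v}" for c in categorical for v in levels[c]]
    matrix = []
    for r in rows:
        vec = [float(r[c]) for c in numeric]
        vec += [1.0 if r[c] == v else 0.0 for c in categorical for v in levels[c]]
        matrix.append(vec)
    return names, matrix


# Illustrative rows only; the real dataset has more columns.
rows = [
    {"SampleID": "1", "StudyHours": "3.5", "Attendance": "High"},
    {"SampleID": "2", "StudyHours": "1.2", "Attendance": "Low"},
]
names, matrix = encode_rows(rows)
```

In practice the category levels should be learned from the training data and reused for the test data, so that both end up with the same columns in the same order.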
📘 Input File Structure
train.csv
Contains all features + target label.
Required columns:
- SampleID
- various features (categorical + numeric)
- Exam_Score
Example:
| SampleID | StudyHours | Attendance | ParentalInvolvement | ... | Exam_Score |
|---|---|---|---|---|---|
| 1 | 3.5 | High | Medium | ... | 78 |
| 2 | 1.2 | Low | Low | ... | 55 |
| 3 | 4.0 | High | High | ... | 92 |
test.csv
Has the same structure as train.csv, but without the Exam_Score column, since it must be predicted.
Example:
| SampleID | StudyHours | Attendance | ParentalInvolvement | ... |
|---|---|---|---|---|
| 101 | 3.0 | High | Medium | ... |
| 102 | 0.7 | Low | Low | ... |
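A minimal sketch of loading the two files and separating the target, using only the standard `csv` module. In-memory strings stand in for `train.csv` and `test.csv`, and the columns elided with `...` in the examples are simply left out. The helper names `read_rows` and `split_features_target` are hypothetical.

```python
import csv
import io

TARGET = "Exam_Score"


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def split_features_target(rows, target=TARGET):
    """Separate the target column from the feature columns of training rows."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [float(row[target]) for row in rows]
    return features, targets


# In-memory stand-ins for train.csv and test.csv (illustrative values).
train_text = (
    "SampleID,StudyHours,Attendance,ParentalInvolvement,Exam_Score\n"
    "1,3.5,High,Medium,78\n"
    "2,1.2,Low,Low,55\n"
)
test_text = (
    "SampleID,StudyHours,Attendance,ParentalInvolvement\n"
    "101,3.0,High,Medium\n"
)

train_rows = read_rows(io.StringIO(train_text))
test_rows = read_rows(io.StringIO(test_text))
X_train, y_train = split_features_target(train_rows)

# The test columns should equal the train columns minus the target.
assert set(test_rows[0]) == set(X_train[0])
```

With real files, `read_rows(open("train.csv", newline=""))` would take the place of the `io.StringIO` objects.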
📤 Submission
The file submission.csv must contain exactly two columns:
- SampleID
- Exam_Score: model prediction
Example:
| SampleID | Exam_Score |
|---|---|
| 101 | 81.2 |
| 102 | 49.7 |
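A minimal sketch of writing predictions in this two-column layout with the standard `csv` module. The function name `write_submission` is hypothetical, and the sketch writes to an in-memory buffer so the result can be inspected.

```python
import csv
import io


def write_submission(handle, sample_ids, predictions):
    """Write one SampleID / Exam_Score row per prediction, with a header."""
    if len(sample_ids) != len(predictions):
        raise ValueError("need exactly one prediction per SampleID")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SampleID", "Exam_Score"])
    for sample_id, pred in zip(sample_ids, predictions):
        writer.writerow([sample_id, float(pred)])


buffer = io.StringIO()
write_submission(buffer, [101, 102], [81.2, 49.7])
submission_text = buffer.getvalue()
```

To produce the actual file, pass `open("submission.csv", "w", newline="")` as the handle instead of the buffer.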
⚙️ Evaluation
Model evaluation uses two metrics:
- Partial RMSE — using 50% of the data
- Complete RMSE — using all the data
The RMSE is then converted into a 0–100 score using linear interpolation:
- Low RMSE → high score
- High RMSE → low score
The ideal model (RMSE = 0) achieves the maximum score of 100.
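The sketch below shows how RMSE and a linear RMSE-to-score mapping could be computed locally. The statement does not say which RMSE value maps to a score of 0, so `rmse_worst` is a hypothetical parameter with an arbitrary illustrative value. The choice of which 50% of the data forms the partial subset is also an assumption here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def rmse_to_score(value, rmse_worst):
    """Linearly map RMSE in [0, rmse_worst] to a score in [100, 0].

    Assumption: rmse_worst is the RMSE that earns 0 points; values beyond
    it are clamped. The real threshold is not given in the statement.
    """
    clamped = min(max(value, 0.0), rmse_worst)
    return 100.0 * (rmse_worst - clamped) / rmse_worst


y_true = [78.0, 55.0, 92.0, 60.0]
y_pred = [80.0, 53.0, 90.0, 62.0]

complete_rmse = rmse(y_true, y_pred)         # every error is 2 in magnitude
partial_rmse = rmse(y_true[:2], y_pred[:2])  # first half as a stand-in for the 50% subset
score = rmse_to_score(complete_rmse, rmse_worst=10.0)
perfect = rmse_to_score(0.0, rmse_worst=10.0)
```

With these illustrative numbers both RMSE values are 2.0, and the hypothetical threshold of 10.0 gives a score of 80. An RMSE of 0 maps to 100, which matches the statement.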
📊 Data Source
The dataset is generated based on the public Kaggle dataset: Student Performance Factors Dataset