💌 Automated Spam Email Detection System 📧
Author: Mihai Nan
Your company wants to protect users from unwanted emails (spam).
To achieve this, the company has decided to build an automated system that
identifies spam emails and separates them from legitimate (non-spam) ones.
You have been given a labeled dataset of emails and need to build a model
that can classify new emails.
Two files are provided:
- train.csv - historical emails with the label (spam = 1 / non-spam = 0)
- test.csv - new emails, without labels
The main goal is to predict the probability that an email is spam (a value between 0 and 1, where 0 = definitely non-spam and 1 = definitely spam).
📊 Dataset
Each row represents an email with the following attributes:
- sample_id - unique identifier
- text - email content
- label - only in train.csv, 1 (spam) / 0 (non-spam)
The final goal is to predict this label (as a spam probability) for each row in test.csv.
📝 Tasks
The first two subtasks test simple email analysis;
the last subtask evaluates the classification model.
Subtask 1 (10 points)
Determine the length of each email in characters.
The output for this subtask should be an integer.
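A minimal sketch of how this might be computed with pandas, assuming the column names from the dataset description (the statement does not specify how whitespace or Unicode should be counted; len() on the raw string is one reasonable reading):

```python
import pandas as pd

# Assumption: test.csv has the columns described above.
# sample_id is read as a string to preserve leading zeros (e.g. "00042").
test = pd.read_csv("test.csv", dtype={"sample_id": str})

# Subtask 1: length of each email in characters (len() counts every
# character of the string, including spaces and punctuation).
test["length"] = test["text"].str.len()
```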
Subtask 2 (15 points)
Count how many times the word free appears in the email.
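The statement does not say whether the count is case-sensitive or whether free must be a whole word; the sketch below assumes a case-insensitive whole-word match, which is one plausible reading:

```python
import re

def count_free(text: str) -> int:
    # Assumption: case-insensitive, whole-word occurrences of "free"
    # (so "freedom" does not count).
    return len(re.findall(r"\bfree\b", text, flags=re.IGNORECASE))

count_free("FREE offer! Get free stuff (not freedom).")  # -> 2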
Subtask 3 (75 points)
Build a classification model that predicts the probability that an email is spam (p ∈ [0,1]) for each row in the test set.
Evaluation is done using the ROC curve and AUC (Area Under the ROC Curve).
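A common baseline for this kind of task is a bag-of-words model; the sketch below uses scikit-learn's TF-IDF features with logistic regression. This is one possible approach under the stated column names, not the intended solution:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv", dtype={"sample_id": str})
test = pd.read_csv("test.csv", dtype={"sample_id": str})

# TF-IDF on word unigrams/bigrams, then a linear classifier
# that can output class probabilities.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(train["text"], train["label"])

# Probability that each test email is spam (class 1), p in [0, 1].
spam_proba = model.predict_proba(test["text"])[:, 1]
```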
🧮 Evaluation
- AUC ≥ 0.95 → 75 points
- AUC ≤ 0.80 → 0 points
- Scores in between are proportional
Subtasks 1–2 are evaluated exactly (by comparison).
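The statement does not spell out the in-between formula, but "proportional" presumably means linear interpolation between the two thresholds:

points = 75 × (AUC − 0.80) / (0.95 − 0.80), for 0.80 < AUC < 0.95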
📄 Submission File Format
The file submission.csv should contain 3 lines for each test row,
corresponding to the 3 subtasks.
Structure:
subtaskID, datapointID, answer
where:
- subtaskID - 1, 2, or 3
- datapointID - value of sample_id
- answer - depends on the subtask:
  - Subtask 1: length of the email (integer)
  - Subtask 2: number of occurrences of the word free (integer)
  - Subtask 3: probability that the email is spam (real number 0–1)
Example for sample_id = 00042:
subtaskID,datapointID,answer
1,00042,342
2,00042,3
3,00042,0.742
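Putting the three subtasks together, one way to assemble submission.csv, continuing the hypothetical test, count_free, and spam_proba names from the sketches above:

```python
import pandas as pd

rows = []
for sid, text, p in zip(test["sample_id"], test["text"], spam_proba):
    rows.append((1, sid, len(text)))           # Subtask 1: character length
    rows.append((2, sid, count_free(text)))    # Subtask 2: occurrences of "free"
    rows.append((3, sid, round(float(p), 6)))  # Subtask 3: spam probability

pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"]).to_csv(
    "submission.csv", index=False
)
```

Reading sample_id as a string earlier keeps identifiers like 00042 intact in the output.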
📊 Evaluation Metric: ROC AUC 📈
For Subtask 3, evaluation is done using ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes a classifier's performance across all possible decision thresholds.
How ROC AUC is calculated
- Plot the ROC curve, which shows, at every decision threshold:
  - TPR (True Positive Rate) - fraction of positive cases correctly identified
  - FPR (False Positive Rate) - fraction of negative cases incorrectly classified as positive
- The area under the curve (AUC) is computed using the trapezoidal rule (see the sketch after this list):
  - the curve is split into trapezoids between consecutive (FPR, TPR) points
  - the areas of the trapezoids are summed to obtain the final AUC
- Interpreting the score:
  - ROC AUC = 1 🏆 → perfect classifier: every spam email is ranked above every non-spam email
  - ROC AUC = 0.5 🎲 → random classifier, no predictive power
  - 0.5 < ROC AUC < 1 📈 → the closer to 1, the better the classifier separates the classes
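In practice you would simply call sklearn.metrics.roc_auc_score, but the trapezoidal computation described above is short enough to write out directly; a sketch with NumPy and scikit-learn, assuming y_true holds the 0/1 labels and y_score the predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_trapezoid(y_true, y_score):
    # Build the ROC curve, then sum the trapezoids between
    # consecutive (FPR, TPR) points.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.trapz(tpr, fpr))

y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(auc_trapezoid(y_true, y_score))   # -> 0.8333...
print(roc_auc_score(y_true, y_score))   # same value
```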