
Automated spam email detection system

Author: Mihai Nan

Difficulty: Easy
Problem Description

💌 Automated Spam Email Detection System 📧

Your company wants to protect users from unwanted emails (spam).
To do so, it has decided to build an automated system that can
identify spam emails and separate them from legitimate (non-spam) ones.

You have been given a labeled dataset of emails and need to build a model
that can classify new emails.

Two files are provided:

  • train.csv - historical emails with the label (spam = 1 / non-spam = 0)
  • test.csv - new emails, without labels

The main goal is to predict the probability that an email is spam: a value between 0 and 1, where 0 = definitely non-spam and 1 = definitely spam.


📊 Dataset

Each row represents an email with the following attributes:

  • sample_id - unique identifier
  • text - email content
  • label - only in train.csv, 1 (spam) / 0 (non-spam)

The final goal is to predict the label for rows in test.csv.
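A minimal sketch of loading the two files, assuming pandas is available. The in-memory CSV strings below are stand-ins for the real train.csv / test.csv (same columns); note that reading sample_id as a string preserves leading zeros such as 00042.

```python
import io
import pandas as pd

# Stand-ins for train.csv / test.csv (the real files share these columns).
train_csv = io.StringIO(
    "sample_id,text,label\n"
    "00001,Win a FREE prize now,1\n"
    "00002,Meeting rescheduled to 3pm,0\n"
)
test_csv = io.StringIO(
    "sample_id,text\n"
    "00042,Free free offer inside\n"
)

# dtype=str keeps leading zeros in sample_id (e.g. "00042") intact.
train = pd.read_csv(train_csv, dtype={"sample_id": str})
test = pd.read_csv(test_csv, dtype={"sample_id": str})

print(train.columns.tolist())  # ['sample_id', 'text', 'label']
print(test.columns.tolist())   # ['sample_id', 'text']
```

For the real files, replace the StringIO objects with the paths "train.csv" and "test.csv".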


📝 Tasks

The first two subtasks check simple email analysis.
The last subtask evaluates the classification model.


Subtask 1 (10 points)

Determine the length of each email in characters.

Output for this subtask should be an integer.
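The length in characters is simply the string length of the text field, e.g.:

```python
def email_length(text: str) -> int:
    # Character count of the email body, including spaces and punctuation.
    return len(text)

print(email_length("Win a FREE prize now"))  # 20
```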


Subtask 2 (15 points)

Count how many times the word free appears in the email.
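The statement does not specify whether matching is case-sensitive or restricted to whole words; the sketch below assumes case-insensitive, whole-word matching (so "Freedom" is not counted). Adjust if the checker expects otherwise.

```python
import re

def count_free(text: str) -> int:
    # \b keeps "freedom"/"carefree" from matching; lower() ignores case.
    return len(re.findall(r"\bfree\b", text.lower()))

print(count_free("Free stuff! Get it free. Freedom is different."))  # 2
```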


Subtask 3 (75 points)

Build a classification model that predicts, for each row in the test set, the probability that the email is spam (p ∈ [0, 1]).

Evaluation is done using the ROC curve and AUC (Area Under the ROC Curve).
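One reasonable baseline (an assumption, not the required approach) is TF-IDF features with logistic regression, which outputs probabilities directly via predict_proba. This sketch assumes scikit-learn is available and uses tiny inline data in place of the real files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny in-memory stand-in for the real train/test texts.
train_texts = ["Win a FREE prize now!!!", "Lowest price, free offer",
               "Meeting moved to 3pm", "Please review the attached report"]
train_labels = [1, 1, 0, 0]
test_texts = ["Claim your free prize", "Lunch tomorrow?"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Column 1 of predict_proba is P(label = 1), i.e. P(spam); each value is in [0, 1].
probs = model.predict_proba(test_texts)[:, 1]
print(probs)
```

Since AUC only depends on the ranking of the scores, any model whose scores order spam above non-spam works; calibrated probabilities are not required.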


🧮 Evaluation

  • AUC ≥ 0.95 → 75 points
  • AUC ≤ 0.80 → 0 points
  • Scores in between are scaled linearly

Subtasks 1–2 are evaluated by exact comparison with the reference answers.
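Reading "proportional" as linear interpolation between 0 points at AUC 0.80 and 75 points at AUC 0.95 (an assumption about the exact formula), the Subtask 3 score would be:

```python
def subtask3_points(auc: float) -> float:
    # Assumed linear interpolation between the two published thresholds.
    if auc >= 0.95:
        return 75.0
    if auc <= 0.80:
        return 0.0
    return 75.0 * (auc - 0.80) / (0.95 - 0.80)

print(subtask3_points(0.875))  # about 37.5 (halfway between the thresholds)
```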


📄 Submission File Format

The file submission.csv should contain 3 lines for each test row,
corresponding to the 3 subtasks.

Structure:

subtaskID, datapointID, answer

where:

  • subtaskID - 1, 2, or 3
  • datapointID - value of sample_id
  • answer - depends on the task:
    • Subtask 1: length of the email (integer)
    • Subtask 2: number of occurrences of the word free (integer)
    • Subtask 3: probability that the email is spam (real number 0–1)

Example for sample_id = 00042:

subtaskID,datapointID,answer
1,00042,342
2,00042,3
3,00042,0.742
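The file above can be produced with the csv module. The sketch below writes to an in-memory buffer so it runs standalone; the results dict is a hypothetical container for the three per-email answers, keyed by sample_id.

```python
import csv
import io

# Hypothetical per-email results keyed by sample_id.
results = {
    "00042": {"length": 342, "free_count": 3, "spam_prob": 0.742},
}

buf = io.StringIO()  # for the real file: open("submission.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["subtaskID", "datapointID", "answer"])
for sid, r in results.items():
    writer.writerow([1, sid, r["length"]])      # Subtask 1: length
    writer.writerow([2, sid, r["free_count"]])  # Subtask 2: "free" count
    writer.writerow([3, sid, r["spam_prob"]])   # Subtask 3: P(spam)

print(buf.getvalue())
```

Keeping sample_id as a string throughout avoids losing the leading zeros in datapointID.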

📊 Evaluation Metric: ROC AUC 📈

For Subtask 3, evaluation is done using ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes a classifier's performance across all possible decision thresholds.

How ROC AUC is calculated

  1. Plot the ROC curve, which shows:
    • TPR (True Positive Rate) - fraction of correctly identified positive cases
    • FPR (False Positive Rate) - fraction of negative cases incorrectly classified as positive

  2. Area under the curve (AUC) is computed using the trapezoidal rule:
    • The area between consecutive FPR values is approximated by a trapezoid whose parallel sides are the corresponding TPR values
    • The trapezoid areas are summed to obtain the final AUC

  3. Interpreting the score:
    • ROC AUC = 1 🏆 → perfect classifier: some threshold separates all positive cases from all negative ones
    • ROC AUC = 0.5 🎲 → random classifier, no predictive power
    • 0.5 < ROC AUC < 1 📈 → the closer to 1, the better the classifier separates the classes