
Automated spam email detection system

Author: Mihai Nan

Difficulty: Easy
Problem Description

💌 Automated Spam Email Detection System 📧

Your company wants to protect users from unwanted emails (spam).
To do so, it has decided to build an automated system that can
identify spam emails and separate them from legitimate (non-spam) ones.

You have been given a labeled dataset of emails and need to build a model
that can classify new emails.

Two files are provided:

  • train.csv - historical emails with the label (spam = 1 / non-spam = 0)
  • test.csv - new emails, without labels

The main goal is to predict the probability that an email is spam: a value between 0 and 1, where 0 = definitely non-spam and 1 = definitely spam.


📊 Dataset

Each row represents an email with the following attributes:

  • sample_id - unique identifier
  • text - email content
  • label - only in train.csv, 1 (spam) / 0 (non-spam)

The final goal is to predict the label for rows in test.csv.
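A minimal sketch of loading the two files, assuming pandas is available. The in-memory CSV strings below are stand-ins for the real train.csv / test.csv (same columns); note that reading sample_id as a string preserves leading zeros such as 00042.

```python
import io
import pandas as pd

# Stand-ins for train.csv / test.csv (the real files share these columns).
train_csv = io.StringIO(
    "sample_id,text,label\n"
    "00001,Win a FREE prize now,1\n"
    "00002,Meeting rescheduled to 3pm,0\n"
)
test_csv = io.StringIO(
    "sample_id,text\n"
    "00042,Free free offer inside\n"
)

# dtype=str keeps leading zeros in sample_id (e.g. "00042") intact.
train = pd.read_csv(train_csv, dtype={"sample_id": str})
test = pd.read_csv(test_csv, dtype={"sample_id": str})

print(train.columns.tolist())  # ['sample_id', 'text', 'label']
print(test.columns.tolist())   # ['sample_id', 'text']
```

For the real files, replace the StringIO objects with the paths "train.csv" and "test.csv".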


📝 Tasks

The first two subtasks check simple email analysis.
The last subtask evaluates the classification model.


Subtask 1 (10 points)

Determine the length of each email in characters.

Output for this subtask should be an integer.
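The length in characters is simply the string length of the text field, e.g.:

```python
def email_length(text: str) -> int:
    # Character count of the email body, including spaces and punctuation.
    return len(text)

print(email_length("Win a FREE prize now"))  # 20
```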


Subtask 2 (15 points)

Count how many times the word free appears in the email.
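The statement does not specify whether matching is case-sensitive or restricted to whole words; the sketch below assumes case-insensitive, whole-word matching (so "Freedom" is not counted). Adjust if the checker expects otherwise.

```python
import re

def count_free(text: str) -> int:
    # \b keeps "freedom"/"carefree" from matching; lower() ignores case.
    return len(re.findall(r"\bfree\b", text.lower()))

print(count_free("Free stuff! Get it free. Freedom is different."))  # 2
```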


Subtask 3 (75 points)

Build a classification model that predicts, for each row in the test set, the probability that the email is spam (p ∈ [0, 1]).

Evaluation is done using the ROC curve and AUC (Area Under the ROC Curve).
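One reasonable baseline (an assumption, not the required approach) is TF-IDF features with logistic regression, which outputs probabilities directly via predict_proba. This sketch assumes scikit-learn is available and uses tiny inline data in place of the real files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny in-memory stand-in for the real train/test texts.
train_texts = ["Win a FREE prize now!!!", "Lowest price, free offer",
               "Meeting moved to 3pm", "Please review the attached report"]
train_labels = [1, 1, 0, 0]
test_texts = ["Claim your free prize", "Lunch tomorrow?"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Column 1 of predict_proba is P(label = 1), i.e. P(spam); each value is in [0, 1].
probs = model.predict_proba(test_texts)[:, 1]
print(probs)
```

Since AUC only depends on the ranking of the scores, any model whose scores order spam above non-spam works; calibrated probabilities are not required.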


🧮 Evaluation

  • AUC ≥ 0.95 → 75 points
  • AUC ≤ 0.80 → 0 points
  • Scores in between are scaled linearly

Subtasks 1–2 are evaluated by exact comparison with the reference answers.
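Reading "proportional" as linear interpolation between 0 points at AUC 0.80 and 75 points at AUC 0.95 (an assumption about the exact formula), the Subtask 3 score would be:

```python
def subtask3_points(auc: float) -> float:
    # Assumed linear interpolation between the two published thresholds.
    if auc >= 0.95:
        return 75.0
    if auc <= 0.80:
        return 0.0
    return 75.0 * (auc - 0.80) / (0.95 - 0.80)

print(subtask3_points(0.875))  # about 37.5 (halfway between the thresholds)
```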


📄 Submission File Format

The file submission.csv should contain 3 lines for each test row,
corresponding to the 3 subtasks.

Structure:

subtaskID, datapointID, answer

where:

  • subtaskID - 1, 2, or 3
  • datapointID - value of sample_id
  • answer - depends on the task:
    • Subtask 1: length of the email (integer)
    • Subtask 2: number of occurrences of the word free (integer)
    • Subtask 3: probability that the email is spam (real number 0–1)

Example for sample_id = 00042:

subtaskID,datapointID,answer
1,00042,342
2,00042,3
3,00042,0.742
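The file above can be produced with the csv module. The sketch below writes to an in-memory buffer so it runs standalone; the results dict is a hypothetical container for the three per-email answers, keyed by sample_id.

```python
import csv
import io

# Hypothetical per-email results keyed by sample_id.
results = {
    "00042": {"length": 342, "free_count": 3, "spam_prob": 0.742},
}

buf = io.StringIO()  # for the real file: open("submission.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["subtaskID", "datapointID", "answer"])
for sid, r in results.items():
    writer.writerow([1, sid, r["length"]])      # Subtask 1: length
    writer.writerow([2, sid, r["free_count"]])  # Subtask 2: "free" count
    writer.writerow([3, sid, r["spam_prob"]])   # Subtask 3: P(spam)

print(buf.getvalue())
```

Keeping sample_id as a string throughout avoids losing the leading zeros in datapointID.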

📊 Evaluation Metric: ROC AUC 📈

For Subtask 3, evaluation is done using ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes a classifier's performance across all possible decision thresholds.

How ROC AUC is calculated

  1. Plot the ROC curve, which shows:
    • TPR (True Positive Rate) - fraction of correctly identified positive cases
    • FPR (False Positive Rate) - fraction of negative cases incorrectly classified as positive

  2. Area under the curve (AUC) is computed using the trapezoidal rule:
    • The area between consecutive FPR values is approximated by a trapezoid whose parallel sides are the corresponding TPR values
    • The trapezoid areas are summed to obtain the final AUC

  3. Interpreting the score:
    • ROC AUC = 1 🏆 → perfect classifier: some threshold separates all positive cases from all negative ones
    • ROC AUC = 0.5 🎲 → random classifier, no predictive power
    • 0.5 < ROC AUC < 1 📈 → the closer to 1, the better the classifier separates the classes