Author: Mihnea-Teodor Stoica
In a parallel social media universe, Chirper is the most popular micro-messaging platform.
Recently, it was acquired by the famous (and slightly eccentric) Melon Husk, who decided to rebrand it as Y.
To make Y cleaner and friendlier, Melon Husk has asked your data science team to build a classification model to automatically detect problematic chirps (spam, irrelevant content, or noise), so they can be filtered from the feed.
You have received a dataset of historical chirps and need to build a model
that can classify new chirps.
Two files have been provided: train.csv and test.csv.

Each row represents a chirp posted on Chirper Y, with the following fields:
id – unique identifier for the chirp
chirp – text of the chirp
label – only in train.csv (problematic = 1 / normal = 0)

Goal: predict, for each row in test.csv, the probability of the chirp being problematic (a value between 0 and 1, where 0 = definitely normal, 1 = definitely problematic).
The first two subtasks check simple chirp analysis.
The final subtask evaluates classification model performance.
Subtask 1: Determine the length of each chirp in characters. The answer for each chirp is an integer.
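A minimal sketch of this computation, assuming pandas is available and the file/column names match the field list above:

```python
import pandas as pd

# Load the chirps to score (file and column names as described above)
test = pd.read_csv("test.csv")

# Subtask 1: length of each chirp in characters (one integer per chirp)
test["length"] = test["chirp"].str.len()
```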
Subtask 2: Count how many times the character # appears in each chirp
(an important indicator of excessive hashtags, beloved by spammers 😄).
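The same approach works here; note that pandas' str.count interprets its argument as a regular expression, which is harmless in this case since # is a literal character:

```python
# Subtask 2: number of '#' characters in each chirp
test["hashtags"] = test["chirp"].str.count("#")
```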
Subtask 3: Build a classification model that predicts the probability of a chirp
being problematic (p ∈ [0,1]) for each row in test.csv.
Evaluation uses ROC curve and AUC (Area Under the ROC Curve).
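One possible baseline, sketched with scikit-learn; the TF-IDF features and logistic regression here are an illustrative choice, not a prescribed solution:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Word-level TF-IDF features feeding a linear classifier;
# predict_proba yields the required probability p in [0, 1]
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(train["chirp"], train["label"])
probs = model.predict_proba(test["chirp"])[:, 1]
```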
Subtasks 1–2 are evaluated by exact comparison.
submission.csv must contain 3 lines for each test row,
corresponding to the 3 subtasks.
Structure:
subtaskID,datapointID,answer
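Assuming datapointID refers to the chirp's id (this mapping is an assumption, not stated explicitly above), the submission file can be assembled from the pieces computed earlier:

```python
# Three lines per test row: one answer per subtask
rows = []
for i, chirp_row in test.iterrows():
    rows.append((1, chirp_row["id"], len(chirp_row["chirp"])))        # Subtask 1: length
    rows.append((2, chirp_row["id"], chirp_row["chirp"].count("#")))  # Subtask 2: '#' count
    rows.append((3, chirp_row["id"], probs[i]))                       # Subtask 3: probability

pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"]) \
    .to_csv("submission.csv", index=False)
```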
For Subtask 3, evaluation is based on ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes classifier performance
across all possible decision thresholds.
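The metric can be estimated locally before submitting by holding out part of train.csv; a sketch reusing the baseline model above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hold out 20% of the training data to approximate the test-time AUC
X_tr, X_va, y_tr, y_va = train_test_split(
    train["chirp"], train["label"],
    test_size=0.2, random_state=42, stratify=train["label"],
)
model.fit(X_tr, y_tr)
val_probs = model.predict_proba(X_va)[:, 1]
print("validation AUC:", roc_auc_score(y_va, val_probs))
```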
Plot the ROC curve, showing the true positive rate against the false positive rate at every decision threshold.
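Continuing the validation split above, the curve can be drawn with matplotlib and scikit-learn's roc_curve:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_va, val_probs)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance level")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve")
plt.legend()
plt.show()
```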
Area under the curve (AUC) is calculated using the trapezoidal rule:
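For ROC points (FPR_i, TPR_i), i = 1..n, sorted by increasing false positive rate, the rule is:

$$
\mathrm{AUC} \approx \sum_{i=1}^{n-1} \left(\mathrm{FPR}_{i+1} - \mathrm{FPR}_{i}\right) \cdot \frac{\mathrm{TPR}_{i} + \mathrm{TPR}_{i+1}}{2}
$$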
Score interpretation: an AUC of 1.0 means the model ranks every problematic chirp above every normal one, 0.5 is no better than random guessing, and values below 0.5 indicate worse-than-random ranking.