Chirper and Melon Husk’s Empire (Y)
Author: Mihnea-Teodor Stoica
Context
In a parallel social media universe, Chirper is the most popular micro-messaging platform.
Recently, it was acquired by the famous (and slightly eccentric) Melon Husk, who decided to rebrand it as Y.
To make Y cleaner and friendlier, Melon Husk has asked your data science team to build a classification model to automatically detect problematic chirps (spam, irrelevant content, or noise), so they can be filtered from the feed.
You have received a dataset of historical chirps and need to build a model
that can classify new chirps.
Two files have been provided:
- train.csv – historical chirps with a label (problematic = 1 / normal = 0)
- test.csv – new chirps without labels
Primary goal: predict the probability of a chirp being problematic
(value between 0 and 1, where 0 = definitely normal, 1 = definitely problematic).
📊 Dataset
Each row represents a chirp posted on Chirper Y, with the following fields:
- id – unique identifier for the chirp
- chirp – text of the chirp
- label – present only in train.csv: 1 (problematic) / 0 (normal)
Final goal: predict label for rows in test.csv.
📝 Tasks
The first two subtasks cover simple chirp analysis.
The final subtask evaluates classification model performance.
Subtask 1 (10 points)
Determine the length of each chirp in characters.
Report an integer for this subtask.
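For example, a minimal sketch with pandas, using the test.csv file and the chirp column from the dataset description (the same idea applies to train.csv):

```python
import pandas as pd

# Load the provided test set.
test = pd.read_csv("test.csv")

# Subtask 1: number of characters in each chirp.
test["chirp_length"] = test["chirp"].str.len()
print(test[["id", "chirp_length"]].head())
```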
Subtask 2 (15 points)
Count how many times the character # appears in each chirp
(an important indicator for excessive hashtags, loved by spammers 😄).
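A matching pandas sketch; str.count("#") simply counts occurrences of the character in each chirp:

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Subtask 2: how many '#' characters appear in each chirp.
test["hashtag_count"] = test["chirp"].str.count("#")
print(test[["id", "hashtag_count"]].head())
```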
Subtask 3 (75 points)
Build a classification model that predicts the probability of a chirp
being problematic (p ∈ [0,1]) for each row in test.csv.
Evaluation uses ROC curve and AUC (Area Under the ROC Curve).
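One possible baseline (a sketch, not a required approach): a TF-IDF + logistic regression pipeline from scikit-learn, with a cross-validated AUC estimate on the training data before submitting. Column names follow the dataset description; the model choice is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Simple text-classification baseline: word n-grams + linear model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# Estimate ROC AUC on the training data before submitting.
cv_auc = cross_val_score(model, train["chirp"], train["label"],
                         scoring="roc_auc", cv=5)
print("CV ROC AUC:", cv_auc.mean())

model.fit(train["chirp"], train["label"])
# Probability of the positive class (problematic = 1) for each test chirp.
test_probs = model.predict_proba(test["chirp"])[:, 1]
```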
🧮 Evaluation
- AUC ≥ 0.95 → 75 points
- AUC ≤ 0.80 → 0 points
- Score is proportional for values in between
Subtasks 1–2 are evaluated exactly (by comparison).
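For reference, assuming the score scales linearly between the two thresholds (a sketch, not the official grading code):

```python
def subtask3_points(auc: float) -> float:
    """Assumed linear scaling between AUC 0.80 (0 points) and 0.95 (75 points)."""
    if auc >= 0.95:
        return 75.0
    if auc <= 0.80:
        return 0.0
    return 75.0 * (auc - 0.80) / (0.95 - 0.80)

print(subtask3_points(0.90))  # 50.0
```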
📄 Submission File Format
submission.csv must contain 3 lines for each test row,
corresponding to the 3 subtasks.
Structure:
subtaskID,datapointID,answer
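A hypothetical way to assemble submission.csv in this format; the row ordering (grouped per chirp) and the placeholder probabilities are assumptions for illustration only:

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Placeholder probabilities keep the example self-contained; in practice use
# the predictions from your Subtask 3 model instead.
test_probs = [0.5] * len(test)

rows = []
for chirp_id, chirp, prob in zip(test["id"], test["chirp"], test_probs):
    rows.append((1, chirp_id, len(chirp)))        # Subtask 1: length in characters
    rows.append((2, chirp_id, chirp.count("#")))  # Subtask 2: number of '#' characters
    rows.append((3, chirp_id, prob))              # Subtask 3: probability of problematic

submission = pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"])
submission.to_csv("submission.csv", index=False)
```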
📊 Evaluation Metric: ROC AUC 📈
For Subtask 3, evaluation is based on ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes classifier performance
across all possible decision thresholds.
How ROC AUC is calculated
1. Plot the ROC curve, showing:
   - TPR (True Positive Rate) – proportion of correctly detected problematic chirps
   - FPR (False Positive Rate) – proportion of normal chirps incorrectly flagged as problematic
2. Calculate the area under the curve (AUC) using the trapezoidal rule (see the sketch after this list):
   - Divide the curve into trapezoids using vertical lines at FPR values and horizontal lines at TPR values
   - Sum the trapezoid areas to get the final AUC value
3. Score interpretation:
- ROC AUC = 1 🏆 → perfect classifier, Melon Husk’s dream platform
- ROC AUC = 0.5 🎲 → random classifier, as useful as a Chirper poll
- 0.5 < ROC AUC < 1 📈 → how well the model separates normal vs. problematic chirps
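A minimal NumPy sketch of the trapezoidal computation from step 2; for untied scores it matches standard implementations such as sklearn.metrics.roc_auc_score:

```python
import numpy as np

def roc_auc_trapezoid(y_true, scores):
    """ROC AUC via the trapezoidal rule over (FPR, TPR) points."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))  # sort chirps by descending score
    y_sorted = y_true[order]

    # After each cut-off, count true and false positives, then normalise.
    tps = np.cumsum(y_sorted)
    fps = np.cumsum(1 - y_sorted)
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))

    # Trapezoidal rule: sum the trapezoid areas between consecutive points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# A perfect ranking of the two problematic chirps gives AUC = 1.0
print(roc_auc_trapezoid([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```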