Chirper and Melon Husk’s Empire (Y)
Author: Mihnea-Teodor Stoica
Context
In a parallel social media universe, Chirper is the most popular micro-messaging platform.
Recently, it was acquired by the famous (and slightly eccentric) Melon Husk, who decided to rebrand it as Y.
To make Y cleaner and friendlier, Melon Husk has asked your data science team to build a classification model to automatically detect problematic chirps (spam, irrelevant content, or noise), so they can be filtered from the feed.
You have received a dataset of historical chirps and need to build a model
that can classify new chirps.
Two files have been provided:
- train.csv – historical chirps with a label (problematic = 1 / normal = 0)
- test.csv – new chirps without labels
Primary goal: predict the probability of a chirp being problematic
(value between 0 and 1, where 0 = definitely normal, 1 = definitely problematic).
📊 Dataset
Each row represents a chirp posted on Chirper Y, with the following fields:
- id – unique identifier for the chirp
- chirp – text of the chirp
- label – present only in train.csv: 1 (problematic) / 0 (normal)
Final goal: predict label for rows in test.csv.
📝 Tasks
The first two subtasks cover simple chirp analysis.
The final subtask evaluates classification model performance.
Subtask 1 (10 points)
Determine the length of each chirp in characters.
Report an integer for this subtask.
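For example, a minimal sketch with pandas, using the test.csv file and the chirp column from the dataset description (the same idea applies to train.csv):

```python
import pandas as pd

# Load the provided test set.
test = pd.read_csv("test.csv")

# Subtask 1: number of characters in each chirp.
test["chirp_length"] = test["chirp"].str.len()
print(test[["id", "chirp_length"]].head())
```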
Subtask 2 (15 points)
Count how many times the character # appears in each chirp
(an important indicator for excessive hashtags, loved by spammers 😄).
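A matching pandas sketch; str.count("#") simply counts occurrences of the character in each chirp:

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Subtask 2: how many '#' characters appear in each chirp.
test["hashtag_count"] = test["chirp"].str.count("#")
print(test[["id", "hashtag_count"]].head())
```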
Subtask 3 (75 points)
Build a classification model that predicts the probability of a chirp
being problematic (p ∈ [0,1]) for each row in test.csv.
Evaluation uses ROC curve and AUC (Area Under the ROC Curve).
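One possible baseline (a sketch, not a required approach): a TF-IDF + logistic regression pipeline from scikit-learn, with a cross-validated AUC estimate on the training data before submitting. Column names follow the dataset description; the model choice is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Simple text-classification baseline: word n-grams + linear model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# Estimate ROC AUC on the training data before submitting.
cv_auc = cross_val_score(model, train["chirp"], train["label"],
                         scoring="roc_auc", cv=5)
print("CV ROC AUC:", cv_auc.mean())

model.fit(train["chirp"], train["label"])
# Probability of the positive class (problematic = 1) for each test chirp.
test_probs = model.predict_proba(test["chirp"])[:, 1]
```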
🧮 Evaluation
- AUC ≥ 0.95 → 75 points
- AUC ≤ 0.80 → 0 points
- Score is proportional for values in between
Subtasks 1–2 are evaluated exactly (by comparison).
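For reference, assuming the score scales linearly between the two thresholds (a sketch, not the official grading code):

```python
def subtask3_points(auc: float) -> float:
    """Assumed linear scaling between AUC 0.80 (0 points) and 0.95 (75 points)."""
    if auc >= 0.95:
        return 75.0
    if auc <= 0.80:
        return 0.0
    return 75.0 * (auc - 0.80) / (0.95 - 0.80)

print(subtask3_points(0.90))  # 50.0
```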
📄 Submission File Format
submission.csv must contain 3 lines for each test row,
corresponding to the 3 subtasks.
Structure:
subtaskID,datapointID,answer
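A hypothetical way to assemble submission.csv in this format; the row ordering (grouped per chirp) and the placeholder probabilities are assumptions for illustration only:

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Placeholder probabilities keep the example self-contained; in practice use
# the predictions from your Subtask 3 model instead.
test_probs = [0.5] * len(test)

rows = []
for chirp_id, chirp, prob in zip(test["id"], test["chirp"], test_probs):
    rows.append((1, chirp_id, len(chirp)))        # Subtask 1: length in characters
    rows.append((2, chirp_id, chirp.count("#")))  # Subtask 2: number of '#' characters
    rows.append((3, chirp_id, prob))              # Subtask 3: probability of problematic

submission = pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"])
submission.to_csv("submission.csv", index=False)
```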
📊 Evaluation Metric: ROC AUC 📈
For Subtask 3, evaluation is based on ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes classifier performance
across all possible decision thresholds.
How ROC AUC is calculated
1. Plot the ROC curve, showing:
   - TPR (True Positive Rate) – proportion of correctly detected problematic chirps
   - FPR (False Positive Rate) – proportion of normal chirps incorrectly flagged as problematic
2. Calculate the area under the curve (AUC) using the trapezoidal rule (see the sketch after this list):
   - Divide the curve into trapezoids using vertical lines at FPR values and horizontal lines at TPR values
   - Sum the trapezoid areas to get the final AUC value
3. Score interpretation:
- ROC AUC = 1 🏆 → perfect classifier, Melon Husk’s dream platform
- ROC AUC = 0.5 🎲 → random classifier, as useful as a Chirper poll
- 0.5 < ROC AUC < 1 📈 → how well the model separates normal vs. problematic chirps
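A minimal NumPy sketch of the trapezoidal computation from step 2; for untied scores it matches standard implementations such as sklearn.metrics.roc_auc_score:

```python
import numpy as np

def roc_auc_trapezoid(y_true, scores):
    """ROC AUC via the trapezoidal rule over (FPR, TPR) points."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))  # sort chirps by descending score
    y_sorted = y_true[order]

    # After each cut-off, count true and false positives, then normalise.
    tps = np.cumsum(y_sorted)
    fps = np.cumsum(1 - y_sorted)
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))

    # Trapezoidal rule: sum the trapezoid areas between consecutive points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# A perfect ranking of the two problematic chirps gives AUC = 1.0
print(roc_auc_trapezoid([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```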