
Chirper and Melon Husk’s Empire (Y)

Author: Mihnea-Teodor Stoica

Difficulty: Medium
Problem Description


Context

In a parallel social media universe, Chirper is the most popular micro-messaging platform.
Recently, it was acquired by the famous (and slightly eccentric) Melon Husk, who decided to rebrand it as Y.

To make Y cleaner and friendlier, Melon Husk has asked your data science team to build a classification model to automatically detect problematic chirps (spam, irrelevant content, or noise), so they can be filtered from the feed.

You have received a dataset of historical chirps and need to build a model
that can classify new chirps.

Two files have been provided:

  • train.csv – historical chirps with a label (problematic = 1 / normal = 0)
  • test.csv – new chirps without labels

Primary goal: predict the probability of a chirp being problematic
(value between 0 and 1, where 0 = definitely normal, 1 = definitely problematic).


📊 Dataset

Each row represents a chirp posted on Chirper (now Y), with the following fields:

  • id – unique identifier for the chirp
  • chirp – text of the chirp
  • label – only in train.csv; 1 (problematic) / 0 (normal)

Final goal: predict label for rows in test.csv.


📝 Tasks

The first two subtasks cover simple text analysis of the chirps.
The final subtask evaluates the performance of your classification model.


Subtask 1 (10 points)

Determine the length of each chirp in characters.

The answer for this subtask is an integer (one per chirp).
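
A minimal sketch, assuming pandas and the column names from the dataset description:

    import pandas as pd

    test = pd.read_csv("test.csv")

    # Length of each chirp in characters (str.len counts Unicode
    # code points, not bytes).
    test["chirp_length"] = test["chirp"].str.len()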


Subtask 2 (15 points)

Count how many times the character # appears in each chirp
(an important indicator for excessive hashtags, loved by spammers 😄).
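
Continuing the sketch above, pandas can count occurrences directly; note that str.count interprets its argument as a regular expression, and '#' has no special meaning there:

    # Number of '#' characters in each chirp.
    test["hashtag_count"] = test["chirp"].str.count("#")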


Subtask 3 (75 points)

Build a classification model that predicts the probability of a chirp
being problematic (p ∈ [0, 1]) for each row in test.csv.

Evaluation uses ROC curve and AUC (Area Under the ROC Curve).
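
One possible baseline, not the intended reference solution: TF-IDF features with logistic regression, a standard starting point for short-text classification. Only the file and column names come from the problem statement; the model choice and its parameters are assumptions:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Word and bigram TF-IDF features feeding a linear classifier.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train["chirp"], train["label"])

    # predict_proba returns [P(normal), P(problematic)] per row;
    # keep the probability of the positive class.
    probs = model.predict_proba(test["chirp"])[:, 1]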


🧮 Evaluation

  • AUC ≥ 0.95 → 75 points
  • AUC ≤ 0.80 → 0 points
  • Score is proportional for values in between

Subtasks 1–2 are scored by exact comparison with the reference answers.
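
The exact scoring formula is not spelled out; assuming "proportional" means linear interpolation between the two thresholds, it would look like this:

    def subtask3_score(auc: float) -> float:
        # Linear ramp from 0 points at AUC = 0.80 to 75 points at
        # AUC = 0.95 (assumed interpretation of "proportional").
        if auc >= 0.95:
            return 75.0
        if auc <= 0.80:
            return 0.0
        return 75.0 * (auc - 0.80) / (0.95 - 0.80)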


📄 Submission File Format

submission.csv must contain three lines for each test row,
one per subtask.

Structure:

subtaskID,datapointID,answer
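
Putting the three subtasks together, a sketch of writing submission.csv; it reuses test and probs from the sketches above and assumes the header row matches the structure line:

    import pandas as pd

    submission = pd.concat([
        pd.DataFrame({"subtaskID": 1, "datapointID": test["id"],
                      "answer": test["chirp"].str.len()}),
        pd.DataFrame({"subtaskID": 2, "datapointID": test["id"],
                      "answer": test["chirp"].str.count("#")}),
        pd.DataFrame({"subtaskID": 3, "datapointID": test["id"],
                      "answer": probs}),
    ])
    submission.to_csv("submission.csv", index=False)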

📊 Evaluation Metric: ROC AUC 📈

For Subtask 3, evaluation is based on ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes classifier performance
across all possible decision thresholds.

How ROC AUC is calculated

  1. Plot the ROC curve, showing:

    • TPR (True Positive Rate) – proportion of correctly detected problematic chirps
    • FPR (False Positive Rate) – proportion of normal chirps incorrectly flagged as problematic
  2. The area under the curve (AUC) is computed with the trapezoidal rule:

    • Split the region under the curve into trapezoids using vertical lines
      at consecutive FPR values; the parallel sides of each trapezoid are
      the TPR values at those points
    • Sum the trapezoid areas to get the final AUC value (see the sketch
      after this list)
  3. Score interpretation:

    • ROC AUC = 1 🏆 → perfect classifier, Melon Husk’s dream platform
    • ROC AUC = 0.5 🎲 → random classifier, as useful as a Chirper poll
    • 0.5 < ROC AUC < 1 📈 → the closer to 1, the better the model separates normal from problematic chirps
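
As a sanity check, the trapezoidal computation can be verified against scikit-learn on a toy example (the labels and scores below are made up):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8])

    # ROC points, ordered by decreasing decision threshold.
    fpr, tpr, _ = roc_curve(y_true, y_score)

    # Trapezoidal rule: area of each slice between consecutive FPR values.
    auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

    assert np.isclose(auc_manual, roc_auc_score(y_true, y_score))  # both 0.75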