Author: Mihai Nan
Your company wants to protect users from unwanted emails (spam).
To achieve this, it has been decided to build an automated system that can
identify spam emails and separate them from legitimate (non-spam) emails.
You have been given a labeled dataset of emails and need to build a model
that can classify new emails.
Two files are provided: train.csv and test.csv.
Each row represents an email with the following attributes:
sample_id - unique identifier
text - email content
label - only in train.csv, 1 (spam) / 0 (non-spam)
The main goal is to predict, for each row in test.csv, the probability that the email is spam (a value between 0 and 1, where 0 = definitely non-spam and 1 = definitely spam).
The first two subtasks check simple email analysis.
The last subtask evaluates the classification model.
Subtask 1: Determine the length of each email in characters.
The output for this subtask should be an integer.
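For illustration, a minimal sketch of Subtask 1 using pandas, assuming the file and column names described above (test.csv, sample_id, text):

```python
import pandas as pd

# Assumed file and column names taken from the problem statement.
test = pd.read_csv("test.csv", dtype={"sample_id": str})

# Subtask 1: length of each email in characters.
test["email_length"] = test["text"].str.len()
```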
Subtask 2: Count how many times the word free appears in the email.
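A possible sketch for Subtask 2; the statement does not say whether matching is case-sensitive or whole-word, so the pattern below is an assumption:

```python
# Subtask 2: count occurrences of the word "free" in each email.
# Assumption: case-insensitive, whole-word matching; adjust the pattern if
# the reference answers use a different convention.
test["free_count"] = test["text"].str.count(r"(?i)\bfree\b")
```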
Subtask 3: Build a classification model that predicts the probability that an email is spam (p ∈ [0, 1]) for each row in the test set.
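One possible baseline for Subtask 3 (an illustration, not the required solution) is TF-IDF features with logistic regression in scikit-learn; it reuses the test DataFrame loaded in the Subtask 1 sketch and the column names from the statement:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv", dtype={"sample_id": str})

# TF-IDF bag-of-words features feeding a linear classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(train["text"], train["label"])

# predict_proba returns probabilities per class; column 1 is the spam class.
spam_prob = model.predict_proba(test["text"])[:, 1]
```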
Subtask 3 is evaluated using the ROC curve and AUC (Area Under the ROC Curve).
Subtasks 1–2 are evaluated exactly (by direct comparison).
The file submission.csv should contain 3 lines for each test row,
corresponding to the 3 subtasks.
Structure:
subtaskID, datapointID, answer
where:
subtaskID - the subtask number (1, 2, or 3)
datapointID - the sample_id of the email
answer - the email length (integer) for Subtask 1, the number of occurrences of free (integer) for Subtask 2, and the spam probability for Subtask 3
Example for sample_id = 00042:
subtaskID,datapointID,answer
1,00042,342
2,00042,3
3,00042,0.742
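A sketch that assembles the three answers into submission.csv, reusing email_length, free_count, and spam_prob from the sketches above (those variable names are assumptions of this illustration, not part of the task):

```python
import pandas as pd

rows = []
for sid, length, freq, prob in zip(test["sample_id"], test["email_length"],
                                   test["free_count"], spam_prob):
    rows.append((1, sid, length))  # Subtask 1: length in characters
    rows.append((2, sid, freq))    # Subtask 2: occurrences of "free"
    rows.append((3, sid, prob))    # Subtask 3: spam probability

submission = pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"])
submission.to_csv("submission.csv", index=False)
```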
For Subtask 3, evaluation is done using ROC AUC (Area Under the ROC Curve).
This is a single measure that summarizes a classifier's performance across all possible decision thresholds.
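To estimate this metric locally before submitting, one option is to hold out part of train.csv and score it with scikit-learn's roc_auc_score (a sketch, assuming the model pipeline and train DataFrame from the Subtask 3 sketch):

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation split.
X_tr, X_val, y_tr, y_val = train_test_split(
    train["text"], train["label"], test_size=0.2,
    stratify=train["label"], random_state=0,
)
model.fit(X_tr, y_tr)
val_prob = model.predict_proba(X_val)[:, 1]
print("validation ROC AUC:", roc_auc_score(y_val, val_prob))
```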

