News Editorial
Author: Mihai Nan
Story
The editorial office of an international news agency manages thousands of articles daily from various fields: economics, politics, science, technology, and environment. To be able to quickly archive and distribute information, each article must be classified into a thematic category.
Due to a technical malfunction, the labels of some recent articles have been lost. The editorial office turns to you to build an intelligent system that can automatically classify news articles based on their content.
Requirement
Two input files are given:
- train.csv – contains news articles for which the category is known
- test.csv – contains news articles without a category
Each article is identified by a unique id and has associated text.
Using the data from train.csv, you must build a classification model that predicts the category (the label column) of each article in test.csv.
The result will be saved in a submission.csv file.
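As a rough illustration of the task, the sketch below trains a dependency-free multinomial Naive Bayes classifier on word counts. This is only one possible baseline, not a method prescribed by the statement. The names `tokenize` and `NaiveBayes`, the toy sentences, and their labels are all hypothetical; the real label meanings have to be inferred from train.csv.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.word_counts.items()}
        return self

    def predict_one(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / n_docs)  # log prior
            denom = self.totals[label] + vocab_size
            for word in tokens:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


# Toy data, made up for illustration only (labels 1 and 2 are arbitrary here).
toy_texts = [
    "stocks fall as oil prices rise",
    "team wins the final match",
    "shares rally on strong earnings",
    "coach praises players after the match",
]
toy_labels = [2, 1, 2, 1]

model = NaiveBayes().fit(toy_texts, toy_labels)
toy_predictions = model.predict(["oil stocks and earnings", "players win match"])
print(toy_predictions)
```

A bag-of-words baseline like this may well fall short of the accuracy thresholds given in the Evaluation section; it is meant only to show the train-then-predict flow.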
File formats
train.csv
Contains the following columns:
- id – unique identifier of the article (string, e.g. 000001)
- text – content of the article
- label – category of the article (integer number)
Example:
id,text,label
000001,"Wall St. Bears Claw Back Into the Black (Reuters)...",2
000002,"Carlyle Looks Toward Commercial Aerospace (Reuters)...",2
000003,"Oil and Economy Cloud Stocks' Outlook (Reuters)...",2
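Because id is a string with leading zeros, a loader should avoid converting it to a number. The sketch below parses the example rows with Python's standard `csv` module; the helper name `read_train` is hypothetical, and the sample is read from memory rather than from the real file.

```python
import csv
import io

# Rows copied from the example above; a real run would open train.csv instead.
SAMPLE_TRAIN = '''id,text,label
000001,"Wall St. Bears Claw Back Into the Black (Reuters)...",2
000002,"Carlyle Looks Toward Commercial Aerospace (Reuters)...",2
000003,"Oil and Economy Cloud Stocks' Outlook (Reuters)...",2
'''


def read_train(handle):
    """Return (id, text, label) tuples; id stays a string, label becomes an int."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((row["id"], row["text"], int(row["label"])))
    return rows


train_rows = read_train(io.StringIO(SAMPLE_TRAIN))
print(train_rows[0][0], train_rows[0][2])
```

For the actual file, something like `open("train.csv", newline="", encoding="utf-8")` could be passed to `read_train`; the UTF-8 encoding is an assumption, since the statement does not specify one.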
test.csv
Contains the following columns:
- id – unique identifier
- text – content of the article
Example:
id,text
120001,"Fears for T N pension after talks Unions represent..."
120002,"The Race is On: Second Private Team Sets Launch..."
120003,"Ky. Company Wins Grant to Study Peptides (AP)..."
submission.csv
The file generated for submission must be in CSV format and contain the following columns:
- id – article identifier
- label – predicted category
Example:
id,label
120001,2
120002,3
120003,3
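A file in this layout can be produced with the standard `csv` module. The sketch below writes to an in-memory buffer so the result can be inspected; the helper name `write_submission` is hypothetical, and the `"\n"` line terminator is an assumption, since the statement does not specify line endings.

```python
import csv
import io


def write_submission(handle, ids, labels):
    """Write the id,label header followed by one row per prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "label"])
    for article_id, label in zip(ids, labels):
        writer.writerow([article_id, label])


# Values taken from the example above.
buffer = io.StringIO()
write_submission(buffer, ["120001", "120002", "120003"], [2, 3, 3])
submission_text = buffer.getvalue()
print(submission_text)
```

To create the real file, a handle from `open("submission.csv", "w", newline="")` could be passed instead of the buffer.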
Notes
- Labels are integer numerical values, and their meaning must be deduced exclusively from train.csv.
- Any natural language processing and machine learning methods are permitted.
- Solution evaluation is based on prediction accuracy.
Evaluation
Predictions will be compared with the real labels and accuracy will be calculated:
accuracy = (number_of_correct_predictions / total_number_of_predictions)
The final score is calculated based on the obtained accuracy using the following rules:
- accuracy ≥ 0.98 → 100 points
- accuracy ≤ 0.9 → 0 points
- For intermediate values, proportional scores between 0 and 100 are awarded.
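The scoring rules can be sketched as follows. The statement only says that intermediate scores are "proportional", so the linear interpolation between 0.9 and 0.98 used here is an assumption, and the function names `accuracy` and `score` are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def score(acc, low=0.90, high=0.98):
    """Map accuracy to points, assuming linear interpolation between low and high."""
    if acc >= high:
        return 100.0
    if acc <= low:
        return 0.0
    return 100.0 * (acc - low) / (high - low)


print(accuracy([2, 3, 3], [2, 3, 1]))
print(score(0.94))
```

Under this assumption, an accuracy of 0.94 sits halfway between the two thresholds and would earn roughly 50 points.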