Author: Mihai Nan
The editorial office of an international news agency handles thousands of articles every day from various fields: economics, politics, science, technology, and the environment. To archive and distribute information quickly, each article must be classified into a thematic category.
Due to a technical malfunction, the labels of some recent articles have been lost. The editorial office turns to you to build an intelligent system that can automatically classify news articles based on their content.
Two input files are given:
train.csv – contains news articles for which the category is known
test.csv – contains news articles without a category
Each article is identified by a unique id and has associated text.
Using the data from train.csv, you must build a classification model that predicts the category (the label column) of each article in test.csv.
The result will be saved in a submission.csv file.
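As an illustration of what such a classifier could look like, here is a minimal sketch of a bag-of-words multinomial Naive Bayes baseline written with the standard library only. The statement does not prescribe any method, so this choice is an assumption. The function names (`tokenize`, `train_naive_bayes`, `predict`), the toy sentences, and the toy label values are all hypothetical and are not taken from the dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep alphabetic runs; a deliberately simple tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(texts, labels):
    # Count documents per label and word occurrences per label.
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for illustration; real training would use train.csv.
toy_texts = [
    "stocks fall as oil prices rise",
    "bank profits and market outlook",
    "new rocket launch for space team",
    "scientists study peptides in lab",
]
toy_labels = [2, 2, 3, 3]  # arbitrary label values for the toy example

model = train_naive_bayes(toy_texts, toy_labels)
prediction = predict(model, "oil market stocks")
```

A stronger submission would likely swap in a better feature representation or model; the sketch only shows the train-then-predict shape of the task.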
train.csv
Contains the following columns:
id – unique identifier of the article (string, e.g. 000001)
text – content of the article
label – category of the article (integer)
Example:
id,text,label
000001,"Wall St. Bears Claw Back Into the Black (Reuters)...",2
000002,"Carlyle Looks Toward Commercial Aerospace (Reuters)...",2
000003,"Oil and Economy Cloud Stocks' Outlook (Reuters)...",2
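Because ids such as 000001 carry leading zeros, it can help to read them as strings rather than numbers. The sketch below parses two of the sample rows above with the standard `csv` module. The helper name `load_rows` is hypothetical, and opening the real file with `open("train.csv", newline="", encoding="utf-8")` assumes a UTF-8 encoding that the statement does not specify.

```python
import csv
import io

# Sample rows copied from the example above; a real run would pass an open
# file handle for train.csv instead of this in-memory string.
sample = '''id,text,label
000001,"Wall St. Bears Claw Back Into the Black (Reuters)...",2
000002,"Carlyle Looks Toward Commercial Aerospace (Reuters)...",2
'''


def load_rows(handle):
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "id": row["id"],            # kept as a string to preserve leading zeros
            "text": row["text"],
            "label": int(row["label"]), # label is documented as an integer
        })
    return rows


train_rows = load_rows(io.StringIO(sample))
```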
test.csv
Contains the following columns:
id – unique identifier
text – content of the article
Example:
id,text
120001,"Fears for T N pension after talks Unions represent..."
120002,"The Race is On: Second Private Team Sets Launch..."
120003,"Ky. Company Wins Grant to Study Peptides (AP)..."
submission.csv
The file generated for submission must be in CSV format and contain the following columns:
id – article identifier
label – predicted category
Example:
id,label
120001,2
120002,3
120003,3
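One possible way to produce a file in this shape is sketched below with the standard `csv` module. The helper name `write_submission` is hypothetical. The example writes to an in-memory buffer; a real run would pass a handle from `open("submission.csv", "w", newline="")`.

```python
import csv
import io


def write_submission(handle, ids, labels):
    # Header first, then one "id,label" row per test article.
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "label"])
    for article_id, label in zip(ids, labels):
        writer.writerow([article_id, label])


# The ids and labels below reproduce the example rows shown above.
buffer = io.StringIO()
write_submission(buffer, ["120001", "120002", "120003"], [2, 3, 3])
submission_text = buffer.getvalue()
```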
Predictions will be compared with the true labels, and accuracy will be calculated as follows:
accuracy = (number_of_correct_predictions / total_number_of_predictions)
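The formula can be checked locally on a held-out part of train.csv. The sketch below is a direct transcription of it; the function name `accuracy` and the sample label lists are illustrative assumptions, not real data.

```python
def accuracy(predicted, actual):
    # Fraction of positions where the predicted label equals the true label.
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: three of the four predictions match.
score = accuracy([2, 3, 3, 1], [2, 3, 1, 1])
```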
The final score is calculated based on the obtained accuracy using the following rules: