Language identification of a text
Author: Mihai Nan
✍️ Language Identification of a Text 🌐
In a globalized world, messages arrive from everywhere and can be
written in many languages. To process these texts automatically (for
example, in machine translation applications), it is essential to
identify the language of each text.
Your task is to build an automatic system that determines the
language of a text by training a model on a set of labeled examples.
🌍 Possible Languages
The automatic system must work for the following languages:
Swedish, French, Korean, Japanese, Portuguese, English,
Persian, Pushto, Thai, Romanian, Tamil, Spanish, Turkish,
Estonian, Chinese, Arabic, Urdu, Hindi, Latin, Russian,
Indonesian, Dutch
📁 Dataset
You have two CSV files available:
- train.csv -- contains texts and their languages
- test.csv -- contains texts for which the model must predict the
language
Each row in train.csv has the following columns:
- SampleID -- the unique identifier of the text
- Text -- the original text
- language -- the language of the text
Example:
SampleID,Text,language
S1,"klement gottwaldi surnukeha palsameeriti ning ...",Estonian
S2,"sebes joseph pereira thomas på eng the jesuit...",Swedish
S3,"de spons behoort tot het geslacht haliclona en...",Dutch
Each row in test.csv has the following columns:
- SampleID -- the unique identifier of the text
- Text -- the text for which the language must be predicted
Example:
SampleID,Text
S1001,"ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ..."
S1002,"விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர..."
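As a rough illustration of the file layout described above, the rows can be read with Python's standard csv module. The helper name read_rows is hypothetical, and the in-memory sample only stands in for train.csv so that the sketch runs without the real file.

```python
import csv
import io


def read_rows(handle):
    """Return the CSV rows from an open text handle as a list of dicts."""
    return list(csv.DictReader(handle))


# With the real file this would look like:
#   with open("train.csv", newline="", encoding="utf-8") as f:
#       train_rows = read_rows(f)
# Here a small in-memory sample (two rows from the example above) is used.
sample = io.StringIO(
    'SampleID,Text,language\n'
    'S1,"klement gottwaldi surnukeha palsameeriti ning ...",Estonian\n'
    'S3,"de spons behoort tot het geslacht haliclona en...",Dutch\n'
)
train_rows = read_rows(sample)

# Split the rows into parallel lists of inputs and labels.
texts = [row["Text"] for row in train_rows]
labels = [row["language"] for row in train_rows]
```

Opening the files with encoding="utf-8" is an assumption, but a likely one given the Thai and Tamil examples.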
🧠 Task (100 points)
Build a system that identifies the language of each text in
test.csv.
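One possible starting point, not a prescribed solution, is a character n-gram Naive Bayes classifier. The sketch below uses only the standard library. The class name NgramNaiveBayes, the trigram size, and the toy training sentences are all assumptions made for illustration and are not part of the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character trigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        self.vocab_size = len({g for c in self.counts.values() for g in c})
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            # Log prior plus smoothed log likelihood of every n-gram.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + self.vocab_size
            for gram in grams:
                score += math.log((label_counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this sketch only.
toy_texts = [
    "the cat sat on the mat",
    "the dog and the bird",
    "le chat est sur le tapis",
    "le chien et le oiseau",
]
toy_labels = ["English", "English", "French", "French"]

model = NgramNaiveBayes().fit(toy_texts, toy_labels)
prediction_en = model.predict("the bird sat")
prediction_fr = model.predict("le chat et le chien")
```

Character n-grams are a common choice here because they need no tokenizer, which matters for languages written without spaces. Whether this baseline reaches the required accuracy on the real data is not known.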
Predictions must be saved in a submission.csv file with the format:
SampleID,language
S1001,Thai
S1002,Tamil
S1003,Swedish
where:
- SampleID -- the unique identifier of the text in test.csv
- language -- the language predicted by your system, which must be
one of the following possible languages:
Swedish, French, Korean, Japanese, Portuguese, English,
Persian, Pushto, Thai, Romanian, Tamil, Spanish, Turkish,
Estonian, Chinese, Arabic, Urdu, Hindi, Latin, Russian,
Indonesian, Dutch
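A minimal sketch of writing the submission file in this format is shown below. It also rejects any label outside the allowed list. The function name write_submission is hypothetical, and an in-memory buffer stands in for submission.csv.

```python
import csv
import io

# The 22 allowed labels, copied from the list above.
LANGUAGES = {
    "Swedish", "French", "Korean", "Japanese", "Portuguese", "English",
    "Persian", "Pushto", "Thai", "Romanian", "Tamil", "Spanish", "Turkish",
    "Estonian", "Chinese", "Arabic", "Urdu", "Hindi", "Latin", "Russian",
    "Indonesian", "Dutch",
}


def write_submission(handle, sample_ids, predictions):
    """Write SampleID,language rows after checking every label is allowed."""
    unknown = set(predictions) - LANGUAGES
    if unknown:
        raise ValueError(f"labels outside the allowed set: {sorted(unknown)}")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SampleID", "language"])
    writer.writerows(zip(sample_ids, predictions))


# With a real file this would look like:
#   with open("submission.csv", "w", newline="", encoding="utf-8") as f:
#       write_submission(f, ids, predictions)
buffer = io.StringIO()
write_submission(buffer, ["S1001", "S1002"], ["Thai", "Tamil"])
```

Checking labels before writing catches spelling mismatches (for example "Pashto" instead of "Pushto") that would otherwise count as wrong predictions.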
📊 Evaluation
Predictions will be compared with the true languages, and accuracy
is computed as:
accuracy = (number_of_correct_predictions / total_number_of_predictions)
The final score is calculated based on the accuracy obtained using the
following rules:
- accuracy ≥ 0.98 → 100 points
- accuracy ≤ 0.9 → 0 points
- Intermediate values receive a proportional score between 0 and 100.
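The scoring rules can be sketched as follows, assuming that "proportional" means linear interpolation between the two thresholds. That reading is an assumption, and the function names accuracy and points are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def points(acc, low=0.90, high=0.98):
    """Map accuracy to a 0-100 score, linear between low and high (assumed)."""
    if acc >= high:
        return 100.0
    if acc <= low:
        return 0.0
    return 100.0 * (acc - low) / (high - low)


example_accuracy = accuracy(["Thai", "Tamil"], ["Thai", "Dutch"])
midpoint_score = points(0.94)
```

Under this assumption, an accuracy of 0.94 sits halfway between the thresholds and would earn about 50 points.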