
Language identification of a text

Author: Mihai Nan

Medium
Problem Description

✍️ Language Identification of a Text 🌐

In a globalized world, messages arrive from everywhere and can be
written in many languages. To process these texts automatically (for
example, in machine translation applications), it is essential to
identify the language of each text.

Your task is to build a system that determines the language of a text
by training a model on a set of labeled examples.


🌍 Possible Languages

The automatic system must work for the following languages:

Swedish, French, Korean, Japanese, Portuguese, English,
Persian, Pushto, Thai, Romanian, Tamil, Spanish, Turkish,
Estonian, Chinese, Arabic, Urdu, Hindi, Latin, Russian,
Indonesian, Dutch


📁 Dataset

You have two CSV files available:

  • train.csv -- contains texts and their languages
  • test.csv -- contains texts for which the model must predict the
    language

Each row in train.csv has the following columns:

  • SampleID -- the unique identifier of the text
  • Text -- the original text
  • language -- the language of the text

Example:

SampleID,Text,language
S1,"klement gottwaldi surnukeha palsameeriti ning ...",Estonian
S2,"sebes joseph pereira thomas på eng the jesuit...",Swedish
S3,"de spons behoort tot het geslacht haliclona en...",Dutch

Each row in test.csv has the following columns:

  • SampleID -- the unique identifier of the text
  • Text -- the text for which the language must be predicted

Example:

SampleID,Text
S1001,"ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ..."
S1002,"விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர..."
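As an illustration of the file layout described above, here is a minimal sketch that parses the train.csv example rows with Python's standard csv module. The helper name read_rows and the inline sample string are hypothetical; the UTF-8 encoding mentioned in the comment is an assumption, since the statement does not specify one.

```python
import csv
import io

# Inline copy of the train.csv example rows, used so the sketch runs on its own.
# For the real files one would instead do something like:
#   with open("train.csv", newline="", encoding="utf-8") as f:  # encoding assumed
#       train_rows = read_rows(f)
TRAIN_SAMPLE = '''SampleID,Text,language
S1,"klement gottwaldi surnukeha palsameeriti ning ...",Estonian
S2,"sebes joseph pereira thomas på eng the jesuit...",Swedish
S3,"de spons behoort tot het geslacht haliclona en...",Dutch
'''


def read_rows(handle):
    """Return all CSV rows as dicts keyed by the header names."""
    return list(csv.DictReader(handle))


train_rows = read_rows(io.StringIO(TRAIN_SAMPLE))
texts = [row["Text"] for row in train_rows]
labels = [row["language"] for row in train_rows]
print(labels)
```

The same helper works for test.csv, where each row has only the SampleID and Text keys.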

🧠 Task (100 points)

Build a system that identifies the language of each text in test.csv.
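One possible baseline, sketched here under stated assumptions, is a naive Bayes classifier over character bigrams. This is not a method prescribed by the problem, only one common approach to language identification. The class name NgramNaiveBayes, the helper char_ngrams, and the toy training data (taken from the example rows) are all hypothetical, and a sketch this small is not expected to reach the accuracy thresholds on its own.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, padding with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen n-grams
        best_label, best_score = None, -math.inf
        for label in self.docs:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + vocab_size
            for gram in grams:
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data taken from the train.csv example rows.
model = NgramNaiveBayes(n=2).fit(
    [
        "klement gottwaldi surnukeha palsameeriti ning",
        "sebes joseph pereira thomas på eng the jesuit",
        "de spons behoort tot het geslacht haliclona en",
    ],
    ["Estonian", "Swedish", "Dutch"],
)
print(model.predict("het geslacht"))
```

Character-level features are used because they need no tokenizer, which matters for languages in the list that do not separate words with spaces, such as Thai, Chinese, and Japanese.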

Predictions must be saved in a submission.csv file with the format:

SampleID,language
S1001,Thai
S1002,Tamil
S1003,Swedish

where:

  • SampleID -- the unique identifier of the text in test.csv
  • language -- the language predicted by your system, which must be
    one of the following possible languages:

Swedish, French, Korean, Japanese, Portuguese, English,
Persian, Pushto, Thai, Romanian, Tamil, Spanish, Turkish,
Estonian, Chinese, Arabic, Urdu, Hindi, Latin, Russian,
Indonesian, Dutch
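The submission format above can be produced with a short helper such as the following sketch. The names ALLOWED_LANGUAGES and write_submission are hypothetical, and rejecting any label whose spelling differs from the list is an assumption about how strictly the grader matches names.

```python
import csv
import io

# The 22 language names, spelled exactly as in the problem statement.
ALLOWED_LANGUAGES = {
    "Swedish", "French", "Korean", "Japanese", "Portuguese", "English",
    "Persian", "Pushto", "Thai", "Romanian", "Tamil", "Spanish", "Turkish",
    "Estonian", "Chinese", "Arabic", "Urdu", "Hindi", "Latin", "Russian",
    "Indonesian", "Dutch",
}


def write_submission(handle, predictions):
    """Write (SampleID, language) pairs as CSV, rejecting unknown labels."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SampleID", "language"])
    for sample_id, language in predictions:
        if language not in ALLOWED_LANGUAGES:
            raise ValueError(f"{sample_id}: unexpected language {language!r}")
        writer.writerow([sample_id, language])


# An in-memory buffer keeps the sketch self-contained; for a real run, pass
# open("submission.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_submission(
    buffer,
    [("S1001", "Thai"), ("S1002", "Tamil"), ("S1003", "Swedish")],
)
print(buffer.getvalue())
```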


📊 Evaluation

Predictions are compared with the true languages, and accuracy is
computed as:

accuracy = (number_of_correct_predictions / total_number_of_predictions)
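The accuracy formula above can be checked locally on a held-out part of train.csv. A minimal sketch follows; the function name accuracy and the sample labels are illustrative only.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Two of the three predictions match the true labels.
value = accuracy(["Thai", "Tamil", "Dutch"], ["Thai", "Tamil", "Swedish"])
print(value)
```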

The final score is calculated based on the accuracy obtained using the
following rules:

  • accuracy ≥ 0.98 → 100 points
  • accuracy ≤ 0.9 → 0 points
  • Intermediate values receive a proportional score between 0 and 100.
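The scoring rules above can be sketched as a small function. The statement says only that intermediate values receive a "proportional" score; reading this as linear interpolation between 0.9 and 0.98 is an assumption, and the function name points_for_accuracy is hypothetical.

```python
def points_for_accuracy(accuracy, low=0.90, high=0.98):
    """Map accuracy to points: 0 at or below `low`, 100 at or above `high`.

    Values in between are interpolated linearly (an assumed reading of
    "proportional", not confirmed by the problem statement).
    """
    if accuracy >= high:
        return 100.0
    if accuracy <= low:
        return 0.0
    return 100.0 * (accuracy - low) / (high - low)


for acc in (0.85, 0.94, 0.99):
    print(acc, round(points_for_accuracy(acc), 2))
```

Under this reading, an accuracy of 0.94 sits halfway between the two thresholds and would earn about 50 points.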