Text correction

Author: Mihai Nan

Hard

Problem Description

✍️ Text Correction in the Era of Fast Communication 💬

In a world where messages are increasingly short, typed quickly on phones, and shared instantly on social media, the quality of written language begins to degrade. Grammar mistakes, omitted words, and ambiguous phrasing become more and more common.

To analyze and improve such texts, an educational platform is developing an automated system capable of transforming hastily written sentences into correct and clear versions. Your role is to help them by proposing an automatic system that corrects texts from a grammatical perspective.

📁 Dataset

You have two files:

train.csv — contains incorrect texts and their corrected versions
test.csv — contains incorrect texts for which the model must generate corrected versions

Each row in train.csv has the following columns:

SampleID — unique identifier
Text — the original sentence, as written by the user
RevisedText — the expert-corrected version

Example:

SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
1382, "He could have bought the house if he has enough money.", "He could have bought the house if he had enough money."
241, "I have a meeting with a principal of a school.", "I have a meeting with the principal of the school."
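
The rows above put a space after each comma, before the opening quote. A minimal sketch of reading such rows with Python's standard `csv` module follows. It assumes the real train.csv uses the same layout as the example; `read_pairs` and `SAMPLE` are hypothetical names used only for illustration.

```python
import csv
import io

# Two rows copied from the example above; note the space after each comma.
SAMPLE = '''SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
1382, "He could have bought the house if he has enough money.", "He could have bought the house if he had enough money."
'''

def read_pairs(handle):
    """Return a list of (sample_id, text, revised_text) tuples."""
    # skipinitialspace=True makes the parser ignore the space that follows
    # each delimiter, so the quoted fields are still recognized as quoted.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [(int(row["SampleID"]), row["Text"], row["RevisedText"])
            for row in reader]

pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs[0][2])
```

For the real file, the same function can be called on `open("train.csv", newline="", encoding="utf-8")`. The option is harmless if the file turns out to have no spaces after the commas.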

🧠 Task (100 points)

Build a system that generates the grammatically correct version of the sentences in test.csv.

The evaluation system computes the final score by combining two essential aspects:

semantic similarity
lexical differences

Evaluation formula:

final_score = 0.7 * cosine_similarity + 0.3 * edit_distance_score

final_score ≥ 0.95 → 100 points
final_score ≤ 0.9 → 0 points
Intermediate values receive proportional scores.
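
The thresholds above can be sketched as a small function. This is an illustration only: it assumes "proportional" means linear interpolation between 0.9 and 0.95, and `points_from_score` is a hypothetical name, not part of the evaluation system.

```python
def points_from_score(final_score, low=0.90, high=0.95, max_points=100.0):
    """Map a final_score to points under the thresholds stated above."""
    if final_score >= high:
        return max_points
    if final_score <= low:
        return 0.0
    # ASSUMPTION: intermediate values are interpolated linearly.
    return max_points * (final_score - low) / (high - low)

for score in (0.89, 0.925, 0.97):
    print(score, points_from_score(score))
```

Under this reading, a final score halfway between the two thresholds earns about half of the points.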

🔹 Cosine Similarity and Edit Distance in Text Correction

To build an automatic text correction system, the evaluation must consider both the meaning of the sentence and the exact lexical differences. This is where `cosine_similarity` and `edit_distance` come in.

Cosine Similarity

Definition:
Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them:

$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$

$A \cdot B$ is the dot product
$\|A\|$ and $\|B\|$ are the vector norms

In NLP:

Each sentence can be represented using a vector embedding (e.g., BERT, Sentence-BERT).
Cosine similarity indicates how semantically similar two sentences are, even if they use different words.

Example:

Text 1: "She forgot her umbrella."
Text 2: "She left her umbrella behind."

Semantically: same idea → high cosine similarity

Relevance:
Ensures that the corrected text preserves the same meaning as the original.
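
The definition above can be sketched in plain Python. The vectors here are toy values, not real sentence embeddings. In practice they would come from an embedding model, and the statement does not say which model the evaluator uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (illustrative values only).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Vectors pointing in the same direction give a value close to 1, and orthogonal vectors give 0, regardless of their lengths.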

Edit Distance (Levenshtein Distance)

Definition:
Edit distance measures the minimum number of operations (insertions, deletions, substitutions) needed to transform one string into another.

Example:

Original: "She forgot her umbrella it started to rain."
Corrected: "She forgot her umbrella, and then it started to rain."

Operations: insertion of a comma and of the words "and then"
Edit distance = 10 characters (the inserted string ", and then")

Relevance:
Reflects lexical differences and grammatical correctness.
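
A minimal sketch of the character-level Levenshtein distance, using the standard dynamic-programming recurrence. Whether the evaluator works on characters or on tokens is not stated, so the character level is an assumption here.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # → 3
print(levenshtein(
    "He could have bought the house if he has enough money.",
    "He could have bought the house if he had enough money.",
))  # → 1
```

The second call uses a pair from the training example: the only change is "has" to "had", a single substitution.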

Combined Final Metric

Final score formula:

$\text{final\_score} = 0.7 \cdot \text{cosine\_similarity} + 0.3 \cdot \text{edit\_distance\_score}$

0.7 × cosine similarity: prioritizes sentence meaning
0.3 × edit distance score: checks lexical fidelity and correction quality

This combination ensures that the corrected text is both grammatically accurate and semantically faithful to the original.
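
The weighted combination can be sketched as follows. The statement does not say how a raw edit distance is turned into `edit_distance_score`. The normalization below (one minus the distance divided by the longer length) is an assumption made for illustration only, and the function names are hypothetical.

```python
def edit_distance_score(distance, len_a, len_b):
    """Turn a raw edit distance into a similarity in [0, 1].

    ASSUMPTION: normalized by the longer string; the real evaluator
    may use a different formula.
    """
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - distance / longest

def final_score(cosine_similarity, edit_score):
    """Weighted sum from the evaluation formula in the statement."""
    return 0.7 * cosine_similarity + 0.3 * edit_score

# Illustrative numbers only: distance 1 between two 54-character strings.
edit_score = edit_distance_score(1, 54, 54)
print(final_score(0.99, edit_score))
```

Because cosine similarity carries the larger weight, an output that is lexically close but changes the meaning is penalized more than one that keeps the meaning and differs by a few characters.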

📄 Submission File Format

The submission.csv file must contain one row for each test sentence.

The first line should be:

DatapointID, RevisedText

Where:

DatapointID — the SampleID from the test
RevisedText — the corrected sentence

Example (SampleID = `1557`):

1557, "He didn't eat any breakfast this morning."
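
A minimal sketch of producing the submission content with Python's `csv` module. It assumes the header names are `DatapointID` and `RevisedText` and that standard CSV quoting is accepted; the exact spacing and quoting expected by the grader are not confirmed by the statement. `build_submission` is a hypothetical helper name.

```python
import csv
import io

def build_submission(predictions):
    """Return submission.csv content for (sample_id, revised_text) pairs."""
    buffer = io.StringIO()
    # Default quoting: a field is quoted only when it contains a comma,
    # a quote character, or a line break.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["DatapointID", "RevisedText"])
    for sample_id, revised_text in predictions:
        writer.writerow([sample_id, revised_text])
    return buffer.getvalue()

content = build_submission([
    (1557, "He didn't eat any breakfast this morning."),
    (747, "She forgot her umbrella, and then it started to rain."),
])
print(content)
```

To save the result, write `content` to `submission.csv` with `open("submission.csv", "w", newline="", encoding="utf-8")`. Letting the `csv` module handle quoting keeps sentences that contain commas in a single column.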
