Text correction
Author: Mihai Nan
✍️ Text Correction in the Era of Fast Communication 💬
In a world where messages are increasingly short, typed quickly on phones, and shared instantly on social media, the quality of written language begins to degrade. Grammar mistakes, omitted words, and ambiguous phrasing become more and more common.
To analyze and improve such texts, an educational platform is developing an automated system capable of transforming hastily written sentences into correct and clear versions. Your role is to help them by proposing an automatic system that corrects texts from a grammatical perspective.
📁 Dataset
You have two files:
- train.csv — contains incorrect texts and their corrected versions
- test.csv — contains incorrect texts for which the model must generate corrected versions
Each row in train.csv has the following columns:
- SampleID: unique identifier
- Text: the original sentence, as written by the user
- RevisedText: the expert-corrected version
Example:
SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
1382, "He could have bought the house if he has enough money.", "He could have bought the house if he had enough money."
241, "I have a meeting with a principal of a school.", "I have a meeting with the principal of the school."
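As a rough sketch of reading this layout, the snippet below parses two of the example rows with Python's standard `csv` module. It assumes the real train.csv uses the same spacing as the example (a space between the comma and the opening quote), which is why `skipinitialspace=True` is set. The helper name `load_pairs` is hypothetical.

```python
import csv
import io

# Rows copied from the example above; in practice, pass open("train.csv", newline="") instead.
SAMPLE = '''SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
1382, "He could have bought the house if he has enough money.", "He could have bought the house if he had enough money."
'''


def load_pairs(handle):
    """Return a list of (sample_id, text, revised_text) tuples."""
    # skipinitialspace=True ignores the space that follows each comma,
    # so the quoted fields (which may contain commas) are parsed correctly.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [(int(row["SampleID"]), row["Text"], row["RevisedText"]) for row in reader]


pairs = load_pairs(io.StringIO(SAMPLE))
for sample_id, text, revised in pairs:
    print(sample_id, "|", text, "->", revised)
```

If the actual file has no space after the commas, the same call still works, since the option only skips whitespace that is present.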
🧠 Task (100 points)
Build a system that generates the grammatically correct version of the sentences in test.csv.
The evaluation system computes the final score by combining two essential aspects:
- semantic similarity
- lexical differences
Evaluation formula:
final_score = 0.7 * cosine_similarity + 0.3 * edit_distance_score
- final_score ≥ 0.95 → 100 points
- final_score ≤ 0.9 → 0 points
- Intermediate values receive proportional scores.
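The scoring rule above can be sketched as follows. This is an illustration, not the official grader: it assumes "proportional" means linear interpolation between the two thresholds, and that both inputs already lie in the range 0 to 1. The function names are hypothetical.

```python
def final_score(cosine_similarity, edit_distance_score):
    """Weighted combination taken from the evaluation formula."""
    return 0.7 * cosine_similarity + 0.3 * edit_distance_score


def points(score, low=0.90, high=0.95):
    """Map a final score to points; linear interpolation is an assumption."""
    if score >= high:
        return 100.0
    if score <= low:
        return 0.0
    return 100.0 * (score - low) / (high - low)


# Example: perfect semantic match, but only 0.8 on the lexical component.
score = final_score(1.0, 0.8)
print(round(score, 4), round(points(score), 2))
```

Under these assumptions, the window between 0 and 100 points is only 0.05 wide, so small changes in either component move the result considerably.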
🔹 Cosine Similarity and Edit Distance in Text Correction
To build an automatic text correction system, the evaluation must consider both the meaning of the sentence and the exact lexical differences. This is where `cosine_similarity` and `edit_distance` come in.
Cosine Similarity
Definition:
Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them:
$$\cos(\theta) = \frac{A \cdot B}{|A| \, |B|}$$
where $A \cdot B$ is the dot product and $|A|$ and $|B|$ are the vector norms.
In NLP:
- Each sentence can be represented using a vector embedding (e.g., BERT, Sentence-BERT).
- Cosine similarity indicates how semantically similar two sentences are, even if they use different words.
Example:
Text 1: "She forgot her umbrella."
Text 2: "She left her umbrella behind."
- Semantically: same idea → high cosine similarity
Relevance:
Ensures that the corrected text preserves the same meaning as the original.
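A minimal sketch of the computation, dividing the dot product by the product of the two vector norms. The three-dimensional vectors are made-up stand-ins for sentence embeddings; a real system would obtain much longer vectors from an embedding model, and the statement does not say which model the grader uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real embeddings).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.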
Edit Distance (Levenshtein Distance)
Definition:
Edit distance measures the minimum number of operations (insertions, deletions, substitutions) needed to transform one string into another.

Example:
Original: "She forgot her umbrella it started to rain."
Corrected: "She forgot her umbrella, and then it started to rain."
- Operations: insertion of ", and then" (one comma, two spaces, and the words "and" and "then")
- Edit distance = 10 characters
Relevance:
Reflects lexical differences and grammatical correctness.
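A minimal sketch of character-level Levenshtein distance using the standard dynamic-programming recurrence. The statement does not say how the raw distance becomes `edit_distance_score`, so the normalization in the second function (one minus distance divided by the longer length) is purely an assumption for illustration.

```python
def levenshtein(source, target):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def edit_distance_score(prediction, reference):
    """ASSUMPTION: similarity in [0, 1], normalized by the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


print(levenshtein("kitten", "sitting"))  # 3
original = "She forgot her umbrella it started to rain."
corrected = "She forgot her umbrella, and then it started to rain."
print(levenshtein(original, corrected), round(edit_distance_score(original, corrected), 3))
```

Keeping only two rows of the table keeps memory proportional to the length of the target string, which is enough for sentence-sized inputs.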
Combined Final Metric
Final score formula:
final_score = 0.7 * cosine_similarity + 0.3 * edit_distance_score
- 0.7 * cosine_similarity: prioritizes sentence meaning
- 0.3 * edit_distance_score: checks lexical fidelity and correction quality
This combination ensures that the corrected text is both grammatically accurate and semantically faithful to the original.
📄 Submission File Format
The submission.csv file must contain one row for each test sentence.
The first line should be:
DatapointID, RevisedText
Where:
- DatapointID: the SampleID from test.csv
- RevisedText: the corrected sentence
Example (SampleID = 1557):
1557, "He didn't eat any breakfast this morning."
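A minimal sketch of producing a file in this format with Python's `csv` module. The `correct` function is a hypothetical placeholder that returns its input unchanged; a real system would substitute its model there. The writer emits the header without a space after the comma and quotes fields only when needed, on the assumption that the grader parses the file as standard CSV.

```python
import csv
import io


def correct(text):
    """Hypothetical placeholder for the correction model: returns the input as is."""
    return text


def write_submission(test_rows, handle):
    """Write (sample_id, text) pairs as DatapointID,RevisedText rows."""
    writer = csv.writer(handle)
    writer.writerow(["DatapointID", "RevisedText"])
    for sample_id, text in test_rows:
        writer.writerow([sample_id, correct(text)])


# In practice, pass open("submission.csv", "w", newline="") instead of a buffer.
buffer = io.StringIO()
write_submission([(1557, "He didn't eat any breakfast this morning.")], buffer)
print(buffer.getvalue())
```

Letting the `csv` module handle quoting avoids malformed rows when a corrected sentence contains commas or quotation marks.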