Author: Mihai Nan
In a world where messages are increasingly short, typed quickly on phones, and shared instantly on social media, the quality of written language begins to degrade. Grammar mistakes, omitted words, and ambiguous phrasing become more and more common.
To analyze and improve such texts, an educational platform is developing an automated system that transforms hastily written sentences into correct, clear versions. Your task is to help by proposing a system that automatically corrects the grammar of these texts.
You have two files: train.csv and test.csv.
Each row in train.csv has the following columns:
SampleID: unique identifier
Text: the original sentence, as written by the user
RevisedText: the expert-corrected version
Example:
SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
1382, "He could have bought the house if he has enough money.", "He could have bought the house if he had enough money."
241, "I have a meeting with a principal of a school.", "I have a meeting with the principal of the school."
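The sample rows above put a space after each comma, before the opening quote. Assuming the real train.csv follows the same layout (the statement does not confirm this), a minimal sketch for parsing it with Python's standard `csv` module could look like the following. The helper name `read_pairs` is hypothetical.

```python
import csv
import io

def read_pairs(csv_text):
    """Parse CSV text with SampleID, Text, RevisedText columns into a list of dicts."""
    # skipinitialspace=True makes the reader ignore the space after each comma,
    # so the following quote is treated as the start of a quoted field and
    # commas inside the sentence do not split it.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {
            "SampleID": int(row["SampleID"]),
            "Text": row["Text"],
            "RevisedText": row["RevisedText"],
        }
        for row in reader
    ]

# One row copied from the example above.
sample = '''SampleID,Text,RevisedText
747, "She forgot her umbrella it started to rain.", "She forgot her umbrella, and then it started to rain."
'''
pairs = read_pairs(sample)
```

To read the actual file, the same reader options can be passed to `csv.DictReader` on a file opened with `newline=""`.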
Build a system that generates the grammatically correct version of the sentences in test.csv.
The evaluation system computes the final score by combining two essential aspects: semantic similarity (cosine_similarity) and lexical closeness (edit_distance_score).
Evaluation formula:
final_score = 0.7 * cosine_similarity + 0.3 * edit_distance_score
Evaluating an automatic text correction system requires considering both the meaning of the sentence and the exact lexical differences. This is where `cosine_similarity` and `edit_distance` come in.
Definition:
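Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them:
cosine_similarity(A, B) = (A · B) / (|A| * |B|)
where A · B is the dot product, and |A| and |B| are the vector norms.
In NLP, A and B are vector representations of the two texts being compared.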
Example:
Text 1: "She forgot her umbrella."
Text 2: "She left her umbrella behind."
Relevance:
Ensures that the corrected text preserves the same meaning as the original.
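The statement does not say which vector representation the evaluator uses. As an illustration only, the sketch below applies the cosine formula to simple word-count vectors, which is an assumption made here to keep the example self-contained. Word counts capture word overlap rather than meaning, so the numbers will differ from those of an embedding-based evaluator. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens; a stand-in for the evaluator's unknown vectors."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)

t1 = bag_of_words("She forgot her umbrella.")
t2 = bag_of_words("She left her umbrella behind.")
score = cosine_similarity(t1, t2)  # 3 shared words / (sqrt(4) * sqrt(5)), about 0.67
```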
Definition:
Edit distance measures the minimum number of operations (insertions, deletions, substitutions) needed to transform one string into another.

Example:
Original: "She forgot her umbrella it started to rain."
Corrected: "She forgot her umbrella, and then it started to rain."
Relevance:
Reflects lexical differences and grammatical correctness.
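A minimal sketch of the edit distance defined above, computed at character level with the standard dynamic-programming recurrence. How the evaluator turns the distance into `edit_distance_score` is not stated; the normalization used here (one minus the distance divided by the length of the longer string) is an assumption.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (character level)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def edit_distance_score(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest

original = "She forgot her umbrella it started to rain."
corrected = "She forgot her umbrella, and then it started to rain."
distance = edit_distance(original, corrected)  # 10: the inserted ", and then"
```

Under the assumed normalization, this pair scores 1 - 10/53, roughly 0.81.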
Final score formula:
final_score = 0.7 * cosine_similarity + 0.3 * edit_distance_score
This combination ensures that the corrected text is both grammatically accurate and semantically faithful to the original.
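As a worked example of the weighting, the sketch below combines two component scores. The input values 0.9 and 0.8 are made up for illustration and are not results from the task.

```python
def final_score(cosine_similarity, edit_distance_score):
    """Weighted combination from the evaluation formula."""
    return 0.7 * cosine_similarity + 0.3 * edit_distance_score

# Hypothetical component scores: 0.7 * 0.9 + 0.3 * 0.8 = 0.63 + 0.24 = 0.87
example = final_score(0.9, 0.8)
```

Because cosine similarity carries 70% of the weight, preserving meaning counts for more than matching the reference text character for character.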
The submission.csv file must contain one row for each test sentence.
The first line should be:
DatapointID, RevisedText
Where:
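DatapointID: the SampleID from test.csv
RevisedText: the corrected sentence produced by your system
Example (for SampleID 1557):
1557, "He didn't eat any breakfast this morning."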
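A minimal sketch for producing the submission rows with Python's standard `csv` module. The helper name `write_submission` is hypothetical. The statement does not say whether the checker expects a space after the comma or quotes around every sentence; this sketch writes plain comma-separated fields and quotes only where CSV requires it, which is an assumption.

```python
import csv
import io

def write_submission(rows, out):
    """Write (SampleID, corrected sentence) pairs in the submission layout."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["DatapointID", "RevisedText"])
    for sample_id, revised in rows:
        writer.writerow([sample_id, revised])

# Written to an in-memory buffer here; for a real run, pass a file opened with
# open("submission.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_submission([(1557, "He didn't eat any breakfast this morning.")], buffer)
```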