📚 Similarity Oracle 🔮
Author: Mihai Nan
In the Great Kingdom Library, archivists face an increasingly difficult task:
thousands of manuscripts, scrolls, and ancient chronicles must be organized, compared, and indexed.
To assist them, the ancestors created a legendary mechanism: the Similarity Oracle. This device can determine, with remarkable accuracy, how similar two text fragments are, returning a score between 0 and 5.
However, the mechanism has deteriorated over time, and the Great Council of Archivists has decided that a modern version, built using machine learning, is needed.
For this, you are provided with two files:
- train.csv – scrolls manually annotated by scribes
- test.csv – new pairs to be analyzed by your solution
Your task is to reconstruct the Oracle through a series of analyses and a final predictive model.
📊 Dataset
Each row in train.csv and test.csv represents a pair of sentences:
- sampleID – unique identifier of the pair
- sentence1 – first text
- sentence2 – second text
- score – only in train.csv, a real value between 0 and 5 indicating similarity
The ultimate goal is to predict score for each row in test.csv.
📝 Tasks
For the first subtasks, you need to analyze the data provided in train.csv and test.csv
and extract relevant information about the test sentences.
Subtask 1 (10 points)
Compute the length of each sentence (sentence1 and sentence2) for each row in the test set
and label the pair based on the average length:
- Short if average < 50
- Medium if 50 ≤ average < 100
- Long if average ≥ 100
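A minimal sketch of this labelling step, assuming pandas and the column names listed above (the file name test.csv comes from the statement; everything else is illustrative):

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Average character length of the two sentences in each pair
avg_len = (test["sentence1"].str.len() + test["sentence2"].str.len()) / 2

# Bucket the average into the three labels; intervals are left-closed,
# so 50 falls into Medium and 100 into Long, matching the thresholds above
length_label = pd.cut(
    avg_len,
    bins=[float("-inf"), 50, 100, float("inf")],
    labels=["Short", "Medium", "Long"],
    right=False,
)
```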
Subtask 2 (15 points)
For each row in test, compute the number of words in each sentence (sentence1 and sentence2) and calculate the total using:
total_words = num_words(sentence1) + num_words(sentence2)
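One possible implementation, assuming that word counting means whitespace splitting (the statement does not pin down the tokenisation rule) and reusing the test DataFrame from the sketch above:

```python
# Total word count per pair, using simple whitespace splitting
total_words = (
    test["sentence1"].str.split().str.len()
    + test["sentence2"].str.split().str.len()
)
```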
Subtask 3 (15 points)
Determine the absolute difference between the length of sentence1 and sentence2
for each row in test.
In other words, for each row compute:
|num_chars(sentence1) - num_chars(sentence2)|
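Continuing the same sketch, this difference can be computed directly on the character lengths:

```python
# Absolute difference between the character counts of the two sentences
char_diff = (test["sentence1"].str.len() - test["sentence2"].str.len()).abs()
```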
Subtask 4 (60 points)
Build a machine learning model capable of predicting the numeric score
for each row in test.csv.
Final evaluation will be based on MAE (Mean Absolute Error), calculated as:
MAE = (1 / N) * Σ |predicted_score_i - true_score_i|
where N is the number of rows in test.csv.
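A baseline sketch, assuming scikit-learn: TF-IDF vectors for both sentences, a row-wise cosine similarity plus a length feature, and a Ridge regressor. This is only illustrative and is unlikely to reach the MAE ≤ 0.65 target on its own; a pretrained sentence encoder (allowed per the notes below) would be the more realistic route.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# One shared TF-IDF vocabulary fitted on all training sentences
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
vectorizer.fit(pd.concat([train["sentence1"], train["sentence2"]]))

def pair_features(df):
    a = vectorizer.transform(df["sentence1"])
    b = vectorizer.transform(df["sentence2"])
    # Row-wise cosine similarity between the two TF-IDF vectors
    dot = np.asarray(a.multiply(b).sum(axis=1)).ravel()
    norm_a = np.sqrt(np.asarray(a.multiply(a).sum(axis=1)).ravel())
    norm_b = np.sqrt(np.asarray(b.multiply(b).sum(axis=1)).ravel())
    cosine = dot / (norm_a * norm_b + 1e-9)
    # Simple length-based feature, as in the earlier subtasks
    len_diff = (df["sentence1"].str.len() - df["sentence2"].str.len()).abs()
    return np.column_stack([cosine, len_diff])

model = Ridge(alpha=1.0)
model.fit(pair_features(train), train["score"])
# Clip predictions to the valid score range
preds = np.clip(model.predict(pair_features(test)), 0.0, 5.0)
```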
🧮 Evaluation
Metric for Subtask 4:
- MAE (Mean Absolute Error) (the lower the value, the higher the points).
- If the model achieves MAE ≤ 0.65, the solution gets 60 points.
- If the model achieves MAE ≥ 2.0, the solution gets 0 points.
- For intermediate MAE values, points are awarded proportionally (see the sketch after this section).
For subtasks 1-3, answers are evaluated exactly.
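The statement does not give the exact formula for the proportional scoring; a minimal sketch, assuming linear interpolation between the two published thresholds:

```python
def subtask4_points(mae: float) -> float:
    """Hypothetical scoring curve: 60 points at MAE <= 0.65, 0 points at
    MAE >= 2.0, and (assumed) linear interpolation in between."""
    if mae <= 0.65:
        return 60.0
    if mae >= 2.0:
        return 0.0
    return 60.0 * (2.0 - mae) / (2.0 - 0.65)
```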
📌 Notes
- Preprocessing techniques such as tokenization, embeddings, and pretrained models are allowed.
📄 Submission Format
The submission.csv file must contain 4 rows for each test row,
corresponding to the 4 subtasks.
Structure:
subtaskID datapointID answer
Column meanings:
- subtaskID – a number from 1 to 4
- datapointID – sampleID from test
- answer – the result:
  - Subtask 1: Short/Medium/Long
  - Subtask 2: integer
  - Subtask 3: integer
  - Subtask 4: real value (predicted score)
Example for sampleID = 1714:
subtaskID datapointID answer
1 1714 Medium
2 1714 15
3 1714 3
4 1714 3.74
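A sketch of assembling this file, assuming the variables length_label, total_words, char_diff and preds from the earlier sketches, and assuming a comma-separated file with a header row (the exact delimiter and header convention are not stated):

```python
import pandas as pd

# Four rows per test pair, one for each subtask
rows = []
for sid, label, words, diff, pred in zip(
    test["sampleID"], length_label, total_words, char_diff, preds
):
    rows.append((1, sid, label))
    rows.append((2, sid, words))
    rows.append((3, sid, diff))
    rows.append((4, sid, round(float(pred), 2)))

submission = pd.DataFrame(rows, columns=["subtaskID", "datapointID", "answer"])
submission.to_csv("submission.csv", index=False)
```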
Good luck! The Great Council of Archivists counts on you to revive the Similarity Oracle! 🧙♂️