Author: Mihai Nan
In the Great Kingdom Library, archivists face an increasingly difficult task:
thousands of manuscripts, scrolls, and ancient chronicles must be organized, compared, and indexed.
To assist them, the ancestors created a legendary mechanism: the Similarity Oracle. This device can determine, with remarkable accuracy, how similar two text fragments are, returning a score between 0 and 5.
However, the mechanism has deteriorated over time, and the Great Council of Archivists has decided that a modern version, built using machine learning, is needed.
For this, you are provided with two files:
train.csv - scrolls manually annotated by scribes
test.csv - new pairs to be analyzed by your solution
Your task is to reconstruct the Oracle through a series of analyses and a final predictive model.
Each row in train.csv and test.csv represents a pair of sentences, sentence1 and sentence2.
The ultimate goal is to predict the score for each row in test.csv.
For the first three subtasks, you need to analyze the data provided in train.csv and test.csv
and extract relevant information about the test sentences.
Compute the length in characters of each sentence (sentence1 and sentence2) for each row in the test set
and label the pair based on the average length:
Short if average < 50
Medium if 50 ≤ average < 100
Long if average ≥ 100
For each row in test, compute the number of words in each sentence (sentence1 and sentence2) and calculate the total using:
total_words = num_words(sentence1) + num_words(sentence2)
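The length label and the word total described above can be sketched as follows. This is a minimal illustration, not the official reference solution: it assumes that "length" means the number of characters (`len`) and that words are whitespace-separated tokens, and the helper names `length_label` and `total_words` are hypothetical.

```python
def length_label(sentence1: str, sentence2: str) -> str:
    """Label a pair by the average character length of its two sentences.

    Assumption: length is measured in characters, as len() reports it.
    """
    average = (len(sentence1) + len(sentence2)) / 2
    if average < 50:
        return "Short"
    if average < 100:
        return "Medium"
    return "Long"


def total_words(sentence1: str, sentence2: str) -> int:
    """Sum the word counts of both sentences.

    Assumption: a word is any whitespace-separated token, so punctuation
    attached to a word is counted as part of that word.
    """
    return len(sentence1.split()) + len(sentence2.split())


# Made-up example pair, not taken from test.csv
s1 = "A man is playing a guitar."
s2 = "A person plays the guitar."
print(length_label(s1, s2), total_words(s1, s2))
```

If the grader tokenizes differently (for example by splitting on single spaces only), the word counts may differ for sentences with repeated or leading whitespace, so it is worth checking a few rows by hand.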
Determine the absolute difference between the character lengths of sentence1 and sentence2
for each row in test.
In other words, for each row compute:
|num_chars(sentence1) - num_chars(sentence2)|
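As a small sketch of the formula above, assuming `num_chars` corresponds to Python's `len` on the raw sentence string (no stripping or normalization), with a hypothetical helper name:

```python
def char_length_difference(sentence1: str, sentence2: str) -> int:
    """Absolute difference between the character counts of two sentences."""
    return abs(len(sentence1) - len(sentence2))


# Made-up example: 3 characters versus 6 characters
print(char_length_difference("abc", "abcdef"))
```

The result is symmetric, so the order of the two sentences does not matter.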
Build a machine learning model capable of predicting the numeric score
for each row in test.csv.
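Final evaluation for Subtask 4 will be based on MAE (Mean Absolute Error), calculated as:
MAE = (1/n) * Σ |y_i - ŷ_i|
where y_i is the annotated score, ŷ_i is the predicted score, and n is the number of evaluated rows.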
For subtasks 1-3, answers are evaluated exactly.
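To make the modelling subtask and its metric concrete, here is a deliberately simple baseline sketch using only the standard library. It is an illustration under stated assumptions, not the intended solution: the single feature (word-overlap Jaccard similarity), the one-variable least-squares fit, and all function names are choices made here for illustration, and the training pairs are made up. A stronger model would likely use richer text features.

```python
def jaccard(sentence1, sentence2):
    """Word-overlap similarity in [0, 1] between two sentences."""
    a = set(sentence1.lower().split())
    b = set(sentence2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All features identical: fall back to predicting the mean score
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, sentence1, sentence2):
    """Predict a score and clip it to the valid range [0, 5]."""
    raw = slope * jaccard(sentence1, sentence2) + intercept
    return min(5.0, max(0.0, raw))


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up training pairs standing in for train.csv (sentence1, sentence2, score)
train = [
    ("a cat sits on the mat", "a cat sits on the mat", 5.0),
    ("a cat sits on the mat", "a cat lies on the mat", 4.0),
    ("a cat sits on the mat", "a dog runs in the park", 1.0),
    ("a cat sits on the mat", "stock prices fell sharply today", 0.0),
]
features = [jaccard(s1, s2) for s1, s2, _ in train]
scores = [score for _, _, score in train]
slope, intercept = fit_line(features, scores)

predictions = [predict(slope, intercept, s1, s2) for s1, s2, _ in train]
print("training MAE:", mean_absolute_error(scores, predictions))
```

Measuring MAE on a held-out part of train.csv, rather than on the rows used for fitting as in this toy example, gives a more honest estimate of the score to expect on test.csv.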
The submission.csv file must contain 4 rows for each test row,
corresponding to the 4 subtasks.
Structure:
subtaskID datapointID answer
Column meanings:
subtaskID – a number from 1 to 4
datapointID – sampleID from test
answer – the result for that subtask: Short / Medium / Long for subtask 1, a number for subtasks 2-4
Example for sampleID = 1714:
subtaskID datapointID answer
1 1714 Medium
2 1714 15
3 1714 3
4 1714 3.74
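The four-rows-per-sample layout shown in the example can be produced along these lines. This is a sketch under assumptions: the comma delimiter is assumed from the .csv extension, the answers are taken as already computed, and the helper names `submission_rows` and `write_submission` are hypothetical.

```python
import csv
import io


def submission_rows(sample_id, label, total_words, char_diff, score):
    """Return the four (subtaskID, datapointID, answer) rows for one test row."""
    answers = [label, total_words, char_diff, score]
    return [(subtask_id, sample_id, answer)
            for subtask_id, answer in enumerate(answers, start=1)]


def write_submission(rows, handle):
    """Write the header and all rows to an open text file or buffer."""
    writer = csv.writer(handle)
    writer.writerow(["subtaskID", "datapointID", "answer"])
    writer.writerows(rows)


# Reproduce the example for sampleID = 1714 in an in-memory buffer
buffer = io.StringIO()
rows = submission_rows(1714, "Medium", 15, 3, 3.74)
write_submission(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, the same function can be called with a handle from `open("submission.csv", "w", newline="")`, extending `rows` with the four rows of every test sample.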
Good luck! The Great Council of Archivists counts on you to revive the Similarity Oracle! 🧙‍♂️