Metrics

Metrics used for the evaluation of results

We evaluate the three subtasks (ATE, ABSA and SA) separately by comparing the results provided by the participant systems to the gold standard annotations of the test set, as follows. The Python evaluation script will be released together with the training and development sets.

TASK 1 - ATE: Aspect Term Extraction

For the ATE task, we compute Precision, Recall and F1-score defined as:

\(F1_{a} = \frac{2 P_a R_a}{P_a + R_a}\)

\(P_a = \frac{|S_{a} \cap G_a|+0.5*|PAR_a|}{|S_a|}\)

\(R_a = \frac{|S_a \cap G_a|+0.5*|PAR_a|}{|G_a|}\)

Here \(S_a\) is the set of aspect term annotations that a system returned for all the test sentences, \(G_a\) is the set of the gold (correct) aspect term annotations and \(PAR_a\) is the set of partial matches (predicted and gold aspect terms have some overlapping text).

For instance, if a review is labeled in the gold standard with the two aspect terms \(G_a=\{\textit{costruzione}, \textit{mantenere la temperatura}\}\), and the system predicts the two aspects \(S_a=\{\textit{costruzione},\textit{temperatura}\}\), we have that \(|S_{a} \cap G_a|=1\), \(|PAR_a|=1\), \(|G_{a}|=2\) and \(|S_{a}|=2\), so that \(P_a=\frac{1.5}{2}=0.75\), \(R_a=\frac{1.5}{2}=0.75\) and \(F1_a=\frac{2 \cdot 0.75 \cdot 0.75}{0.75+0.75}=0.75\).
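The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch, not the official evaluation script: the function names `shares_token` and `ate_scores` are hypothetical, and the partial-match rule (two terms share at least one whitespace-separated token) is an assumption, since the text only says that partial matches "have some overlapping text". The sketch also scores a single review, whereas the metric is defined over all test sentences.

```python
from typing import Set, Tuple


def shares_token(a: str, b: str) -> bool:
    """Assumed partial-match rule: the two terms share at least one token."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def ate_scores(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) with half credit for partial matches."""
    exact = predicted & gold
    unmatched_gold = gold - exact
    # Predicted terms that are not exact matches but overlap an unmatched gold term.
    partial = {
        p for p in predicted - exact
        if any(shares_token(p, g) for g in unmatched_gold)
    }
    credit = len(exact) + 0.5 * len(partial)
    precision = credit / len(predicted) if predicted else 0.0
    recall = credit / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {"costruzione", "mantenere la temperatura"}
predicted = {"costruzione", "temperatura"}
print(ate_scores(predicted, gold))  # (0.75, 0.75, 0.75)
```

The sketch does not enforce a one-to-one alignment between predicted and gold terms, so several predictions overlapping the same gold term would each earn half credit; the released script may handle this case differently.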

Only the F1 score will be used to rank the participants in the final leaderboard.

TASK 2 - ABSA: Aspect-based Sentiment Analysis

For the ABSA task (Task 2), we will evaluate the entire chain, considering both the aspect terms detected in the sentences and their corresponding polarities, in the form of \((aspect, polarity)\) pairs. We again compute Precision, Recall and F1-score, now defined as:

\(F1_{p} = \frac{2 P_p R_p}{P_p + R_p}\)

\(P_p = \frac{|S_{p} \cap G_p|+0.5*|PAR_p|}{|S_p|}\)

\(R_p = \frac{|S_p \cap G_p|+0.5*|PAR_p|}{|G_p|}\)

where \(S_p\) is the set of \((aspect, polarity)\) pairs that a system returned for all the test sentences, \(G_p\) is the set of the gold (correct) pair annotations and \(PAR_p\) is the set of \((aspect,polarity)\) pairs with a partial match.

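For instance, if a review is labeled in the gold standard with the pairs \(G_p=\{(\textit{mantenere la temperatura}, POS), (\textit{costruzione}, POS)\}\), and the system predicts the three pairs \(S_p=\{(\textit{temperatura}, NEG), (\textit{costruzione}, POS),(\textit{acquisto}, POS)\}\), we have that \(|S_{p} \cap G_p|=1\), \(|PAR_p|=0\), \(|G_{p}|=2\) and \(|S_{p}|=3\), so that \(P_p=\frac{1}{3}\), \(R_p=\frac{1}{2}\) and \(F1_p=\frac{2 \cdot \frac{1}{3} \cdot \frac{1}{2}}{\frac{1}{3}+\frac{1}{2}}=0.4\). Note that \((\textit{temperatura}, NEG)\) is not counted as a partial match even though its aspect overlaps a gold aspect, apparently because its polarity differs from the gold one.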
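The ABSA example can be checked with a similar Python sketch. As before, this is not the official script: `shares_token` and `absa_scores` are hypothetical names, and the partial-match rule is an assumption chosen to agree with the example, namely that a predicted pair is a partial match only when its polarity equals the gold polarity and the two aspect terms share at least one token.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (aspect term, polarity label)


def shares_token(a: str, b: str) -> bool:
    """Assumed overlap rule: the two aspect terms share at least one token."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def absa_scores(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over (aspect, polarity) pairs."""
    exact = predicted & gold
    unmatched_gold = gold - exact
    # Assumption: a partial match needs the same polarity and overlapping aspect text.
    partial = {
        (term, pol) for term, pol in predicted - exact
        if any(pol == g_pol and shares_token(term, g_term)
               for g_term, g_pol in unmatched_gold)
    }
    credit = len(exact) + 0.5 * len(partial)
    precision = credit / len(predicted) if predicted else 0.0
    recall = credit / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {("mantenere la temperatura", "POS"), ("costruzione", "POS")}
predicted = {("temperatura", "NEG"), ("costruzione", "POS"), ("acquisto", "POS")}
p, r, f1 = absa_scores(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Under this assumed rule, changing the first prediction to `("temperatura", "POS")` would add one partial match and raise the credit from 1 to 1.5.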

Only the F1 score will be used to rank the participants in the final leaderboard.

TASK 3 - SA: Sentiment Analysis

For the SA task (Task 3), we evaluate the overall polarity of the reviews by computing the Root Mean Squared Error (RMSE):

\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\frac{d_i - f_i}{\sigma_i}\Big)^2}}\)
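The formula can be transcribed directly into Python. This section does not define \(d_i\), \(f_i\) and \(\sigma_i\), so the sketch below makes no claim about their meaning and simply treats them as three equal-length numeric sequences. The function name `normalized_rmse` and the input values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math
from typing import Sequence


def normalized_rmse(d: Sequence[float], f: Sequence[float],
                    sigma: Sequence[float]) -> float:
    """Compute sqrt((1/n) * sum(((d_i - f_i) / sigma_i) ** 2))."""
    n = len(d)
    if n == 0 or len(f) != n or len(sigma) != n:
        raise ValueError("d, f and sigma must be non-empty and of equal length")
    total = sum(((di - fi) / si) ** 2 for di, fi, si in zip(d, f, sigma))
    return math.sqrt(total / n)


# Illustrative values only (not taken from the task data).
print(normalized_rmse([3.0, 1.0], [1.0, 3.0], [2.0, 2.0]))  # 1.0
```

In the example each scaled difference is \(\pm 1\), so the mean of the squares is 1 and the result is 1.0.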