For the ATE task, we compute Precision, Recall and F1-score defined as:
\(F1_{a} = \frac{2 P_a R_a}{P_a + R_a}\)
\(P_a = \frac{|S_a \cap G_a| + 0.5 \cdot |PAR_a|}{|S_a|}\)
\(R_a = \frac{|S_a \cap G_a| + 0.5 \cdot |PAR_a|}{|G_a|}\)
Here \(S_a\) is the set of aspect term annotations that a system returned for all the test sentences, \(G_a\) is the set of the gold (correct) aspect term annotations and \(PAR_a\) is the set of partial matches (predicted and gold aspect terms have some overlapping text).
For instance, if a review is labeled in the gold standard with the two aspect terms \(G_a=\{\textit{costruzione}, \textit{mantenere la temperatura}\}\), and the system predicts the two aspects \(S_a=\{\textit{costruzione},\textit{temperatura}\}\), we have that \(|S_{a} \cap G_a|=1, |PAR_a|=1, |G_{a}|=2\) and \(|S_{a}|=2\), so that \(P_a=\frac{1.5}{2}=0.75\), \(R_a=\frac{1.5}{2}=0.75\) and \(F1_a=\frac{2 \cdot 0.75 \cdot 0.75}{0.75+0.75}=0.75\).
Only the F1 score will be used to rank the participants' results in the final leaderboard.
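The ATE worked example can be reproduced with a short Python sketch. The function name `ate_scores` is hypothetical, and the partial-match rule used here (a non-exact prediction counts as partial when it shares at least one whitespace-separated token with an unmatched gold term) is only an assumption about what "overlapping text" means; the official scorer may define overlap differently. For brevity the sketch scores a single review rather than all test sentences.

```python
def ate_scores(predicted, gold):
    """Return (precision, recall, F1), weighting partial matches by 0.5."""
    predicted, gold = set(predicted), set(gold)
    exact = predicted & gold
    unmatched_gold = gold - exact
    # Assumption: token overlap with an unmatched gold term = partial match.
    partial = {
        p for p in predicted - exact
        if any(set(p.split()) & set(g.split()) for g in unmatched_gold)
    }
    hits = len(exact) + 0.5 * len(partial)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Values from the worked example in the text.
gold_terms = {"costruzione", "mantenere la temperatura"}
system_terms = {"costruzione", "temperatura"}
p_a, r_a, f1_a = ate_scores(system_terms, gold_terms)
```

With these inputs the sketch yields one exact match and one partial match, so all three scores come out at 0.75, in line with the example.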
For the ABSA task (Task 2), we evaluate the entire chain, considering the aspect terms detected in the sentences together with their corresponding polarity, in the form of \((aspect, polarity)\) pairs. We again compute Precision, Recall and F1-score, now defined as:
\(F1_{p} = \frac{2 P_p R_p}{P_p + R_p}\)
\(P_p = \frac{|S_p \cap G_p| + 0.5 \cdot |PAR_p|}{|S_p|}\)
\(R_p = \frac{|S_p \cap G_p| + 0.5 \cdot |PAR_p|}{|G_p|}\)
where \(S_p\) is the set of \((aspect, polarity)\) pairs that a system returned for all the test sentences, \(G_p\) is the set of the gold (correct) pair annotations and \(PAR_p\) is the set of \((aspect,polarity)\) pairs with a partial match.
For instance, if a review is labeled in the gold standard with the pairs \(G_p=\{(\textit{mantenere la temperatura}, POS), (\textit{costruzione}, POS)\}\), and the system predicts the three pairs \(S_p=\{(\textit{temperatura}, NEG), (\textit{costruzione}, POS),(\textit{acquisto}, POS)\}\), we have that \(|S_{p} \cap G_p|=1\), \(|PAR_p|=0\), \(|G_{p}|=2\) and \(|S_{p}|=3\), so that \(P_p=\frac{1}{3}\), \(R_p=\frac{1}{2}\) and \(F1_p=0.4\). Note that \((\textit{temperatura}, NEG)\) overlaps a gold aspect term but carries a different polarity, so it does not count as a partial match.
Only the F1 score will be used to rank the participants' results in the final leaderboard.
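The pair-level scores can be sketched in the same way. The function name `absa_scores` is hypothetical. The sketch assumes that a predicted pair is a partial match only when its polarity equals that of an unmatched gold pair and the two aspect terms share at least one token; this reading is consistent with the worked example, where \(|PAR_p|=0\), but it remains an assumption about the official scorer.

```python
def absa_scores(predicted, gold):
    """Return (precision, recall, F1) over (aspect, polarity) pairs."""
    predicted, gold = set(predicted), set(gold)
    exact = predicted & gold
    unmatched_gold = gold - exact
    # Assumption: partial match = same polarity AND overlapping aspect tokens.
    partial = {
        (aspect, polarity)
        for aspect, polarity in predicted - exact
        if any(
            polarity == g_polarity
            and set(aspect.split()) & set(g_aspect.split())
            for g_aspect, g_polarity in unmatched_gold
        )
    }
    hits = len(exact) + 0.5 * len(partial)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Values from the worked example in the text.
gold_pairs = {("mantenere la temperatura", "POS"), ("costruzione", "POS")}
system_pairs = {
    ("temperatura", "NEG"),
    ("costruzione", "POS"),
    ("acquisto", "POS"),
}
p_p, r_p, f1_p = absa_scores(system_pairs, gold_pairs)
```

Here only \((\textit{costruzione}, POS)\) is counted, giving \(P_p=\frac{1}{3}\), \(R_p=\frac{1}{2}\) and \(F1_p=0.4\) up to floating-point rounding.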
For the SA task (Task 3), we evaluate the overall polarity of the reviews by computing the Root Mean Squared Error \((RMSE)\): \( RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\frac{d_i -f_i}{\sigma_i}\Big)^2}}\)
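A literal transcription of this formula into Python is given below as a minimal sketch. The function name `scaled_rmse` and the argument names `d`, `f` and `sigma` are hypothetical and simply mirror the symbols \(d_i\), \(f_i\) and \(\sigma_i\); the sketch makes no assumption about what these quantities represent beyond their role in the formula.

```python
import math


def scaled_rmse(d, f, sigma):
    """Compute sqrt((1/n) * sum(((d_i - f_i) / sigma_i) ** 2))."""
    if not d or not (len(d) == len(f) == len(sigma)):
        raise ValueError("d, f and sigma must be non-empty and equally long")
    squared = (((di - fi) / si) ** 2 for di, fi, si in zip(d, f, sigma))
    return math.sqrt(sum(squared) / len(d))


# Toy inputs chosen only to make the arithmetic easy to follow:
# ((3 - 1) / 1)^2 = 4 and ((5 - 1) / 2)^2 = 4, mean 4, square root 2.
example = scaled_rmse([3.0, 5.0], [1.0, 1.0], [1.0, 2.0])
```

The toy inputs are not taken from the task data; they only illustrate how each term is scaled by its \(\sigma_i\) before averaging.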