The data source chosen for creating the datasets is one of the most popular eCommerce website in Italy.
The platform allows users to share their opinions about items bought through a textual review and a final score of satisfaction. Therefore, the website provides a large number of reviews in many languages, including Italian.
We have collected 4364 real-life user reviews, written in the Italian language, released on a famaous eCommerce platform about 23 products. The training, dev and test sets is randomly generated in the portion: 70% training, 2.5% dev, 27.5% test set. This mean that the test set will be not out-of-domain.
The collected data will be manually annotated by, at least, three different subjects using the inter-agreement metric as the value of quality of the annotation. We do not provide with them a unique id that could be used to retrieve more information about the writer and consequently, we do not violate copyrights and/or we do not have privacy issues. Furthermore, in order not to harm the interests of the manufacturers, we will not disclose specific information about the specific item for which the review has been issued. At the end of the annotation process, we will obtain the gold standard dataset with the associations among sentence, aspects, sentiment and scores.
Fig 1. Data Annotation example
The data format used is NDJSON (http://ndjson.org/) with UTF-8 encoding and newline as delimiter. Note that some reviews may not contain any aspect, but the final review score is always available. An example of annotated data is provided in the code above.