Enhanced Visual Word Sense Disambiguation for Italian

University of Bari
EVALITA 2026

Introduction

Enhanced-VWSD for Italian is a task of EVALITA 2026. The task is based on the VWSD task proposed at SemEval 2023. In this task, the objective is to select an image, out of ten possible candidates, which correctly represents the sense of a target word in an input sentence. The sentence consists of a word of interest (that is, the target to disambiguate) and additional context words to support disambiguation.

We propose a new task that combines high-level and fine-grained semantics: the goal is not only to identify the broad sense of the target word, but also to recognise its specific sense accurately. To this end, we use co-hyponyms extracted from a semantic network to find hard negatives, that is, images that are related to the general sense of the target word but represent a different specific sense. The candidate set therefore mixes two types of distractor images: (1) images for other senses of the target word (the same as in the original VWSD challenge); (2) images that share the same broad sense as the target word.

Example of instance for EVWSD-ITA (limited to seven images rather than ten for visualization purposes)

Data and Further Information

Task Details: Given a query consisting of three or more words, find the image that the query describes among the ten candidates.

Query Details: To create the query we combine: the lemma of the correct synset, one of the lemmas of the hypernym of the correct synset, and a word from the gloss of the correct synset. This process is done manually for the test set and automatically for the train set.
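
The automatic query construction described above can be sketched roughly as follows. This is only an illustration, not the organisers' pipeline: the function name `build_query`, the stopword list, the word order in the query, and the random choice of the hypernym lemma and gloss word are all assumptions made for the example.

```python
import random

# Hypothetical stopword list used only for this sketch; the filtering
# applied when the data set was built (if any) is not documented.
STOPWORDS = {
    "il", "lo", "la", "i", "gli", "le", "un", "una", "di", "del", "della",
    "a", "da", "dal", "dalla", "in", "con", "su", "per", "che", "e",
}


def build_query(lemma, hypernym_lemmas, gloss, rng=None):
    """Combine a synset lemma, one hypernym lemma and one gloss word.

    The selection strategy (uniform random choice) and the order of the
    three parts are assumptions, not the documented procedure.
    """
    if rng is None:
        rng = random.Random(0)
    hypernym = rng.choice(hypernym_lemmas)
    # Tokenise the gloss naively and drop punctuation and casing.
    gloss_words = [w.strip(".,;:()").lower() for w in gloss.split()]
    # Keep only words that add context beyond the lemma and the hypernym.
    excluded = STOPWORDS | {lemma.lower(), hypernym.lower()}
    candidates = [w for w in gloss_words if w and w not in excluded]
    if not candidates:
        raise ValueError("the gloss contains no usable context word")
    return f"{lemma} {hypernym} {rng.choice(candidates)}"


# Toy example with made-up synset data (not taken from the data set).
query = build_query(
    "pesca", ["frutto"], "Frutto del pesco, dalla polpa succosa", random.Random(0)
)
print(query)
```

The output always starts with the target lemma followed by the hypernym lemma; the third word depends on the random generator passed in.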

Image Details: All images will be resized to 336x336 pixels to strike a balance between effectiveness and efficiency.

Use of External Data: Usage of data from other sources is allowed.
UPDATE: The training set and the upcoming test set were created by leveraging BabelNet. To limit possible train-test contamination, we ask participants not to use BabelNet as a data source for augmenting the training set.

Copyright: Our work leverages BabelNet, so the data is subject to its license. The BabelNet Non-Commercial License allows users to share and adapt the data, as long as they belong to a research institution.

Evaluation metrics: HIT@1 and MRR. We will have two separate leaderboards, one for each metric.
Given r = [r_1, ..., r_n], where n is the cardinality of the test set and r_i is the rank of the correct image in the output of the model, MRR is defined as:

$$ MRR = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{r_i}} $$

This metric evaluates the quality of the ranking: the closer r_i is to 1 (i.e., the first position in the ranking), the better the result.

HIT@1 is defined as:

$$ HIT@1 = \frac{1}{n} \sum_{i=1}^{n}{I(r_i)} $$

where I is an indicator function that returns 1 if r_i = 1 (i.e., the correct image is ranked first) and 0 otherwise. This metric therefore assesses the model's ability to select the correct image as its top candidate.
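
As an illustration, both metrics can be computed from the list of ranks with a few lines of Python. This is a minimal sketch and not the official scorer: the function names `rank_of_gold`, `mrr` and `hit_at_1`, as well as the image file names in the example, are hypothetical.

```python
def rank_of_gold(ranking, gold):
    """Return the 1-based position of the gold image in a ranked list."""
    return ranking.index(gold) + 1


def mrr(ranks):
    """Mean Reciprocal Rank: the average of 1 / r_i over all instances."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty list of integers >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_1(ranks):
    """HIT@1: the fraction of instances whose gold image is ranked first."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty list of integers >= 1")
    return sum(1 for r in ranks if r == 1) / len(ranks)


# Toy example: four test instances with the gold image at ranks 1, 2, 4, 1.
ranks = [1, 2, 4, 1]
print(mrr(ranks))       # (1 + 1/2 + 1/4 + 1) / 4 = 0.6875
print(hit_at_1(ranks))  # 2 of 4 instances ranked first = 0.5

# Deriving a rank from a system ranking (hypothetical file names).
print(rank_of_gold(["img3.jpg", "img7.jpg", "img1.jpg"], "img7.jpg"))  # 2
```

Since HIT@1 only rewards a correct top prediction while MRR also gives partial credit for near misses, two systems with the same HIT@1 can still differ in MRR.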

Training data has been released! You can find it at the following link.

BibTeX

TBD

Acknowledgements

TBD