At DAIH we advance Artificial Intelligence (AI) methodologies, techniques, and tools, focusing, among other areas, on their application to the Digital Humanities (e.g., (Lorenzini et al., 2021)), particularly to languages and literatures. Our primary research streams encompass the development of tools and techniques for:
- Natural language processing and the extraction of high-quality knowledge from textual resources (e.g., (Bombieri et al., 2023; Rospocher, 2021; Rospocher & Corcoglioniti, 2020));
- The establishment of robust representations of extracted knowledge and automated reasoning for deriving new insights (e.g., (Corcoglioniti et al., 2016; Rospocher et al., 2016));
- The development of resources and applications in accordance with Semantic Web and Linked Data best practices (e.g., (Rospocher et al., 2019));
- Effective information and document retrieval (e.g., (Rospocher et al., 2019)).
Furthermore, we explore the influence of AI and Natural Language Processing (NLP) on diversity, equity, and inclusion, an area of growing importance in contemporary computer science and AI research. On the one hand, we scrutinize the challenges that AI techniques may introduce or exacerbate with respect to inclusion, examining the social biases (such as those related to gender, race, religion, or disability) present in computational models trained on large datasets. On the other hand, we develop and implement concrete tools, methods, technologies, and computational models that automate tasks fostering inclusion and enhancing content accessibility.
We also analyze biases towards Artificial Intelligence itself. For instance, we study readers' reactions to and perceptions of narratives authored by generative AI language models, seeking to understand how readers assess such texts in terms of likability, emotional resonance, artistic merit, and inclusivity.
References
- Matteo Lorenzini, Marco Rospocher, Sara Tonelli. International Journal on Digital Libraries, 2021.
Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library “Cultura Italia” and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1 0.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.
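As an illustrative aside (not the classifiers or dataset from the paper), framing description-quality assessment as binary text classification can be sketched with a generic TF-IDF plus logistic-regression pipeline; the toy descriptions and labels below are invented.

```python
# Minimal sketch of metadata-quality assessment framed as binary text
# classification (illustrative only; not the system evaluated in the paper).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data: 1 = description compliant with cataloguing guidelines, 0 = not.
descriptions = [
    "Oil on canvas depicting a rural landscape, attributed to an 18th-century painter.",
    "img_0042.jpg",
    "Marble bust of a Roman emperor, restored in the 19th century, provenance documented.",
    "see above",
]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(descriptions, labels)

print(clf.predict(["Bronze statuette of a dancer, early 20th century, signed by the artist."]))
```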
- Marco Bombieri, Marco Rospocher, Simone Paolo Ponzetto, Paolo Fiorini. Computers in Biology and Medicine, 2023.
The automatic extraction of procedural surgical knowledge from surgery manuals, academic papers or other high-quality textual resources is of the utmost importance to develop knowledge-based clinical decision support systems, to automatically execute some procedure steps, or to summarize the procedural information, spread throughout the texts, in a structured form usable as a study resource by medical students. In this work, we propose a first benchmark on extracting detailed surgical actions from available intervention procedure textbooks and papers. We frame the problem as a Semantic Role Labeling task. Exploiting a manually annotated dataset, we apply different Transformer-based information extraction methods. Starting from RoBERTa and BioMedRoBERTa pre-trained language models, we first investigate a zero-shot scenario and compare the obtained results with a full fine-tuning setting. We then introduce a new ad-hoc surgical language model, named SurgicBERTa, pre-trained on a large collection of surgical materials, and we compare it with the previous ones. In the assessment, we explore different dataset splits (one in-domain and two out-of-domain) and we also investigate the effectiveness of the approach in a few-shot learning scenario. Performance is evaluated on three correlated sub-tasks: predicate disambiguation, semantic argument disambiguation and predicate-argument disambiguation. Results show that the fine-tuning of a pre-trained domain-specific language model achieves the highest performance on all splits and on all sub-tasks. All models are publicly released.
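Purely to illustrate the task framing (this is not SurgicBERTa, the benchmark code, or a trained model), Semantic Role Labeling can be cast as token classification over BIO-style predicate-argument labels on top of a RoBERTa encoder; the label set and sentence below are hypothetical, and the classification head here is untrained, so the printed labels are meaningless without fine-tuning on annotated data.

```python
# Sketch: SRL framed as token classification with a RoBERTa encoder
# (illustrative framing only; the untrained head yields meaningless labels).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PRED", "B-ARG1", "I-ARG1", "B-ARG2", "I-ARG2"]  # hypothetical BIO label set
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=len(labels))

sentence = "Grasp the gallbladder fundus and retract it cephalad."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred_ids):
    print(f"{token:15s} {labels[pred]}")
```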
- Marco Rospocher. Expert Systems with Applications, 2021.
In this paper, we investigate the problem of automatically detecting explicit song lyrics, i.e., determining if the lyrics of a given song could be offensive or unsuitable for children. The problem can be framed as a binary classification task, and in this work we propose to tackle it with the fastText classifier, an efficient linear classification model leveraging a peculiar distributional text representation that, by exploiting subword information in building the embeddings of the words, enables it to cope with words not seen at training time. We assess the performance of the fastText classifier and word representations with a lyrics dataset of over 800K songs, annotated with explicitness information, that we assembled from publicly available resources. The evaluation shows that the fastText classifier is effective for explicit lyrics detection, substantially outperforming a reference approach for the task, and that the subword information effectively contributes to this result.
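For illustration only (not the dataset, hyperparameters, or evaluation from the paper), a fastText supervised classifier of this kind is trained from a plain-text file with `__label__` prefixes; the lyrics lines and settings below are invented.

```python
# Sketch of explicit-lyrics detection with fastText supervised classification
# (toy data and guessed hyperparameters; not the paper's 800K-song setup).
import fasttext

# fastText expects one "__label__<class> <text>" example per line.
train_lines = [
    "__label__explicit toy lyric line with offensive wording",
    "__label__clean gentle toy lyric about the sea and the sky",
]
with open("lyrics.train", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines) + "\n")

model = fasttext.train_supervised(
    input="lyrics.train",
    epoch=25,
    wordNgrams=2,
    minn=2,   # subword character n-grams help with words unseen at training time
    maxn=5,
)
print(model.predict("another toy lyric line"))  # -> (labels, probabilities)
```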
- Marco Rospocher, Francesco Corcoglioniti. Journal of Web Semantics, 2020.
In this work we address the problem of extracting quality entity knowledge from natural language text, an important task for the automatic construction of knowledge graphs from unstructured content. More in detail, we investigate the benefit of performing a joint posterior revision, driven by ontological background knowledge, of the annotations resulting from natural language processing (NLP) entity analyses such as named entity recognition and classification (NERC) and entity linking (EL). The revision is performed via a probabilistic model, called jpark, that, given the candidate annotations independently identified by NERC and EL tools on the same textual entity mention, reconsiders the best annotation choice performed by the tools in light of the coherence of the candidate annotations with the ontological knowledge. The model can be explicitly instructed to handle the information that an entity can potentially be NIL (i.e., lacking a corresponding referent in the target linking knowledge base), exploiting it for predicting the best NERC and EL annotation combination. We present a comprehensive evaluation of jpark along various dimensions, comparing its performance with and without exploiting NIL information, as well as the usage of three different background knowledge resources (YAGO, DBpedia, and Wikidata) to build the model. The evaluation, conducted using different tools (the popular Stanford NER and DBpedia Spotlight, as well as the more recent Flair NER and End-to-End Neural EL) with three reference datasets (AIDA, MEANTIME, and TAC-KBP), empirically confirms the capability of the model to improve the quality of the annotations of the given tools, and thus their performance on the tasks they are designed for.
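As a deliberately simplified illustration of the underlying idea (not the jpark probabilistic model), one can re-rank the candidate NERC and EL annotations of a mention by combining the tools' confidence scores with a coherence term derived from background knowledge; all scores, entities, and type mappings below are invented.

```python
# Toy re-ranking of candidate NERC/EL annotations by ontological coherence
# (a simplified illustration of the idea, not the jpark model).
from itertools import product

# Candidate annotations for the mention "Paris" (invented confidence scores).
nerc_candidates = {"LOC": 0.55, "PER": 0.45}
el_candidates = {"dbr:Paris": 0.60, "dbr:Paris_Hilton": 0.35, None: 0.05}  # None ~ NIL

# Toy background knowledge: linked entity -> compatible NERC types.
kb_types = {"dbr:Paris": {"LOC"}, "dbr:Paris_Hilton": {"PER"}}

def coherence(nerc_type, entity):
    if entity is None:                        # NIL carries no type constraint
        return 1.0
    return 1.0 if nerc_type in kb_types.get(entity, set()) else 0.1

best_type, best_entity = max(
    product(nerc_candidates, el_candidates),
    key=lambda c: nerc_candidates[c[0]] * el_candidates[c[1]] * coherence(*c),
)
print(best_type, best_entity)  # -> LOC dbr:Paris
```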
- Francesco Corcoglioniti, Marco Rospocher, Alessio Palmero Aprosio. IEEE Transactions on Knowledge and Data Engineering, 2016.
We present an approach for ontology population from natural language English texts that extracts RDF triples according to FrameBase, a Semantic Web ontology derived from FrameNet. Processing is decoupled in two independently tunable phases. First, text is processed by several NLP tasks, including Semantic Role Labeling (SRL), whose results are integrated in an RDF graph of mentions, i.e., snippets of text denoting some entity/fact. Then, the mention graph is processed with SPARQL-like rules using a specifically created mapping resource from NomBank/PropBank/FrameNet annotations to FrameBase concepts, producing a knowledge graph whose content is linked to DBpedia and organized around semantic frames, i.e., prototypical descriptions of events and situations. A single RDF/OWL representation is used where each triple is related to the mentions/tools it comes from. We implemented the approach in PIKES, an open source tool that combines two complementary SRL systems and provides a working online demo. We evaluated PIKES on a manually annotated gold standard, assessing precision/recall in (i) populating the FrameBase ontology, and (ii) extracting semantic frames modeled after standard predicate models, for comparison with state-of-the-art tools for the Semantic Web. We also evaluated (iii) sampled precision and execution times on a large corpus of 110K Wikipedia-like pages.
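To give a flavor of the rule-based mapping step (a hand-made toy, not PIKES or its actual NomBank/PropBank/FrameNet-to-FrameBase mapping resource), the sketch below builds a tiny RDF mention graph with rdflib and applies a SPARQL CONSTRUCT rule that turns an SRL annotation into frame-style triples; the namespace and all properties are invented placeholders.

```python
# Toy mention-graph-to-knowledge-graph mapping with a SPARQL CONSTRUCT rule
# (illustrative only; namespace and properties are invented, not FrameBase).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# A minimal "mention graph": one SRL annotation on a snippet of text.
g.add((EX.mention1, EX.anchorOf, Literal("Goethe wrote Faust")))
g.add((EX.mention1, EX.srlPredicate, Literal("write.01")))
g.add((EX.mention1, EX.srlArg0, EX.Goethe))
g.add((EX.mention1, EX.srlArg1, EX.Faust))

# A rule mapping that SRL pattern to a frame-style representation.
rule = """
PREFIX ex: <http://example.org/>
CONSTRUCT {
    ?m ex:denotesFrame ex:Creating .
    ?m ex:creator ?a0 .
    ?m ex:createdWork ?a1 .
}
WHERE {
    ?m ex:srlPredicate "write.01" ;
       ex:srlArg0 ?a0 ;
       ex:srlArg1 ?a1 .
}
"""
kg = Graph()
for triple in g.query(rule):   # CONSTRUCT results iterate as triples
    kg.add(triple)
print(kg.serialize(format="turtle"))
```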
- Marco Rospocher, Marieke van Erp, Piek Vossen, Antske Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa, Thomas Ploeger, Tessel Bogaard. Web Semantics: Science, Services and Agents on the World Wide Web, 2016.
- Marco Rospocher, Francesco Corcoglioniti, Alessio Palmero Aprosio. Language Resources and Evaluation, 2019.
- Marco Rospocher, Francesco Corcoglioniti, Mauro Dragoni. Semantic Web - Interoperability, Usability, Applicability, 2019.