Module B: Text summarization system: COMPENDIUM (Resp. E. Lloret)

A summary can be defined as “a coherent text that contains the overall gist of a document being shorter than the original one” (Mani, 2001). Building on the team’s expertise and previous research in automatic summarization, outlined in Section 2 “The team’s main contributions that support the proposal”, and on recent advances in the state of the art for this task (Zamuda and Lloret, 2020; Vicente and Lloret, 2020; Barros et al., 2019), the main goal of this module is to adapt the COMPENDIUM tool (Lloret, Romá-Ferri and Palomar, 2013) to summarize administrative texts written in Spanish.

The COMPENDIUM tool resulted from the summarization approach proposed in the PhD thesis “Text Summarisation based on Human Language Technologies and its Applications” (Lloret, 2011). In short, COMPENDIUM relies on a modular summarization approach that can generate different types of summaries automatically. Concerning the input, COMPENDIUM can take either one or several texts and produce single- or multi-document summaries, respectively. Regarding their purpose, the resulting summaries can be generic, query-focused or sentiment-based, and they are informative, that is, they aim to provide information about the source document(s). As output, the final summaries can be pure extracts or a combination of extractive and abstractive information. To show its potential and validity, COMPENDIUM has been successfully applied to several domains and tasks (Lloret and Palomar, 2013). The intellectual property of COMPENDIUM is registered and protected. It currently has a Technology Readiness Level (TRL) of around 2-3, and it can be tested through an online demo (http://gplsi.dlsi.ua.es/demos/compendium/). However, COMPENDIUM was originally developed for English, so it still needs to be adapted to other domains and languages, such as Spanish public sector documents, as proposed for this project. Therefore, the following two tasks are included in this module.

Task B1. Adapting COMPENDIUM to summarize documents of the public administrations (Resp. E. Lloret)

The main objective of this task is to adapt and fine-tune COMPENDIUM’s approach to enable effective summarization of Spanish text documents generated by public sector organizations. The summaries to be generated will be abstractive, which means that the relevant information will not only be extracted but also paraphrased using different vocabulary and structures, while guaranteeing that the original meaning is preserved in the generated text.

To summarize the text, we will take as a basis the core stages originally developed in COMPENDIUM, which constitute the backbone of the summarization process. These stages are as follows: a) surface linguistic analysis; b) redundancy detection; c) topic identification; d) relevance detection; and e) summary generation. Then, to enhance the capabilities of the approach, an additional stage of “information compression and fusion”, responsible for generating abstractive summaries, can also be integrated.
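
The five core stages listed above can be illustrated with a minimal Python sketch. It is not COMPENDIUM's implementation: the function names (`analyze`, `drop_redundant`, `summarize`), the tiny stopword list, the Jaccard-overlap threshold and the frequency-based scoring are all assumptions, chosen as simple stand-ins for each stage. The sketch is purely extractive and does not cover the optional compression and fusion stage.

```python
import re
from collections import Counter

# Tiny illustrative Spanish stopword list (an assumption, not a real resource).
STOPWORDS = {"el", "la", "los", "las", "de", "del", "en", "y", "a", "que",
             "se", "un", "una", "por", "para", "con", "al", "su", "es"}


def analyze(text):
    """(a) Surface analysis: split into sentences and lowercase content words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [[w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
              for s in sentences]
    return sentences, tokens


def drop_redundant(tokens, threshold=0.6):
    """(b) Redundancy detection: return indices of sentences that do not
    overlap too much (Jaccard similarity) with an earlier kept sentence."""
    kept = []
    for i, toks in enumerate(tokens):
        current = set(toks)
        if not current:
            continue  # skip sentences with no content words
        redundant = any(
            len(current & set(tokens[j])) / len(current | set(tokens[j])) >= threshold
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept


def summarize(text, max_sentences=2):
    sentences, tokens = analyze(text)
    kept = drop_redundant(tokens)
    # (c) Topic identification: term frequencies over non-redundant sentences.
    topics = Counter(w for i in kept for w in tokens[i])
    # (d) Relevance detection: average topic frequency of each sentence's words.
    scores = {i: sum(topics[w] for w in tokens[i]) / len(tokens[i]) for i in kept}
    # (e) Summary generation: top-scoring sentences, in their original order.
    best = sorted(kept, key=lambda i: scores[i], reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(best))
```

Calling `summarize(document_text, max_sentences=2)` on a plain-text string returns an extract of at most two sentences. Keeping each stage in its own function mirrors the modular design described above, so any stand-in can be swapped for a stronger method without touching the others.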

Task B2. Integrate COMPENDIUM in an accessible and easy-to-use platform (Resp. A. Suárez)

Once the COMPENDIUM tool has been adapted for Spanish public sector documents (Task B1), the goal of this task is to integrate the resulting software into a user-friendly platform, so that it is accessible and easy to use. To this end, we will take into account the web accessibility principles described by the World Wide Web Consortium (W3C) (https://www.w3.org/WAI/fundamentals/accessibility-intro/), not only for the design of the Web interface but also for the way the generated summary is presented, meeting the Web Content Accessibility Guidelines (WCAG) standards defined in: https://www.w3.org/WAI/standards-guidelines/wcag/.
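
As a rough illustration of presenting a generated summary with accessibility in mind, the following Python sketch renders summary sentences as a semantic HTML fragment. The function name `render_summary_html`, the element ids and the default heading text are hypothetical. The sketch only shows a few common practices (declaring the content language, giving the region a labelled heading, escaping text) and does not by itself establish WCAG conformance.

```python
from html import escape


def render_summary_html(summary_sentences, lang="es", heading="Resumen"):
    """Render summary sentences as a labelled, language-tagged HTML section."""
    # Escape every sentence so that characters such as "<" are shown as text.
    paragraphs = "\n".join(f"  <p>{escape(s)}</p>" for s in summary_sentences)
    return (
        # The lang attribute tells assistive technologies which language to use.
        f'<section lang="{escape(lang)}" aria-labelledby="summary-heading">\n'
        # The heading gives the region an accessible name via aria-labelledby.
        f'  <h2 id="summary-heading">{escape(heading)}</h2>\n'
        f"{paragraphs}\n"
        "</section>"
    )
```

For example, `render_summary_html(["Se aprueba el presupuesto municipal."])` returns a `<section>` element marked as Spanish that contains one heading and one paragraph.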

Results of this module:

COMPENDIUM system and an accessible, easy-to-use interface for Spanish public sector documentation.