RELIANCE Text Mining Services in EOSC

RELIANCE text mining and enrichment services integrated in EOSC enable researchers from Earth Science Communities and Copernicus Users to leverage the wealth of knowledge in scientific publications and Research Objects.

Enrichment

The semantic enrichment process generates new metadata from the text content of files or collections of files, such as Research Objects. This metadata comprises the main concepts found in resources containing text, the main knowledge areas in which these concepts are most frequently used, the main expressions found in the text (known in computational linguistics as noun phrases), and named entities, which are further classified into people, organizations, and places. The core of the semantic enrichment process is the expert.ai software. Expert.ai uses a proprietary semantic network in which words that share the same meaning are grouped into concepts, and concepts are related to one another through linguistic relations such as hypernymy and hyponymy, among many others. The semantics of the generated metadata is therefore explicit, since the concepts are grounded in the semantic network.

Information retrieval processes, including search engines and recommendation systems, benefit from working with concepts instead of the character strings that represent words: they can return a more complete and accurate set of results, and they allow file and research object collections to be explored through facets wherever the semantic metadata is available.

Document Enrichment

Files must be of one of the following types: Word documents, PDF documents, text files, or PowerPoint presentations. The text of each file is fed into expert.ai to generate the metadata representing its content. Expert.ai can identify the following metadata types in the text:

     •   Domains
     •   Main Concepts
     •   Main Lemmas
     •   Main Expressions
     •   Named entities: all the named entities found in the text classified into People, Organizations and Places.

All these metadata types are added to the response as annotations in a JSON file. The example below shows how the service can be called.

API example

curl -X POST -F '[email protected]' https://reliance.expertcustomers.ai/eosc/enrichment




Enriched Research Objects

ROHub uses the Enrichment Service to enrich research objects with the semantic annotations extracted by our service.

Search

The search index used by the search service hosts the collection of research objects from the ROHub platform that have previously been enriched. The annotations, added to the original metadata of each research object, are leveraged to produce more accurate results and to provide new facets for exploring the collection. The index also serves as the core of the recommendation API, which returns recommended research objects from this collection. The goal of the search service is thus to improve the exploration of the research object collection hosted by ROHub and to let users run faceted and semantic searches over it based on text content.

The search index follows a schema that accommodates the metadata obtained from the ROHub platform and the annotations generated by the enrichment API. It has six facets: Concepts (the most frequent concepts mentioned in the text), Expressions (the most relevant phrases or collocations found in the text), Domains (the fields of knowledge in which the main concepts are most commonly used), People, Places, and Organizations. These facet fields, along with the rest of the documents hosted in the index, are updated every time a research object is created or updated in ROHub. Each indexed document also carries other related information, such as the title, the description, or the creator of the research object, which can be accessed through the Search API.

Search API

The index is built on Solr 8.9.0 and can be accessed using the SolrJ API or by sending queries directly to the service. To do so, you must log in with a standard user account, which is allowed to run search queries against the index. More information about how the Solr API works and which types of queries can be sent to this service is available on the official Solr site. The following Solr tutorials are a good starting point:

     •   Search for a single term
     •   Field Searches
     •   Phrase Search
     •   Combining Searches

More advanced topics are covered in:

     •   Common Query Parameters
     •   The Standard Query Parser

Query example

The following query returns a JSON document with the research objects that contain the word "inSAR":
curl --user standard_user:standard_user -H "Content-Type: application/json" "https://reliance.expertcustomers.ai/solr/ROHub/select?q=inSAR"
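
For readers working from a notebook, the same query can be prepared in Python. The sketch below only builds the request and does not send it. The `rows` and `wt` parameters are standard Solr query parameters that the original example does not set, and the helper name `build_search_request` is hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example above.
SOLR_SELECT = "https://reliance.expertcustomers.ai/solr/ROHub/select"


def build_search_request(query, user="standard_user", password="standard_user", rows=10):
    """Return an unsent urllib Request for a Solr select query (hypothetical helper)."""
    # rows and wt are assumptions added for illustration; the curl example omits them.
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    # HTTP Basic authentication, equivalent to curl's --user option.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(f"{SOLR_SELECT}?{params}", headers={"Authorization": f"Basic {token}"})


request = build_search_request("inSAR")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` and reading the body with `json.load` would return the Solr response, provided the service is reachable and the account is valid.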



Facet Query example

The following query returns a JSON document with the research objects that have "Augustine Volcano" among their mentioned places:
curl --user standard_user:standard_user -H "Content-Type: application/json" "https://reliance.expertcustomers.ai/solr/ROHub/select?fq=place:Augustine%20Volcano&q=*:*"
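
A filter query like this can also be assembled programmatically. The sketch below is an illustration under assumptions: it wraps the value in double quotes so that Solr's standard query parser treats a multi-word value as a single phrase (the curl example above sends it unquoted), and it optionally asks for facet counts through the standard `facet` and `facet.field` parameters. Only the `place` field name comes from the example above; the helper names and the sample facet list are made up.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
SOLR_SELECT = "https://reliance.expertcustomers.ai/solr/ROHub/select"


def build_facet_url(field, value, facet_fields=()):
    """Build a select URL that filters on an exact phrase in one field (hypothetical helper)."""
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    params = [("q", "*:*"), ("fq", f'{field}:"{escaped}"')]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{SOLR_SELECT}?{urlencode(params)}"


def pair_facet_counts(flat):
    """Turn Solr's default flat [term, count, term, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


print(build_facet_url("place", "Augustine Volcano", facet_fields=["place"]))

# Made-up sample shaped like facet_counts.facet_fields["place"] in a Solr JSON response.
sample = ["Augustine Volcano", 3, "Alaska", 2]
print(pair_facet_counts(sample))  # {'Augustine Volcano': 3, 'Alaska': 2}
```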



Recommendation

The recommendation system suggests research objects that might be of interest given a user’s research interests. It follows a content-based approach: it compares the content of each research object with the user’s interests to draw up the list of recommended items. This comparison is based on the annotations added by the semantic enrichment process. The user’s interests are identified from the top concepts in the user’s own research objects, and these concepts are then compared with the concepts that annotate the research objects in the whole collection. The user’s interests can be extended by i) adding specific research objects from other users or ii) adding a different scientist. In the former case, the main concepts of the research object are added to the user’s interests; in the latter, the scientist’s interests are added. The recommendation system has a REST API and a web user interface called Collaboration Spheres.

Recommendation API

The recommendation service REST API accepts POST requests and returns a JSON document with the list of research objects that make up the recommendation. The service is currently deployed at: http://reliance.expertcustomers.ai/spheresbackend/services/jsonservices/api. To include research objects or scientists in the recommendation context, the service accepts a JSON document of the form {"ros": ["uri-1", ...], "scientists": ["uri-2", ...]}, where "ros" is an array of the URIs of the research objects to add to the recommendation context and "scientists" is an array of the URIs of the users to add. To be consistent with the definition of context in Collaboration Spheres, at most three URIs (research objects, users, or a combination of both) can be added to the recommendation context. The example below shows how the service can be called.

API example

curl -d '{"ros": ["https://w3id.org/ro-id/038179f2-f2dc-4cd6-a8ab-28765fb35950"], "scientists":[]}' -H "Content-Type: application/json" -X POST https://reliance.expertcustomers.ai/spheresbackend/services/jsonservices/api
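
The three-URI limit on the recommendation context can be checked on the client before a request is made. The following is a minimal sketch that builds, but does not send, the POST request. The helper name and the client-side check are assumptions for illustration; how the service itself reacts to a larger context is not documented here.

```python
import json
from urllib.request import Request

# Endpoint taken from the curl example above.
RECOMMENDATION_API = "https://reliance.expertcustomers.ai/spheresbackend/services/jsonservices/api"
MAX_CONTEXT_URIS = 3  # limit stated for the Collaboration Spheres context


def build_recommendation_request(ros=(), scientists=()):
    """Return an unsent POST request for the recommendation API (hypothetical helper)."""
    ros, scientists = list(ros), list(scientists)
    # Client-side check only; the server's own behavior is not described in the text.
    if len(ros) + len(scientists) > MAX_CONTEXT_URIS:
        raise ValueError(f"at most {MAX_CONTEXT_URIS} URIs can be added to the context")
    body = json.dumps({"ros": ros, "scientists": scientists}).encode("utf-8")
    return Request(
        RECOMMENDATION_API,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_recommendation_request(
    ros=["https://w3id.org/ro-id/038179f2-f2dc-4cd6-a8ab-28765fb35950"]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```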



Comprehension

Claim Analysis

We define a scientific claim as a statement that can be verified within the scientific literature. It can be an assertion about a specific scientific subject, such as “The life cycle of ferns is characterized by two phases: gametophyte and sporophyte”, and it should be verifiable against a reliable, validated source. Our goal is to use claims to connect research objects and scientific publications. The claim analysis pipeline extracts, compares, and connects claims between research objects and scholarly communications in an external scientific repository. The input of the pipeline is the description of a research object from ROHub, and the output is one or more pairs of claims: one claim from the research object and another from the database.

API example

curl -d 'Coronavirus disease 2019 (COVID-19), first reported in Wuhan, the capital of Hubei, China, has been associated to a novel coronavirus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In March 2020, the World Health Organization declared the SARS-CoV-2 infection a global pandemic. Soon after, the number of cases soared dramatically, spreading across China and worldwide. Italy has had 12,462 confirmed cases according to the Italian National Institute of Health (ISS) as of March 11, and after the '\''lockdown'\'' of the entire territory, by May 4, 209,254 cases of COVID-19 and 26,892 associated deaths have been reported. We performed a review to describe, in particular, the origin and the diffusion of COVID-19 in Italy, underlying how the geographical circulation has been heterogeneous and the importance of pathophysiology in the involvement of cardiovascular and neurological clinical manifestations. ' -H "Content-Type: text/plain" -X POST https://reliance.expertcustomers.ai/extended/claim_analysis



Challenges and Solutions

This service extracts the main challenge and the main proposal from the title and description of a research object. To achieve this, we fine-tuned a language model to classify a sentence as challenge, solution, or none, using a dataset of texts from an innovation management platform annotated with problems and solutions. The dataset contains 300 texts and was annotated by a group of 7 annotators; a subset of 20 texts was annotated by all of them so that inter-rater agreement could be calculated as a quality metric for the annotation process.

API example

curl -d '{"title": "Further to the Left: Stress-Induced Increase of Spatial Pseudoneglect During the COVID-19 Lockdown", "description": "Background The measures taken to contain the coronavirus disease 2019 (COVID-19) pandemic, such as the lockdown in Italy, do impact psychological health; yet less is known about their effect on cognitive functioning..."} ' -H "Content-Type: application/json" -X POST https://reliance.expertcustomers.ai/extended/csextractor



Novelty Score

This service calculates the novelty score of a research object. If a research object is similar to existing research work (in the collection of research objects or in the publications), its novelty score is very low, and vice versa. Similarities range between 0 and 1, and the novelty score accordingly ranges between 0 and 100.

API example

curl -d '{"id": "https://w3id.org/ro-id/6bc62582-2f11-4983-a51c-0af32459eca6"} ' -H "Content-Type: application/json" -X POST https://reliance.expertcustomers.ai/extended/novelty_calculation
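
The exact formula behind the score is not given here. One simple mapping that is consistent with the stated ranges, and with high similarity yielding low novelty, is a linear inversion. The sketch below is an assumption for illustration only, not the service's documented computation, and the function name is hypothetical.

```python
def novelty_from_similarity(similarity):
    """Map a similarity in [0, 1] to a novelty score in [0, 100].

    Assumed linear inversion; the service's actual formula is not documented here.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return (1.0 - similarity) * 100.0


print(novelty_from_similarity(0.25))  # 75.0
```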



Question Generation

This service helps users better understand the content of a research object by challenging them to test their understanding and by providing answers to the proposed questions. Starting from the research object id, we retrieve the title and description from the Elasticsearch index. This text is then used to generate the questions and answers by calling the question generation (QG) and question answering (QA) modules. All the questions and answers are finally returned to the user.

API example

curl -d '{"id": "https://w3id.org/ro-id/9bf840e3-7a39-41fc-be39-7eed9dc294db"} ' -H "Content-Type: application/json" -X POST https://reliance.expertcustomers.ai/extended/question_generation
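
The novelty score and question generation services both take a research object id in a JSON body, so one small helper can prepare either call. The sketch below builds the request without sending it. The helper names are hypothetical, and because the layout of the returned JSON is not described here, `send` simply returns the parsed document as-is.

```python
import json
from urllib.request import Request, urlopen

# Base URL shared by the comprehension endpoints in the curl examples.
EXTENDED_BASE = "https://reliance.expertcustomers.ai/extended"


def build_id_request(service, ro_id):
    """Return an unsent POST request carrying {"id": ro_id} (hypothetical helper)."""
    body = json.dumps({"id": ro_id}).encode("utf-8")
    return Request(
        f"{EXTENDED_BASE}/{service}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and parse the JSON response (requires network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_id_request(
    "question_generation",
    "https://w3id.org/ro-id/9bf840e3-7a39-41fc-be39-7eed9dc294db",
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` would perform the actual call; swapping the first argument for `"novelty_calculation"` targets the novelty score service instead.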



More materials

EGI Notebook Tutorial

Learn how to invoke our APIs with the Jupyter Notebook we have released in EGI.

It is available under datahub/Reliance/Text_Mining_Tutorial/