Scientific Literature Generation II

Yassine Kaddi
3 min read · Jan 13, 2022

I’ve explained here how scientific literature generation worked in the first versions of naimai. As pointed out there, in those first versions (v0.1 and v0.2) I analyzed the full text of each article, extracted the sentences stating the paper’s objective and reformulated them. As a result, paper processing (briefly explained here) was pretty slow. That slowness is also why only about 15 000 articles were analyzed and only a few fields were covered. In this new version (v1.0), I analyze only the abstracts instead of the whole text, and I’ve improved the processing part (a BERT model instead of TfIdf, as explained here) as well as the text generation part. I’ve also added more papers, ending up with a total of about 540 000 papers spread over 14 fields (detailed below). By the way, I’ve created a Facebook page for this algorithm: https://www.facebook.com/naimai4science.

Let me start off by briefly reminding you how naimai works.

How it works

First of all, you choose the field of interest in the lateral menu. You can also select the order of the reviews (by relevance or by date of publication) and the number of references desired. Then you enter your keywords (in English) in the search area. In the preceding versions, the search results were based on the TfIdf method, which matches the words of the query against the words of the papers. They are now based on a semantic search algorithm that uses a language model called BERT (Sentence-transformers, for the curious ones) and captures the meaning of the keywords better. Hence, the results are expected to be more relevant. Just a small detail: abbreviations were not taken into account during processing, so it’s better to avoid abbreviations when searching. This point will be addressed in the next version (I promise!).
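To make the search step a bit more concrete, here is a minimal sketch of how such a ranking could look. It is not naimai’s actual code: the names Paper, embed, cosine, search and PAPERS are made up for this illustration, the three papers are invented, and embed is a toy word-count vector standing in for the BERT sentence encoder, so this toy only matches exact words. The part it illustrates is the ranking itself: papers are scored by cosine similarity to the query, the best ones are kept, and they are shown by relevance or by date.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    abstract: str
    published: date


def embed(text):
    """Toy stand-in for a sentence encoder: a sparse word-count vector.
    (Assumption: a real system would return a dense BERT embedding here.)"""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, papers, n_refs=2, order="relevance"):
    """Keep the n_refs papers closest to the query, then order them
    by relevance (default) or by publication date, newest first."""
    q = embed(query)
    scored = [(cosine(q, embed(p.abstract)), p) for p in papers]
    top = sorted(scored, key=lambda sp: sp[0], reverse=True)[:n_refs]
    if order == "date":
        top.sort(key=lambda sp: sp[1].published, reverse=True)
    return [p for _, p in top]


# Invented example data, only for this sketch.
PAPERS = [
    Paper("Graph networks for molecules",
          "Graph neural networks for molecule property prediction",
          date(2021, 6, 1)),
    Paper("Image classification",
          "Neural networks for image classification",
          date(2020, 3, 1)),
    Paper("Soil erosion",
          "Soil erosion in mountain regions",
          date(2019, 1, 1)),
]

for paper in search("neural networks image", PAPERS, n_refs=2):
    print(paper.published, paper.title)
```

With a real sentence encoder in place of embed (for instance the encode method of a SentenceTransformer model from the Sentence-transformers library), the rest of the function would stay the same, and a query could then match papers that use different words for the same idea.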

Data : Papers and Fields

In this version, I’ve used open access paper data from arXiv (here), from Elsevier (here) and some from bioRxiv (here). I ended up with about 540 000 papers in total, and thus with more fields than in the preceding versions. I tried to define the fields so that they cover many areas of science while grouping related papers together, and I must admit that it was tough, since some fields have a lot in common. Hopefully this separation of fields is fine. The fields are broken down as follows (the number on the bars is the total number of papers in the field):

The subfields considered in each field can be found in “Fields information..” in the lateral menu on naimai.

Next

In the next versions, I’ll start by adding many more papers (hopefully in the fields with less data). I’ll try to improve text generation and to extract the method and the results of each paper along with its objective. I might also try to return reviews not only from the chosen field but also from other fields that might have relevant results (instead of being stuck in one field). That being said, I’ll be happy to have your thoughts about the work :)
