Scientific Literature Review Generation
Introduction
During my PhD, I often needed to do a literature review to get my head around the topics I was working on. This could be time-consuming (I even felt sleepy sometimes..). Ask any PhD student (or any scientist, really) and they'll tell you that literature reviews are to scientists what Jesus is to Christians!
Two cases arise when I look up a topic in the literature: either the topic is intensively explored (so I end up with tons of papers to read), or it is rarely explored and only a few papers have been published. Sometimes interesting work has been carried out somewhere but is hard to find (for example, if it isn't published in a well-known journal or isn't widely cited).
Since the lazy person in me likes to automate things, and since I'm passionate about natural language processing (NLP), I developed an algorithm that generates a review from a bunch of papers. I called it NaimAI (actually pronounced Naymay..). The algorithm is deployed at naimai.fr (sometimes offline for upgrades..), and my goal in this article is to explain the main ideas behind the curtain.
Objective
The main objective of NaimAI is to generate a review from a set of papers. Say I'm new to hydraulic modelling and want to get an idea about one-dimensional modelling of inundations. The old way would be to:
- Find the papers: ask people around me working on the same topic about the hot papers, look them up on Google or other websites (Google Scholar, publishers' websites..), check the references of the papers I've already read..
- Read the papers: I generally start with the abstract and conclusion (and maybe skim the introduction) to guess whether the paper would be interesting for me. Needless to say, this step takes quite a bit of time..
The problem is that I might spend time on papers that turn out not to be that useful, and I often end up keeping only a few of them. I'd rather spend that lost time looking for more interesting papers.
With this algorithm, my goal is to quickly get a global picture of the work being done by generating a review. That way, I can easily target the papers I want and save time.
The idea
The main idea is simple:
- Inputs: the papers (N papers in the figure above) and the query.
- PDF processing: this step mainly consists of reading the PDFs, cleaning the text, and getting it ready for the classification step.
- PDF classification: the processed papers are ranked against the given query. Two methods are implemented in NaimAI: a Doc2Vec method and a TF-IDF method (via the scikit-learn library). The deployed version (on naimai.fr) uses only the TF-IDF method.
- Text generation: once the papers are classified, NaimAI identifies, for each paper, (1) the authors' names, (2) the publication year, and (3) the sentences where the paper's objective is stated. A review phrase is then generated for each paper (example: X et al. 2021 showed that … Y et al. 1999 worked on…). That said, if the algorithm fails to identify the authors' names, you might end up with something weird; but as long as the references are cited (or can be downloaded), you'll spot where it failed. This part will be enhanced in future versions.
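To give a feel for the classification step, here is a minimal sketch of TF-IDF ranking with scikit-learn. This is not NaimAI's actual code; the paper texts and query are made-up placeholders standing in for the processed papers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the processed paper texts
papers = [
    "one dimensional hydraulic modelling of river inundations",
    "deep learning for image classification",
    "numerical simulation of flood propagation in channels",
]
query = "one dimensional modelling of inundations"

# Fit TF-IDF on the papers, then project the query into the same space
vectorizer = TfidfVectorizer(stop_words="english")
paper_vectors = vectorizer.fit_transform(papers)
query_vector = vectorizer.transform([query])

# Rank papers by cosine similarity to the query, best match first
scores = cosine_similarity(query_vector, paper_vectors).ravel()
ranking = scores.argsort()[::-1]
for idx in ranking:
    print(f"score={scores[idx]:.2f}  {papers[idx]}")
```

Here the first paper shares the most vocabulary with the query, so it comes out on top; papers with no overlapping terms score zero.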
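The text-generation step, once the authors, year, and objective sentence are extracted, amounts to filling a template per paper. A toy sketch under my own assumptions (the field names, helper, and extracted data below are hypothetical, not NaimAI's internals):

```python
def review_phrase(first_author, year, objective):
    """Turn one paper's extracted metadata into one review sentence.

    `objective` is assumed to already be the extracted sentence where
    the paper states its goal, starting with a verb phrase.
    """
    return f"{first_author} et al. {year} {objective}"

# Made-up extracted metadata, for illustration only
papers = [
    {"first_author": "Smith", "year": 2021,
     "objective": "showed that 1D models capture flood peaks well."},
    {"first_author": "Doe", "year": 1999,
     "objective": "worked on coupling 1D and 2D hydraulic models."},
]

# The review is simply the concatenation of the per-paper phrases
review = " ".join(review_phrase(**p) for p in papers)
print(review)
```

This also shows why a failed author-name extraction produces a weird sentence: whatever lands in `first_author` is pasted verbatim into the phrase.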
NaimAI
The steps described above are implemented in an algorithm deployed using Django here. I've already processed thousands of papers in many fields (mainly from arXiv). All these fields can be found in the Field menu:
As you have probably noticed, you can also set the phrase length (short or long) and the number of references in the menu before searching.
Next
There are certainly many areas to enhance or add: author name identification, reviewing your own custom papers, the processing itself, text generation, etc. I'd also like the model to identify not only the objective but also the method and the results, enabling a more precise review of methods and results too.
I'll fix these problems progressively in the next versions, though I'm quite curious to try more advanced techniques for text generation (bidirectional Transformer models with attention mechanisms)..
Hopefully this will be useful. In these first versions, you should expect some weird (and funny) results, but they'll be fixed as soon as possible!