The body of neurosurgical literature has grown exponentially, with publication rates increasing year over year. As a result, manual screening of abstracts for systematic reviews and guideline formation has become arduous. Natural language processing, in particular large language models (LLMs), has shown promise in automating the abstract screening process. Nitturi et al. evaluated whether Gemini Pro and ChatGPT, two LLMs, can automate the screening of abstracts for a guideline created by the Congress of Neurological Surgeons.
They developed novel pipelines using Gemini Pro and ChatGPT-4o-mini to screen abstracts for guideline creation. They tested the pipelines on abstracts retrieved with the EMBASE search term provided in a Congress of Neurological Surgeons guideline on Chiari I malformations, for a single population, intervention, comparison, and outcome (PICO) question. Only two inclusion/exclusion criteria were used, and a simplified version of the research question was supplied as input.
Of the 1764 abstracts returned by the search, 22 were manually judged relevant for guideline creation. Using Gemini Pro, 1043 articles were correctly excluded, and only 1 was incorrectly excluded, resulting in a sensitivity of 95% and a specificity of 60%. Using ChatGPT-4o-mini, 1066 articles were correctly excluded, but only 4 articles were correctly included, resulting in a sensitivity of 18% and a specificity of 95%. Both pipelines completed the screening process in under 1 hour.
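The Gemini Pro figures can be reproduced from the reported counts. The sketch below is illustrative only: the function and variable names are assumptions, and it relies on the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), with the 22 manually selected abstracts treated as the reference standard.

```python
def screening_metrics(total, relevant, true_negatives, false_negatives):
    """Derive sensitivity and specificity from the counts of one screening run.

    total           -- all abstracts screened
    relevant        -- abstracts judged relevant by manual review (reference standard)
    true_negatives  -- irrelevant abstracts the model correctly excluded
    false_negatives -- relevant abstracts the model incorrectly excluded
    """
    irrelevant = total - relevant
    true_positives = relevant - false_negatives
    false_positives = irrelevant - true_negatives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "sensitivity": true_positives / relevant,
        "specificity": true_negatives / irrelevant,
    }


# Counts reported for the Gemini Pro pipeline.
gemini = screening_metrics(
    total=1764, relevant=22, true_negatives=1043, false_negatives=1
)
print(f"sensitivity = {gemini['sensitivity']:.1%}")  # 95.5%
print(f"specificity = {gemini['specificity']:.1%}")  # 59.9%
```

Rounded to whole percentages, these match the reported 95% and 60%.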
They developed novel LLM pipelines to automate abstract screening for neurosurgical guideline creation. This technology can reduce the time necessary for abstract screening processes from several weeks to a few hours. While further validation is required, this process could pave the way for evidence-based guidelines to be continuously updated in real time across medical fields 1).
Reference: Nitturi V, Flores A, Bauer DF. *Using Natural Language Processing to Automate Screening of Abstracts for Neurosurgical Guideline Creation*. Neurosurgery. 2025 Apr 14. doi: 10.1227/neu.0000000000003450. PMID: 40227043.
The exponential growth of neurosurgical literature has made manual abstract screening for systematic reviews and guideline creation increasingly burdensome. The authors propose using large language models (LLMs), specifically Gemini Pro and ChatGPT-4o-mini, to automate this process for a Congress of Neurological Surgeons (CNS) guideline on Chiari I malformations.
The authors created two NLP pipelines to screen 1,764 abstracts retrieved from EMBASE. Only 22 were manually identified as relevant. Two inclusion/exclusion criteria were applied using simplified research question prompts.
| Model | Sensitivity | Specificity | Notes |
|---|---|---|---|
| Gemini Pro | 95% | 60% | High sensitivity, but low specificity: many irrelevant abstracts pass the screen. |
| ChatGPT-4o-mini | 18% | 95% | High specificity, but misses most relevant abstracts. |
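Because only 22 of the 1,764 abstracts were relevant, specificity says little about how much manual work remains after screening. As a rough illustration, the sketch below derives precision and the remaining review workload for Gemini Pro from the reported counts. The variable names are assumptions, and these derived figures are not reported by the authors.

```python
# Counts reported for the Gemini Pro pipeline.
total_abstracts = 1764
relevant = 22
correctly_excluded = 1043   # true negatives
incorrectly_excluded = 1    # false negatives

true_positives = relevant - incorrectly_excluded                     # 21
false_positives = (total_abstracts - relevant) - correctly_excluded  # 699

# Everything the model did not exclude still needs a human reader.
flagged_for_review = true_positives + false_positives
precision = true_positives / flagged_for_review
workload_reduction = 1 - flagged_for_review / total_abstracts

print(f"abstracts left for manual review: {flagged_for_review}")  # 720
print(f"precision: {precision:.1%}")                              # 2.9%
print(f"manual workload avoided: {workload_reduction:.1%}")       # 59.2%
```

Under these counts, a reviewer would still read 720 abstracts to find 21 relevant ones, which is the practical cost of favoring sensitivity.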
| Criterion | Rating (0-10) |
|---|---|
| Relevance | 9 |
| Methodological Rigor | 6 |
| Innovation | 8 |
| Generalizability | 5 |
| Clinical Applicability | 4 |
| Future Potential | 9 |
| Overall Score | 6.8 / 10 |
This work is a promising first step toward automating systematic review processes in neurosurgery. However, current LLM performance is too uneven for full clinical integration: ChatGPT-4o-mini missed most relevant abstracts (sensitivity 18%), while Gemini Pro retained many irrelevant ones (specificity 60%). Broader validation, better prompt design, and user-centric tools are needed before such systems can safely support or replace manual screening in high-stakes guideline creation.
Objective: to design and document effective prompts for supporting the drafting, screening, and validation of neurosurgical clinical practice guidelines using large language models (LLMs) such as ChatGPT, Gemini, or Claude.
Given the following PICO question and inclusion/exclusion criteria, identify whether each abstract should be included in a systematic review for neurosurgical guidelines. Justify the decision.
Summarize the level of evidence, population size, study type, and outcome relevance for each abstract listed below.
From this batch of abstracts, extract those that meet Level I or Level II evidence for surgical intervention in [condition].
Draft evidence-based recommendations using GRADE methodology based on these summarized studies. Format each as 'We recommend' or 'We suggest'.
Convert these study summaries into formal guideline statements for neurosurgeons, including strength of recommendation and level of evidence.
Evaluate the methodological quality and bias risk of the following studies using the ROBINS-I or Cochrane Risk of Bias tool.
Compare and synthesize the findings from these RCTs and cohort studies on surgical versus conservative treatment of [condition].
Extract key data fields (author, year, study type, n, intervention, comparator, outcome, results, level of evidence) from the following study abstracts.
Turn this list of abstracts into a tabular summary suitable for inclusion in a neurosurgical guideline evidence matrix.
Translate these guideline recommendations into clinical practice tips for low-resource settings.
Provide a brief critique of current guideline gaps in [specific neurosurgical area], based on latest evidence.
You are a GPT-powered assistant tasked with screening abstracts for inclusion in a neurosurgical guideline. Apply these inclusion/exclusion rules and explain each decision in under 40 words.
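A screening prompt like the one above could be wired into a small pipeline roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline used by Nitturi et al.: the `llm` argument is a hypothetical callable standing in for whichever model API is used, the INCLUDE/EXCLUDE reply format is an added convention, and the question, rules, and abstracts in the usage example are placeholders.

```python
from __future__ import annotations

from typing import Callable

SYSTEM_PROMPT = (
    "You are a GPT-powered assistant tasked with screening abstracts for "
    "inclusion in a neurosurgical guideline. Apply these inclusion/exclusion "
    "rules and explain each decision in under 40 words."
)


def build_prompt(question: str, criteria: list[str], abstract: str) -> str:
    """Combine the system prompt, research question, rules, and one abstract."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(criteria, start=1))
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Research question: {question}\n"
        f"Inclusion/exclusion rules:\n{rules}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer INCLUDE or EXCLUDE on the first line, then the justification."
    )


def parse_decision(reply: str) -> tuple[bool, str]:
    """Read the verdict from the first line of the model reply."""
    first_line, _, rest = reply.strip().partition("\n")
    verdict = first_line.strip().upper()
    if verdict.startswith("INCLUDE"):
        return True, rest.strip()
    if verdict.startswith("EXCLUDE"):
        return False, rest.strip()
    # Fail safe: an unparseable reply is kept for manual review, not dropped.
    return True, "Unparseable reply; flagged for manual review."


def screen(abstracts: list[str], question: str, criteria: list[str],
           llm: Callable[[str], str]) -> list[dict]:
    """Screen each abstract with the supplied model callable."""
    results = []
    for abstract in abstracts:
        reply = llm(build_prompt(question, criteria, abstract))
        include, reason = parse_decision(reply)
        results.append({"abstract": abstract, "include": include, "reason": reason})
    return results


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only for this demo."""
    abstract = prompt.split("Abstract:\n", 1)[1]
    if "decompression" in abstract:
        return "INCLUDE\nReports surgical outcomes in the target population."
    return "EXCLUDE\nDoes not address the research question."


# Placeholder question, rules, and abstracts (not those used in the study).
results = screen(
    abstracts=[
        "Outcomes after posterior fossa decompression in 40 patients.",
        "A survey of residency training hours.",
    ],
    question="Does surgery improve symptoms in Chiari I malformation?",
    criteria=["Include studies of human patients.", "Exclude case reports."],
    llm=fake_llm,
)
for row in results:
    print(row["include"], "-", row["reason"])
```

Treating unparseable replies as inclusions is a deliberate choice in this sketch: in guideline screening, a missed relevant abstract is usually costlier than an extra one sent to manual review.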