Using Natural Language Processing to Automate Screening of Abstracts for Neurosurgical Guideline Creation

The body of neurosurgical literature has grown exponentially, with publication rates increasing year over year. Manual screening of abstracts for systematic reviews and guideline formation has become arduous because of the sheer volume of literature. Natural language processing, in particular large language models (LLMs), has shown promise in automating the abstract screening process. Nitturi et al. evaluated whether two LLMs, Gemini Pro and ChatGPT, can automate the screening of abstracts for a guideline created by the Congress of Neurological Surgeons.

They developed novel pipelines using Gemini Pro and ChatGPT-4o-mini to screen abstracts for guideline creation. They tested the pipelines on abstracts retrieved with the EMBASE search term provided in a Congress of Neurological Surgeons guideline on Chiari I malformations, for a single population, intervention, comparison, and outcome (PICO) question. They used only two inclusion/exclusion criteria and supplied a simplified version of the research question under investigation.

Of the 1764 abstracts retrieved by the search, 22 had been manually judged relevant for guideline creation. Using Gemini Pro, 1043 articles were correctly excluded, and only 1 was incorrectly excluded, resulting in a sensitivity of 95% and a specificity of 60%. Using ChatGPT-4o-mini, 1066 articles were correctly excluded, but only 4 articles were correctly included, resulting in a sensitivity of 18% and a specificity of 95%. Both pipelines completed the screening process in under 1 hour.

The authors developed novel LLM pipelines to automate abstract screening for neurosurgical guideline creation. This technology can reduce the time needed for abstract screening from several weeks to a few hours. While further validation is required, this process could pave the way for evidence-based guidelines to be continuously updated in real time across medical fields ¹⁾.

Critical Review: NLP-Assisted Abstract Screening for Neurosurgical Guidelines

Reference: Nitturi V, Flores A, Bauer DF. *Using Natural Language Processing to Automate Screening of Abstracts for Neurosurgical Guideline Creation*. Neurosurgery. 2025 Apr 14. doi: 10.1227/neu.0000000000003450. PMID: 40227043.

1. Context and Relevance

The exponential growth of neurosurgical literature has made manual abstract screening for systematic reviews and guideline creation increasingly burdensome. The authors propose using Large Language Models (LLMs)—specifically Gemini Pro and ChatGPT-4o-mini—to automate this process for a Congress of Neurological Surgeons (CNS) guideline on Chiari I malformations.

✔ Strength: Addresses a real and growing challenge in evidence-based neurosurgery.
✘ Limitation: Tested on a single PICO question, limiting generalizability.

2. Methodology

The authors created two NLP pipelines to screen 1,764 abstracts retrieved from EMBASE. Only 22 were manually identified as relevant. Two inclusion/exclusion criteria were applied using simplified research question prompts.

✔ Strengths:
- Real-world dataset from a recognized guideline.
- Sensitivity and specificity reported.
- Comparison between LLMs is informative.

✘ Limitations:
- Simplified criteria may not reflect clinical complexity.
- Lack of prompt engineering transparency.
- Use of ChatGPT-4o-mini instead of full GPT-4 could bias outcomes.
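
The setup described in this section (a simplified research question, a short list of inclusion/exclusion criteria, and one model call per abstract) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the `Abstract` type, the `screen` function, and the `fake_llm` stand-in are all assumptions, and a real model client would be passed in place of `fake_llm`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Abstract:
    record_id: str
    text: str


def build_prompt(question: str, criteria: List[str], abstract: Abstract) -> str:
    """Combine the simplified question, the criteria, and one abstract."""
    rules = "\n".join(f"- {c}" for c in criteria)
    return (
        f"Research question: {question}\n"
        f"Inclusion/exclusion criteria:\n{rules}\n"
        f"Abstract:\n{abstract.text}\n"
        "Reply with INCLUDE or EXCLUDE on the first line."
    )


def screen(
    abstracts: Iterable[Abstract],
    question: str,
    criteria: List[str],
    ask_llm: Callable[[str], str],
) -> Dict[str, bool]:
    """Return {record_id: keep_for_manual_review} for every abstract."""
    decisions = {}
    for item in abstracts:
        reply = ask_llm(build_prompt(question, criteria, item)).strip()
        first_line = reply.splitlines()[0].upper() if reply else ""
        # Only an explicit EXCLUDE removes a record; unclear replies stay in
        # the manual-review pile, which favours sensitivity over specificity.
        decisions[item.record_id] = not first_line.startswith("EXCLUDE")
    return decisions


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; reads only the abstract text."""
    abstract_text = prompt.split("Abstract:\n", 1)[1]
    return "EXCLUDE" if "case report" in abstract_text.lower() else "INCLUDE"


# Illustrative inputs only; the question and criteria are placeholders.
demo = [
    Abstract("a1", "Cohort study of posterior fossa decompression outcomes."),
    Abstract("a2", "A single case report of an incidental finding."),
]
result = screen(
    demo,
    question="Does surgery improve outcomes in Chiari I malformation?",
    criteria=["Include clinical studies in human patients", "Exclude reviews"],
    ask_llm=fake_llm,
)
print(result)
```

Treating every reply other than a clear EXCLUDE as "keep" is one possible design choice for a first-pass screen, where a missed relevant study costs more than an extra abstract to read.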

3. Results

Model	Sensitivity	Specificity	Notes
Gemini Pro	95%	60%	High recall, but low specificity.
ChatGPT-4o-mini	18%	95%	High specificity, but misses many relevant items.

✔ Interpretation:
- Gemini Pro suitable for early-phase screening to avoid missing key studies.
- ChatGPT-4o-mini may reduce manual workload but risks underinclusion.

✘ Concern: ChatGPT’s 18% sensitivity is clinically insufficient for guideline formulation.
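
The reported percentages can be re-derived from the counts given in the abstract. The sketch below assumes that the 22 manually selected abstracts are the only relevant ones among the 1,764 screened, so the remaining 1,742 count as irrelevant; the function and variable names are illustrative.

```python
def screening_metrics(total: int, relevant: int, true_neg: int, false_neg: int) -> dict:
    """Derive the confusion matrix and summary metrics from reported counts."""
    irrelevant = total - relevant
    true_pos = relevant - false_neg      # relevant abstracts that were kept
    false_pos = irrelevant - true_neg    # irrelevant abstracts that were kept
    return {
        "sensitivity": true_pos / relevant,
        "specificity": true_neg / irrelevant,
        "precision": true_pos / (true_pos + false_pos),
        "flagged_for_review": true_pos + false_pos,
    }


# Gemini Pro: 1043 correctly excluded, 1 incorrectly excluded.
gemini = screening_metrics(total=1764, relevant=22, true_neg=1043, false_neg=1)
print({name: round(value, 3) for name, value in gemini.items()})
# sensitivity 0.955, specificity 0.599, precision 0.029, flagged_for_review 720
```

Under that assumption, Gemini Pro's figures match the reported 95% and 60%, and 720 abstracts would still need manual review. Applying the same arithmetic to the ChatGPT-4o-mini count as printed (1066 correctly excluded out of 1,742) gives a specificity of about 61%, which does not agree with the reported 95%, so that count is worth checking against the published paper.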

4. Technical and Practical Impact

✔ Positive:
- Screening time reduced from weeks to < 1 hour.
- Demonstrates feasibility of LLMs in medical workflows.
- Opens the door to real-time “living” guidelines.

✘ Limitations:
- No external validation on other neurosurgical topics.
- Lacks interface or integration strategy.
- LLMs not yet safe for autonomous use.

5. Overall Assessment

Criterion	Rating (0–10)
Relevance	9
Methodological Rigor	6
Innovation	8
Generalizability	5
Clinical Applicability	4
Future Potential	9
Overall Score	6.8 / 10
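
The overall score is consistent with an unweighted mean of the six criterion ratings. That weighting is an assumption, since the review does not state how the score was derived; a quick check:

```python
ratings = {
    "Relevance": 9,
    "Methodological Rigor": 6,
    "Innovation": 8,
    "Generalizability": 5,
    "Clinical Applicability": 4,
    "Future Potential": 9,
}

# Unweighted mean of the six ratings (assumed scoring rule).
overall = sum(ratings.values()) / len(ratings)
print(f"{overall:.1f} / 10")  # prints "6.8 / 10"
```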

6. Conclusion

This work is a promising first step toward automating systematic review processes in neurosurgery. However, current LLM performance, especially in sensitivity, prevents full clinical integration. Broader validation, better prompt design, and user-centric tools are needed before such systems can safely support or replace manual screening in high-stakes guideline creation.

Prompt Library for Neurosurgical Guideline Creation

🧠 Purpose

To design and document effective prompts for supporting the drafting, screening, and validation of neurosurgical clinical practice guidelines using large language models (LLMs) such as ChatGPT, Gemini, or Claude.

📘 Prompt Categories

🔍 Systematic Review and Abstract Screening

Prompt 1:

Given the following PICO question and inclusion/exclusion criteria, identify whether each abstract should be included in a systematic review for neurosurgical guidelines. Justify the decision.

Prompt 2:

Summarize the level of evidence, population size, study type, and outcome relevance for each abstract listed below.

Prompt 3:

From this batch of abstracts, extract those that meet Level I or Level II evidence for surgical intervention in [condition].

📘 Guideline Drafting / Recommendation Wording

Prompt 4:

Draft evidence-based recommendations using GRADE methodology based on these summarized studies. Format each as 'We recommend' or 'We suggest'.

Prompt 5:

Convert these study summaries into formal guideline statements for neurosurgeons, including strength of recommendation and level of evidence.

🧠 Critical Appraisal and Meta-Synthesis

Prompt 6:

Evaluate the methodological quality and bias risk of the following studies using the ROBINS-I or Cochrane Risk of Bias tool.

Prompt 7:

Compare and synthesize the findings from these RCTs and cohort studies on surgical versus conservative treatment of [condition].

📊 Data Extraction and Evidence Tables

Prompt 8:

Extract key data fields (author, year, study type, n, intervention, comparator, outcome, results, level of evidence) from the following study abstracts.

Prompt 9:

Turn this list of abstracts into a tabular summary suitable for inclusion in a neurosurgical guideline evidence matrix.
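
The data fields named in Prompt 8 map naturally onto a fixed record type, which makes the tabular summary requested in Prompt 9 easy to produce and to validate. The sketch below is one possible layout, not a prescribed schema: `EvidenceRow` and `to_tsv` are hypothetical names, and the sample row contains placeholder values rather than data from any study.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class EvidenceRow:
    author: str
    year: int
    study_type: str
    n: int
    intervention: str
    comparator: str
    outcome: str
    results: str
    level_of_evidence: str


def to_tsv(rows: Iterable[EvidenceRow]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(EvidenceRow)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder row for illustration only.
sample = EvidenceRow(
    author="Example A",
    year=2000,
    study_type="cohort",
    n=0,
    intervention="placeholder intervention",
    comparator="placeholder comparator",
    outcome="placeholder outcome",
    results="placeholder result",
    level_of_evidence="III",
)
table = to_tsv([sample])
print(table)
```

Parsing model output into a typed record before building the table gives an early failure when a field is missing or has the wrong type, rather than a silently malformed evidence matrix.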

📚 Clinical Contextualization

Prompt 10:

Translate these guideline recommendations into clinical practice tips for low-resource settings.

Prompt 11:

Provide a brief critique of current guideline gaps in [specific neurosurgical area], based on latest evidence.

🤖 LLM Testing and Evaluation

Prompt 12:

You are a GPT-powered assistant tasked with screening abstracts for inclusion in a neurosurgical guideline. Apply these inclusion/exclusion rules and explain each decision in under 40 words.
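
When testing a prompt such as Prompt 12, it helps to check each reply mechanically before trusting it. The sketch below assumes the model is asked to answer in the form `INCLUDE: reason` or `EXCLUDE: reason`; that reply format and the `parse_decision` helper are illustrative assumptions, not part of the prompt above.

```python
from typing import Tuple


def parse_decision(reply: str, max_words: int = 40) -> Tuple[bool, str]:
    """Return (include, explanation); raise ValueError on malformed replies."""
    head, _, rest = reply.strip().partition(":")
    label = head.strip().upper()
    if label not in {"INCLUDE", "EXCLUDE"}:
        raise ValueError(f"unrecognized decision label: {head!r}")
    explanation = rest.strip()
    if not explanation:
        raise ValueError("missing explanation")
    # "Under 40 words" means strictly fewer than max_words.
    if len(explanation.split()) >= max_words:
        raise ValueError("explanation exceeds the word limit")
    return label == "INCLUDE", explanation


include, reason = parse_decision("EXCLUDE: Narrative review without patient data.")
print(include, reason)
```

Replies that raise an error can be retried or routed to manual review instead of being counted as an exclusion.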

🧾 Notes

- Adapt prompts based on the specific PICO question and target population.
- Consider comparative testing across multiple models (GPT-4, Gemini Pro, Claude).
- Log false positives/negatives for refining instruction clarity.
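
Logging false positives and false negatives requires only the manual (gold-standard) labels and the model's decisions keyed by the same record identifiers. A minimal sketch, with hypothetical identifiers and function name:

```python
from typing import Dict, List


def classify_errors(gold: Dict[str, bool], predicted: Dict[str, bool]) -> Dict[str, List[str]]:
    """List record IDs the model wrongly included or wrongly excluded."""
    false_positives = sorted(
        rid for rid, included in predicted.items()
        if included and not gold.get(rid, False)
    )
    # A record missing from the predictions is treated as excluded.
    false_negatives = sorted(
        rid for rid, relevant in gold.items()
        if relevant and not predicted.get(rid, False)
    )
    return {"false_positives": false_positives, "false_negatives": false_negatives}


# Illustrative labels only.
gold_labels = {"r1": True, "r2": False, "r3": True, "r4": False}
model_labels = {"r1": True, "r2": True, "r3": False, "r4": False}
errors = classify_errors(gold_labels, model_labels)
print(errors)
```

Reading the abstracts behind each list is what points to unclear wording in the criteria: false negatives suggest a criterion is being applied too strictly, false positives that it is too loose.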

🔗 References

- Congress of Neurological Surgeons Methodology
- GRADE Handbook
- PRISMA Guidelines for Systematic Reviews

¹⁾ Nitturi V, Flores A, Bauer DF. Using Natural Language Processing to Automate Screening of Abstracts for Neurosurgical Guideline Creation. Neurosurgery. 2025 Apr 14. doi: 10.1227/neu.0000000000003450. Epub ahead of print. PMID: 40227043.
