Neurosurgical Guidelines Creation
The body of neurosurgical literature has grown exponentially, with publication rates increasing year over year. The sheer volume of literature has made manual abstract screening for systematic reviews and guideline formation arduous. Natural language processing, in particular large language models (LLMs), has shown promise in automating the abstract screening process. Nitturi et al. evaluated whether two LLMs, Gemini Pro and ChatGPT, can automate the screening of abstracts for a guideline created by the Congress of Neurological Surgeons.
They developed novel pipelines using Gemini Pro and ChatGPT-4o-mini to screen abstracts for guideline creation. The pipelines were tested on abstracts retrieved with the EMBASE search term provided in a Congress of Neurological Surgeons guideline on Chiari I malformations, for a single population, intervention, comparison, and outcome (PICO) question. Only two inclusion/exclusion criteria were used, and a simplified version of the research question was supplied as input.
Of the 1764 abstracts returned by the search, 22 were manually judged relevant for guideline creation. Using Gemini Pro, 1043 articles were correctly excluded, and only 1 was incorrectly excluded, resulting in a sensitivity of 95% and a specificity of 60%. Using ChatGPT-4o-mini, 1066 articles were correctly excluded, but only 4 articles were correctly included, resulting in a sensitivity of 18% and a specificity of 95%. Both pipelines completed the screening process in under 1 hour.
The authors conclude that their LLM pipelines can automate abstract screening for neurosurgical guideline creation, reducing the time needed for abstract screening from several weeks to a few hours. While further validation is required, this process could pave the way for evidence-based guidelines to be continuously updated in real time across medical fields 1).
Critical Review: NLP-Assisted Abstract Screening for Neurosurgical Guidelines
Reference: Nitturi V, Flores A, Bauer DF. *Using Natural Language Processing to Automate Screening of Abstracts for Neurosurgical Guideline Creation*. Neurosurgery. 2025 Apr 14. doi: 10.1227/neu.0000000000003450. PMID: 40227043.
1. Context and Relevance
The exponential growth of neurosurgical literature has made manual abstract screening for systematic reviews and guideline creation increasingly burdensome. The authors propose using large language models (LLMs), specifically Gemini Pro and ChatGPT-4o-mini, to automate this process for a Congress of Neurological Surgeons (CNS) guideline on Chiari I malformations.
- Strength: Addresses a real and growing challenge in evidence-based neurosurgery.
- Limitation: Tested on a single PICO question, limiting generalizability.
2. Methodology
The authors created two NLP pipelines to screen 1,764 abstracts retrieved from EMBASE, of which only 22 had been manually identified as relevant. Each pipeline applied two inclusion/exclusion criteria together with a simplified version of the research question.
- Strengths:
  - Real-world dataset from a recognized guideline.
  - Sensitivity and specificity reported.
  - Comparison between LLMs is informative.
- Limitations:
  - Simplified criteria may not reflect clinical complexity.
  - Lack of prompt engineering transparency.
  - Use of ChatGPT-4o-mini rather than a larger model such as GPT-4 may understate what the ChatGPT family can achieve.
3. Results
| Model | Sensitivity | Specificity | Notes |
|---|---|---|---|
| Gemini Pro | 95% | 60% | High sensitivity (recall), but low specificity. |
| ChatGPT-4o-mini | 18% | 95% | High specificity, but misses most relevant abstracts. |
- Interpretation:
  - Gemini Pro is suitable for early-phase screening, where the priority is not missing key studies.
  - ChatGPT-4o-mini may reduce manual workload but risks underinclusion.
- Concern: ChatGPT's 18% sensitivity is clinically insufficient for guideline formulation.
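As a check on the reported figures, the sketch below recomputes sensitivity and specificity from the counts given in the abstract. The `Confusion` class and variable names are illustrative, and the false-positive counts are derived here by subtraction rather than stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for an include/exclude screening task."""
    tp: int  # relevant abstracts correctly included
    fn: int  # relevant abstracts incorrectly excluded
    tn: int  # irrelevant abstracts correctly excluded
    fp: int  # irrelevant abstracts incorrectly included

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


TOTAL = 1764                   # abstracts returned by the search
RELEVANT = 22                  # manually judged relevant
IRRELEVANT = TOTAL - RELEVANT  # derived: 1742

# Gemini Pro: 1043 correctly excluded, 1 incorrectly excluded (as reported).
gemini = Confusion(tp=RELEVANT - 1, fn=1, tn=1043, fp=IRRELEVANT - 1043)

# ChatGPT-4o-mini: 1066 correctly excluded, 4 correctly included (as reported).
chatgpt = Confusion(tp=4, fn=RELEVANT - 4, tn=1066, fp=IRRELEVANT - 1066)

for name, c in [("Gemini Pro", gemini), ("ChatGPT-4o-mini", chatgpt)]:
    print(f"{name}: sensitivity={c.sensitivity:.1%}, specificity={c.specificity:.1%}")
```

The Gemini Pro counts reproduce the reported 95% and 60%. For ChatGPT-4o-mini, the sensitivity matches (18%), but 1066 correct exclusions out of 1742 irrelevant abstracts gives a specificity of about 61%, not 95%. Either the count or the percentage as transcribed here may be inaccurate, so the figures should be checked against the original paper.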
4. Technical and Practical Impact
- Positive:
  - Screening time reduced from several weeks to under 1 hour per pipeline.
  - Demonstrates feasibility of LLMs in medical workflows.
  - Opens the door to real-time "living" guidelines.
- Limitations:
  - No external validation on other neurosurgical topics.
  - Lacks an interface or integration strategy.
  - LLMs are not yet safe for autonomous use.
5. Overall Assessment
| Criterion | Rating (0-10) |
|---|---|
| Relevance | 9 |
| Methodological Rigor | 6 |
| Innovation | 8 |
| Generalizability | 5 |
| Clinical Applicability | 4 |
| Future Potential | 9 |
| Overall Score | 6.8 / 10 |
6. Conclusion
This work is a promising first step toward automating systematic review processes in neurosurgery. However, current LLM performance, in particular the 18% sensitivity of ChatGPT-4o-mini and the 60% specificity of Gemini Pro, prevents full clinical integration. Broader validation, better prompt design, and user-centric tools are needed before such systems can safely support or replace manual screening in high-stakes guideline creation.
Prompt Library for Neurosurgical Guideline Creation
Purpose
To design and document effective prompts for supporting the drafting, screening, and validation of neurosurgical clinical practice guidelines using large language models (LLMs) such as ChatGPT, Gemini, or Claude.
Prompt Categories
Systematic Review and Abstract Screening
- Prompt 1:
Given the following PICO question and inclusion/exclusion criteria, identify whether each abstract should be included in a systematic review for neurosurgical guidelines. Justify the decision.
- Prompt 2:
Summarize the level of evidence, population size, study type, and outcome relevance for each abstract listed below.
- Prompt 3:
From this batch of abstracts, extract those that meet Level I or Level II evidence for surgical intervention in [condition].
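A minimal sketch of how Prompt 1 could be wired into a screening loop. It assumes the model is asked to answer `INCLUDE` or `EXCLUDE` on the first line, which is a convention added here and not part of the original prompt. `build_screening_prompt`, `parse_decision`, `screen`, and `fake_model` are hypothetical names. No real LLM API is called: `ask_model` is any function that takes a prompt string and returns the model's reply.

```python
from typing import Callable, Dict, List, Optional


def build_screening_prompt(pico: str, criteria: List[str], abstract: str) -> str:
    """Assemble one screening prompt (wording adapted from Prompt 1)."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(criteria, start=1))
    return (
        "Given the following PICO question and inclusion/exclusion criteria, "
        "decide whether the abstract should be included in a systematic "
        "review for neurosurgical guidelines.\n"
        "Reply with INCLUDE or EXCLUDE on the first line, then justify the "
        "decision.\n\n"
        f"PICO question: {pico}\n\n"
        f"Criteria:\n{rules}\n\n"
        f"Abstract:\n{abstract}"
    )


def parse_decision(response: str) -> Optional[bool]:
    """Return True for INCLUDE, False for EXCLUDE, None if unparseable."""
    lines = response.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first.startswith("INCLUDE"):
        return True
    if first.startswith("EXCLUDE"):
        return False
    return None  # route to manual review rather than guessing


def screen(
    abstracts: Dict[str, str],
    pico: str,
    criteria: List[str],
    ask_model: Callable[[str], str],
) -> Dict[str, Optional[bool]]:
    """Screen every abstract and return a decision per abstract ID."""
    return {
        abstract_id: parse_decision(
            ask_model(build_screening_prompt(pico, criteria, text))
        )
        for abstract_id, text in abstracts.items()
    }


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real model call; keyword match only."""
    abstract = prompt.split("Abstract:\n", 1)[1]
    if "decompression" in abstract.lower():
        return "INCLUDE\nMentions the intervention of interest."
    return "EXCLUDE\nDoes not address the intervention."


# Placeholder abstracts and criteria, for illustration only.
decisions = screen(
    {
        "A1": "Outcomes after posterior fossa decompression in children.",
        "A2": "A survey of residency training hours.",
    },
    pico="[PICO question]",
    criteria=["[inclusion criterion]", "[exclusion criterion]"],
    ask_model=fake_model,
)
print(decisions)
```

Returning `None` for replies that do not start with a recognizable verdict keeps ambiguous outputs out of the automated counts and sends them to a human reviewer instead.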
Guideline Drafting / Recommendation Wording
- Prompt 4:
Draft evidence-based recommendations using GRADE methodology based on these summarized studies. Format each as 'We recommend' or 'We suggest'.
- Prompt 5:
Convert these study summaries into formal guideline statements for neurosurgeons, including strength of recommendation and level of evidence.
Critical Appraisal and Meta-Synthesis
- Prompt 6:
Evaluate the methodological quality and bias risk of the following studies using the ROBINS-I or Cochrane Risk of Bias tool.
- Prompt 7:
Compare and synthesize the findings from these RCTs and cohort studies on surgical versus conservative treatment of [condition].
Data Extraction and Evidence Tables
- Prompt 8:
Extract key data fields (author, year, study type, n, intervention, comparator, outcome, results, level of evidence) from the following study abstracts.
- Prompt 9:
Turn this list of abstracts into a tabular summary suitable for inclusion in a neurosurgical guideline evidence matrix.
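The fields named in Prompt 8 map naturally onto a fixed record type, which makes it easier to validate model output before it enters an evidence table. The sketch below is one possible shape: the `EvidenceRow` class and `to_markdown` helper are hypothetical, and the example row contains placeholder values, not data from a real study.

```python
from dataclasses import astuple, dataclass, fields
from typing import List


@dataclass
class EvidenceRow:
    """One row of an evidence table, using the fields listed in Prompt 8."""
    author: str
    year: int
    study_type: str
    n: int
    intervention: str
    comparator: str
    outcome: str
    results: str
    level_of_evidence: str


def to_markdown(rows: List[EvidenceRow]) -> str:
    """Render rows as a pipe table with one column per field."""
    headers = [f.name.replace("_", " ").title() for f in fields(EvidenceRow)]
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        # Escape pipes so free-text cells cannot break the table layout.
        cells = [str(value).replace("|", "\\|") for value in astuple(row)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder values only; not taken from any real study.
example = EvidenceRow(
    author="[Author]",
    year=2000,
    study_type="[study type]",
    n=0,
    intervention="[intervention]",
    comparator="[comparator]",
    outcome="[outcome]",
    results="[results]",
    level_of_evidence="[level]",
)
table = to_markdown([example])
print(table)
```

Constructing an `EvidenceRow` from model output fails immediately if a field is missing, which surfaces incomplete extractions early.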
Clinical Contextualization
- Prompt 10:
Translate these guideline recommendations into clinical practice tips for low-resource settings.
- Prompt 11:
Provide a brief critique of current guideline gaps in [specific neurosurgical area], based on latest evidence.
LLM Testing and Evaluation
- Prompt 12:
You are a GPT-powered assistant tasked with screening abstracts for inclusion in a neurosurgical guideline. Apply these inclusion/exclusion rules and explain each decision in under 40 words.
Notes
- Adapt prompts based on the specific PICO question and target population.
- Consider comparative testing across multiple models (GPT-4, Gemini Pro, Claude).
- Log false positives/negatives for refining instruction clarity.
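The last two notes can be combined into a small error log: compare each model's include/exclude decisions with the manual reference and record which abstracts were misclassified. The sketch below assumes decisions are stored as dictionaries keyed by abstract ID. `log_errors`, the IDs, and the model names are placeholders.

```python
from typing import Dict, List


def log_errors(
    reference: Dict[str, bool], predicted: Dict[str, bool]
) -> Dict[str, List[str]]:
    """List abstract IDs where a model disagrees with the manual decision."""
    false_pos = sorted(
        i for i, include in predicted.items() if include and not reference[i]
    )
    false_neg = sorted(
        i for i, include in predicted.items() if not include and reference[i]
    )
    return {"false_positives": false_pos, "false_negatives": false_neg}


# Toy data: True = include, False = exclude.
manual = {"A1": True, "A2": False, "A3": True, "A4": False}
runs = {
    "model_a": {"A1": True, "A2": True, "A3": True, "A4": False},
    "model_b": {"A1": False, "A2": False, "A3": True, "A4": False},
}

error_log = {name: log_errors(manual, preds) for name, preds in runs.items()}
for name, errors in error_log.items():
    print(name, errors)
```

Reviewing the false negatives for each model alongside the prompt wording is one way to see which inclusion rule the model misread.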
References
- Congress of Neurological Surgeons Methodology
- GRADE Handbook
- PRISMA Guidelines for Systematic Reviews