Screening Tool
Page Contents
- Purpose and Capabilities: what the Screening tool does and why it's essential
- Usage Methods: how to use the tool across different platforms and programming languages
- Configuration File Structure: detailed explanation of the TOML configuration
- Screening Filters: available filters and their options
- Input and Output Formats: supported file formats and data structures
- Best Practices: recommendations for effective screening
- Workflow Integration: how the Screening tool fits into your systematic review process
- Troubleshooting: solutions to common issues
Purpose and Capabilities
The prismAId Screening tool automates the filtering phase of systematic literature reviews by identifying and tagging manuscripts for potential exclusion. This critical step occurs after the initial literature search but before downloading full texts, helping researchers focus on relevant literature by:
- Deduplication: Identifies and removes duplicate manuscripts using various matching algorithms
- Language Filtering: Detects manuscript language and filters based on accepted languages
- Article Type Classification: Identifies article types (research articles, reviews, editorials, etc.) for selective inclusion/exclusion
- Topic Relevance: Scores manuscripts for relevance to specified research topics to flag non-relevant (off-topic) manuscripts
- Batch Processing: Efficiently processes large volumes of manuscripts with minimal manual intervention
- Transparent Tagging: Provides clear reasons for exclusions and maintains complete audit trails
- AI-Assisted Analysis: Optional integration with LLMs for enhanced classification accuracy
The Screening tool bridges the gap between literature search and paper acquisition, ensuring that only relevant, unique manuscripts are downloaded and proceed to the full review phase.

Usage Methods
The Screening tool can be accessed through multiple interfaces to accommodate different workflows:
Binary (Command Line)
# Run screening with a TOML configuration file
./prismaid -screening screening_config.toml
Go Package
import "github.com/open-and-sustainable/prismaid"
// Run screening with a TOML configuration string
tomlConfig := "..." // Your TOML configuration as a string
err := prismaid.Screening(tomlConfig)
Python Package
import prismaid
# Run screening with a TOML configuration file
with open("screening_config.toml", "r") as file:
    toml_config = file.read()
prismaid.screening(toml_config)
R Package
library(prismaid)
# Run screening with a TOML configuration file
toml_content <- paste(readLines("screening_config.toml"), collapse = "\n")
Screening(toml_content)
Julia Package
using PrismAId
# Run screening with a TOML configuration file
toml_config = read("screening_config.toml", String)
PrismAId.screening(toml_config)
Configuration File Structure
The Screening tool is driven by a TOML configuration file that defines all aspects of your screening process. Here’s the complete structure:
Project Section
[project]
name = "Manuscript Screening Example" # Project title
author = "John Doe" # Project author
version = "1.0" # Configuration version
input_file = "/path/to/manuscripts.csv" # Input CSV or TXT file
output_file = "/path/to/results" # Output path (without extension)
text_column = "abstract" # Column with text/file paths
identifier_column = "doi" # Column with unique IDs
output_format = "csv" # "csv" or "json"
log_level = "medium" # "low", "medium", or "high"
Filters Section
The filters section controls which screening criteria to apply:
[filters]
[filters.deduplication]
enabled = true
use_ai = false # Use AI for similarity detection
compare_fields = ["title", "abstract", "doi", "authors", "year"] # Fields to compare for duplication
[filters.language]
enabled = true
accepted_languages = ["en", "es", "fr"] # ISO language codes
use_ai = false # Use AI for detection (recommended for better accuracy)
[filters.article_type]
enabled = true
use_ai = false # Use AI for classification (requires LLM config)
# Traditional publication type exclusions
exclude_reviews = true # Exclude all review types (review, systematic_review, meta_analysis)
exclude_editorials = true # Exclude editorials
exclude_letters = true # Exclude letters to editor
exclude_case_reports = false # Exclude case reports
exclude_commentary = false # Exclude commentary articles
exclude_perspectives = false # Exclude perspective articles
# Methodological type exclusions (can overlap with publication types)
exclude_theoretical = false # Exclude theoretical/conceptual papers
exclude_empirical = false # Exclude empirical studies with data
exclude_methods = false # Exclude methods/methodology papers
# Study scope exclusions (applies to empirical studies)
exclude_single_case = false # Exclude single case studies (n=1, individual cases)
exclude_sample = false # Exclude sample studies (cohorts, cross-sectional, multiple subjects)
include_types = [] # If specified, ONLY include these types
# Available types: "research_article", "review", "systematic_review",
# "meta_analysis", "editorial", "letter", "case_report", "commentary",
# "perspective", "empirical_study", "theoretical_paper", "methods_paper",
# "single_case_study", "sample_study"
[filters.topic_relevance]
enabled = false # Enable topic relevance filtering
use_ai = false # Use AI for semantic relevance scoring
topics = [] # List of topic descriptions
# Example: ["machine learning in healthcare",
# "artificial intelligence for medical diagnosis"]
min_score = 0.5 # Minimum relevance score (0.0-1.0)
[filters.topic_relevance.score_weights]
keyword_match = 0.4 # Weight for keyword matching
concept_match = 0.4 # Weight for concept/phrase matching
field_relevance = 0.2 # Weight for journal/field relevance
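The three weights are combined into a single relevance score that is compared against min_score. The snippet below illustrates only that weighted-sum idea; the component scores are hypothetical placeholders, not the tool's internal algorithm.
# Illustrative only: how weighted component scores combine into a relevance score.
# The component values below are hypothetical; prismAId computes them internally.
weights = {"keyword_match": 0.4, "concept_match": 0.4, "field_relevance": 0.2}
components = {"keyword_match": 0.7, "concept_match": 0.5, "field_relevance": 0.3}

relevance = sum(weights[name] * components[name] for name in weights)  # 0.28 + 0.20 + 0.06 = 0.54

min_score = 0.5
include = relevance >= min_score
print(f"relevance={relevance:.2f}, include={include}")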
LLM Configuration (Optional)
For AI-assisted screening:
[filters.llm.1]
provider = "OpenAI" # AI provider
api_key = "" # API key (uses env if empty)
model = "gpt-4o-mini" # Model name
temperature = 0.01 # Model temperature
tpm_limit = 0 # Tokens per minute limit
rpm_limit = 0 # Requests per minute limit
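When api_key is left empty, the key is read from the provider's environment variable (for OpenAI, OPENAI_API_KEY, as noted in Example 2 below). A minimal sketch, using the Python package shown under Usage Methods, of supplying the variable before running; the key value is a placeholder and would normally be set in the shell.
# Minimal sketch: supply the provider key via the environment instead of the TOML file.
# Assumes the Python package shown in "Usage Methods"; the key value is a placeholder.
import os
import prismaid

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; normally set outside the script

with open("screening_config.toml", "r") as file:
    toml_config = file.read()

prismaid.screening(toml_config)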
Screening Filters
The screening tool includes four main filters that can be applied in sequence:
- Deduplication Filter - Identifies and removes duplicate manuscripts
- Language Detection Filter - Filters manuscripts by language
- Article Type Classification Filter - Classifies and filters by publication type
- Topic Relevance Filter - Scores manuscripts based on topic relevance
Each filter has detailed documentation available through the links above. Below is a brief overview of each filter’s capabilities.
Deduplication Filter
Identifies duplicate manuscripts using intelligent field comparison or AI-assisted semantic matching. See full documentation.
Language Detection Filter
Identifies manuscript language and filters based on accepted languages using rule-based pattern matching or AI-assisted semantic detection. See full documentation.
Article Type Classification Filter
Classifies manuscripts into multiple overlapping categories (traditional types, methodological types, and study scope). A single manuscript can belong to several types simultaneously. See full documentation.
Topic Relevance Filter
Scores manuscripts based on their relevance to user-specified research topics using keyword matching, concept matching, and field relevance analysis. See full documentation.
Filter Interaction and Processing Order
The screening tool applies filters sequentially, which optimizes performance and ensures clear exclusion tracking:
Processing Pipeline
Input Manuscripts List (CSV, TXT)
↓
[1] DEDUPLICATION FILTER
├─ Identifies duplicates
├─ Marks with: tag_is_duplicate=true
└─ Sets: include=false, exclusion_reason="Duplicate of [ID]"
↓
[2] LANGUAGE FILTER
├─ Skips already excluded records
├─ Detects language (title priority)
└─ Excludes non-accepted languages
↓
[3] ARTICLE TYPE FILTER
├─ Skips already excluded records
├─ Classifies article types
└─ Excludes specified types
↓
[4] TOPIC RELEVANCE FILTER
├─ Skips already excluded records
├─ Calculates relevance score (0.0-1.0)
└─ Excludes below minimum threshold
↓
Final Output List (CSV)
Key Principles
- Sequential Processing: Filters are applied in order: Deduplication → Language → Article Type → Topic Relevance
- Exclusion Preservation: Once excluded, a manuscript is not reprocessed by subsequent filters
- Single Exclusion Reason: Each manuscript shows only the first reason for exclusion
- Performance Optimization: Skipping excluded records reduces API calls and processing time
- Tag Accumulation: Included manuscripts may have tags from multiple filters
Example Filter Interaction
Given this configuration:
[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["doi", "title"]
[filters.language]
enabled = true
accepted_languages = ["en"]
use_ai = false
[filters.article_type]
enabled = true
exclude_editorials = true
exclude_theoretical = true # Focus on empirical work only
Processing flow for a duplicate Spanish editorial:
- Deduplication: Marked as duplicate → excluded (exclusion_reason: “Duplicate of 123”)
- Language: Skipped (already excluded) → no language detection performed
- Article Type: Skipped (already excluded) → no type classification performed
Result: Single exclusion reason preserved, no unnecessary processing.
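The skip-if-excluded behavior can be pictured with a short sketch. This is illustrative only, not the tool's internal code; the filter functions and record fields are hypothetical.
# Illustrative sketch of sequential filtering with exclusion preservation.
# Each filter returns an exclusion reason or None; records are plain dicts.
def apply_filters(records, filters):
    for record in records:
        record.setdefault("include", True)
        for _, filter_fn in filters:
            if not record["include"]:
                break  # already excluded: later filters are skipped entirely
            reason = filter_fn(record)
            if reason is not None:
                record["include"] = False
                record["exclusion_reason"] = reason  # first (and only) reason kept
    return records

# Hypothetical filter functions, in the documented order:
filters = [
    ("deduplication", lambda r: "Duplicate of 123" if r.get("is_duplicate") else None),
    ("language",      lambda r: None if r.get("lang") == "en" else f"Language not accepted: {r.get('lang')}"),
    ("article_type",  lambda r: "Excluded type: editorial" if r.get("type") == "editorial" else None),
]

records = [{"id": "10.5678", "is_duplicate": True, "lang": "es", "type": "editorial"}]
print(apply_filters(records, filters))
# The duplicate Spanish editorial is excluded once, with only the deduplication reason.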
Output Format
The screening tool saves results with comprehensive information about each manuscript and the applied filters:
CSV Output Structure
The CSV output includes the following column types:
- Original Data Columns: All columns from the input file are preserved
- Tag Columns: Added with prefix tag_ containing filter results:
  - tag_is_duplicate: true if duplicate, false or empty otherwise
  - tag_duplicate_of: ID of the original record if duplicate
  - tag_detected_language: Primary language detected (prioritizes title)
  - tag_title_language: Language detected in title (when non-AI mode)
  - tag_abstract_language: Language detected in abstract (when non-AI mode)
  - tag_article_type: Classified article type (e.g., research_article, empirical_study, single_case_study)
- Status Columns:
  - include: true for included records, false for excluded
  - exclusion_reason: Explanation for exclusion (e.g., "Duplicate of 123", "Language not accepted: fr")
Filter Processing Order
Filters are applied sequentially, and excluded records are not reprocessed:
- Deduplication: Marks duplicates, sets include=false with a reason
- Language: Skips already excluded records, processes only included ones
- Article Type: Skips already excluded records, processes only included ones
This ensures:
- Exclusion reasons are preserved from the first filter that excludes a record
- Processing efficiency by not running unnecessary filters on excluded records
- Clear traceability of why each record was excluded
Language Detection Priority
When using non-AI language detection:
- Title language takes priority over abstract language
- Many journals translate abstracts to English while keeping original titles
- Both title_language and abstract_language tags are saved for transparency
- The final detected_language uses the title language when available and valid (see the sketch below)
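A toy sketch of the title-priority rule (illustrative only, not the tool's internal detector):
# Illustrative sketch of the title-priority rule.
def resolve_language(title_language, abstract_language):
    # Title language wins when it was detected; otherwise fall back to the abstract.
    return title_language or abstract_language

print(resolve_language("fr", "en"))  # -> "fr": original title kept, abstract translated
print(resolve_language(None, "en"))  # -> "en": no title detection, abstract used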
Example CSV Output
title,abstract,doi,tag_is_duplicate,tag_duplicate_of,tag_detected_language,tag_title_language,tag_abstract_language,include,exclusion_reason
"Climate Study","Research on climate...","10.1234",false,,en,en,en,true,
"Climate Study","Research on climate...","10.1234",true,1,,,,,false,"Duplicate of 1"
"Étude climatique","Cette recherche...","10.5678",false,,fr,fr,fr,false,"Language not accepted: fr"
Practical Examples
Example 1: Basic English-Only Screening
Scenario: Screen manuscripts keeping only English research articles, removing duplicates.
[project]
name = "English Literature Review"
author = "Research Team"
version = "1.0"
input_file = "./manuscripts.csv"
output_file = "./screened_results"
text_column = "abstract"
identifier_column = "id"
output_format = "csv"
log_level = "medium"
[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["doi", "title", "authors"]
[filters.language]
enabled = true
accepted_languages = ["en"]
use_ai = false
[filters.article_type]
enabled = true
use_ai = false # Using rule-based classification
exclude_reviews = false # Keep reviews for literature review
exclude_editorials = true
exclude_letters = true
exclude_case_reports = false
exclude_commentary = false
exclude_perspectives = false
exclude_theoretical = false
exclude_empirical = false
exclude_methods = false
exclude_single_case = true # Focus on studies with multiple subjects
exclude_sample = false
Example 2: Multi-Language Screening with AI
Scenario: Accept manuscripts in English, Spanish, and Portuguese, using AI for accurate detection.
[project]
name = "Latin American Climate Research"
input_file = "./la_climate_papers.csv"
output_file = "./filtered_papers"
[filters.deduplication]
enabled = true
use_ai = true # AI helps with author name variations
compare_fields = ["title", "authors", "year"]
[filters.language]
enabled = true
accepted_languages = ["en", "es", "pt"]
use_ai = true # Better for regional language variants
[filters.article_type]
enabled = true
use_ai = true # AI classification for better accuracy
exclude_reviews = false
exclude_editorials = true
[filters.llm.1]
provider = "OpenAI"
api_key = "" # Uses OPENAI_API_KEY env variable
model = "gpt-4o-mini"
temperature = 0.01
Example 3: Strict Deduplication for Systematic Review
Scenario: Aggressive deduplication for systematic review, accepting only primary research articles.
[project]
name = "Systematic Review Screening"
log_level = "high" # Detailed logging for audit trail
[filters.deduplication]
enabled = true
use_ai = false # Faster for large datasets
compare_fields = ["doi", "title", "authors", "year", "abstract"]
[filters.language]
enabled = true
accepted_languages = ["en"]
use_ai = false
[filters.article_type]
enabled = true
use_ai = false # Using rule-based classification
exclude_reviews = true # No reviews (includes systematic reviews and meta-analyses)
exclude_editorials = true # No editorials
exclude_letters = true # No letters
exclude_case_reports = true # No case reports
exclude_commentary = true # No commentary
exclude_perspectives = true # No perspectives
exclude_theoretical = true # Only empirical work
exclude_empirical = false # Keep empirical studies
exclude_methods = false # Keep methods papers
exclude_single_case = true # Only studies with samples
exclude_sample = false # Keep sample studies
include_types = ["empirical_study", "sample_study"] # Focus on empirical research with samples
Example 3b: Same Screening with AI Classification
Scenario: Same requirements but using AI for more accurate article type classification.
[project]
name = "Systematic Review Screening with AI"
log_level = "high"
[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["doi", "title", "authors", "year", "abstract"]
[filters.language]
enabled = true
accepted_languages = ["en"]
use_ai = true # AI for better language detection
[filters.article_type]
enabled = true
use_ai = true # AI for comprehensive type classification
exclude_reviews = true
exclude_editorials = true
exclude_letters = true
exclude_case_reports = true
exclude_commentary = true
exclude_perspectives = true
exclude_theoretical = true
exclude_single_case = true
include_types = ["empirical_study", "sample_study"]
[filters.llm.1]
provider = "OpenAI"
api_key = ""
model = "gpt-4o-mini"
temperature = 0.01
Example 4: Minimal Filtering for Broad Inclusion
Scenario: Keep most manuscripts, only remove obvious duplicates.
[project]
name = "Broad Literature Search"
[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["doi"] # Only exact DOI matches
[filters.language]
enabled = false # Accept all languages
[filters.article_type]
enabled = false # Accept all article types
Classification uses multiple indicators (a minimal sketch follows this list):
- Keywords and phrases
- Document structure
- Section headings
- Statistical content
- Length analysis
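A toy illustration of how keyword indicators could feed a rule-based classification. The keyword lists and fallback below are hypothetical; the actual classifier combines more signals (structure, headings, statistical content, length).
# Toy illustration of keyword-based article type indicators (hypothetical rules).
TYPE_KEYWORDS = {
    "review": ["systematic review", "meta-analysis", "literature search strategy"],
    "editorial": ["editorial", "in this issue"],
    "case_report": ["case report", "we report a case"],
    "empirical_study": ["participants", "we collected data", "results show"],
}

def classify(text):
    text = text.lower()
    matches = {t for t, keywords in TYPE_KEYWORDS.items()
               if any(k in text for k in keywords)}
    return matches or {"research_article"}  # fallback when no indicator fires

print(classify("We report a case of a 54-year-old patient..."))  # {'case_report'}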
Input and Output Formats
Input Formats
CSV Format
doi,title,abstract,full_text_path
10.1234/example1,"Study Title 1","Abstract text...","./texts/paper1.txt"
10.1234/example2,"Study Title 2","Abstract text...","./texts/paper2.txt"
TSV/TXT Format
doi	title	abstract	full_text_path
10.1234/example1	Study Title 1	Abstract text...	./texts/paper1.txt
10.1234/example2	Study Title 2	Abstract text...	./texts/paper2.txt
Output Formats
CSV Output
Includes original columns plus:
- tag_is_duplicate: Boolean indicating duplication
- tag_duplicate_of: ID of original if duplicate
- tag_detected_language: Detected language code
- tag_article_type: Classified article type
- include: Boolean for inclusion/exclusion
- exclusion_reason: Reason if excluded
JSON Output
{
"total_records": 100,
"included_records": 75,
"excluded_records": 25,
"records": [
{
"id": "10.1234/example1",
"original_data": {...},
"tags": {
"is_duplicate": false,
"detected_language": "en",
"article_type": "research_article"
},
"include": true
}
],
"statistics": {
"duplicates_found": 10,
"language_excluded": 8,
"article_type_excluded": 7
}
}
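A minimal sketch of summarizing the JSON output with the Python standard library; the file name is an assumption, and the field names follow the example above.
# Minimal sketch: summarize the JSON screening output.
import json

with open("screened_results.json", encoding="utf-8") as fh:
    results = json.load(fh)

print(f"Total: {results['total_records']}, "
      f"included: {results['included_records']}, "
      f"excluded: {results['excluded_records']}")

for reason, count in results["statistics"].items():
    print(f"  {reason}: {count}")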
Best Practices
Data Preparation
- Ensure consistent formatting: Clean data before screening
- Include key fields: Title, abstract, and identifiers at minimum
- Use unique identifiers: DOIs, PMIDs, or custom IDs
- Verify file paths: If using external text files, ensure paths are correct
Filter Configuration
- Start conservative: Begin with high thresholds and adjust as needed
- Order matters: Filters apply sequentially (dedup → language → type)
- Test on subset: Run on a small sample first to verify settings
- Document decisions: Keep notes on why certain filters were chosen
Performance Optimization
- Batch processing: Process large datasets in chunks if needed (see the sketch after this list)
- Local text files: Store full text locally when possible
- API limits: Configure rate limits to avoid API throttling
- Incremental screening: Save progress and resume if interrupted
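A minimal sketch of chunking a large input CSV so that each part can be screened separately; the chunk size and file names are arbitrary choices, not prismAId defaults.
# Minimal sketch: split a large input CSV into chunks that can be screened separately.
# Chunk size and file names are arbitrary; each chunk keeps the original header row.
import csv

CHUNK_SIZE = 5000

with open("manuscripts.csv", newline="", encoding="utf-8") as fh:
    reader = csv.reader(fh)
    header = next(reader)
    rows = list(reader)

for i in range(0, len(rows), CHUNK_SIZE):
    chunk_path = f"manuscripts_part{i // CHUNK_SIZE + 1}.csv"
    with open(chunk_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows[i:i + CHUNK_SIZE])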
Quality Assurance
- Review exclusions: Manually check a sample of excluded items
- Adjust thresholds: Fine-tune based on false positives/negatives
- Multiple passes: Consider running with different settings
- Keep originals: Always maintain unfiltered backup
Workflow Integration
The Screening tool fits into the systematic review workflow:
1. Literature Search
↓
2. Export Results (CSV/TSV)
↓
3. **SCREENING TOOL**
- Deduplication
- Language filtering
- Type classification
↓
4. Manual Review (reduced set)
↓
5. Download Tool (acquire selected papers)
↓
6. Convert Tool (PDF/DOCX/HTML to text)
↓
7. Review Tool (extract information)
Integration with Other prismAId Tools
- After Literature Search: Screen search results before downloading
- Before Download Tool: Filter to reduce papers to acquire
- Before Convert Tool: Only selected papers need conversion
- Before Review Tool: Ensure only relevant papers are reviewed
Example Workflow
# 1. Export search results to CSV
# (from PubMed, Web of Science, etc.)
# 2. Run screening on search results
./prismaid -screening screening_config.toml
# 3. Download only included papers
# (use filtered list from screening output)
./prismaid -download-URL filtered_urls.txt
# 4. Convert downloaded papers to text
./prismaid -convert-pdf ./papers
# 5. Run review on converted texts
./prismaid -project review_config.toml
Troubleshooting
Common Issues and Solutions
Issue: High false positive rate in deduplication
Solution:
- Increase similarity threshold (e.g., from 0.85 to 0.95)
- Use more specific comparison fields
- Switch from fuzzy to exact matching for structured data
Issue: Language detection errors
Solution:
- Enable AI-based detection for mixed-language documents
- Check text encoding (UTF-8 recommended)
- Ensure sufficient text sample (at least 100 characters)
Issue: Incorrect article type classification
Solution:
- Review classification rules and indicators
- Use AI-based classification for ambiguous cases
- Manually tag a training set for validation
Issue: Memory issues with large datasets
Solution:
- Process in smaller batches
- Use file paths instead of embedding full text
- Increase system memory allocation
Issue: API rate limits exceeded
Solution:
- Configure tpm_limit and rpm_limit in LLM settings
- Use multiple API keys with round-robin
- Implement exponential backoff
Error Messages
“text_column ‘X’ not found in CSV”
- Verify column name matches exactly (case-sensitive)
- Check for extra spaces in column headers
“at least one filter must be enabled”
- Enable at least one screening filter in configuration
“Could not read file X”
- Verify file paths are relative to current directory
- Check file permissions
Performance Tips
- For speed: Use exact matching and rule-based methods
- For accuracy: Use fuzzy/semantic matching and AI assistance
- For large datasets: Use file paths instead of inline text
- For reproducibility: Save configuration files with results
Advanced Features
Custom Field Mapping
Map non-standard column names:
[project]
text_column = "manuscript_abstract" # Your column name
identifier_column = "paper_id" # Your ID column
Multi-Language Projects
Accept multiple languages:
[filters.language]
accepted_languages = ["en", "es", "pt", "fr", "it"]
Ensemble AI Screening
Use multiple models for consensus:
[filters.llm.1]
provider = "OpenAI"
model = "gpt-4o-mini"
[filters.llm.2]
provider = "GoogleAI"
model = "gemini-1.5-flash"
Detailed Logging
High verbosity for debugging:
[project]
log_level = "high" # Saves detailed log file
For more information on systematic review workflows, see the Review Support documentation.