Topic Relevance Filter
Overview
The topic relevance filter screens manuscripts based on their relevance to user-specified research topics. It calculates a relevance score from 0.0 to 1.0 for each manuscript by analyzing its content against topic descriptions. Manuscripts scoring above the minimum threshold are included; others are filtered out as off-topic.
How It Works
Core Concept
When you specify topics like “machine learning in healthcare applications”, the filter scores each manuscript on three components to determine relevance:
The Three Component Scores
1. Keyword Match Score (default weight: 40%)
Extracts individual meaningful words from your topic descriptions and searches for them in the manuscript.
Example:
- Topic: “machine learning in healthcare applications”
- Extracted keywords: [“machine”, “learning”, “healthcare”, “applications”]
- Searches for these exact words in title, abstract, and keywords
- Excludes stop words like “in”, “the”, “and”
- Score = (matched keywords / total keywords) × 1.5 (capped at 1.0)
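The keyword score can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the tool's actual implementation; the stop-word list and substring matching are assumptions:

```python
# Illustrative sketch of the keyword match score described above.
# The stop-word list and substring matching are assumptions.
STOP_WORDS = {"in", "the", "and", "for", "of", "a", "an", "to"}

def keyword_score(topic: str, manuscript_text: str) -> float:
    # Extract meaningful words from the topic, dropping stop words
    keywords = [w for w in topic.lower().split() if w not in STOP_WORDS]
    if not keywords:
        return 0.0
    text = manuscript_text.lower()
    # A real implementation would tokenize; substring search keeps this short
    matched = sum(1 for kw in keywords if kw in text)
    # (matched / total) * 1.5, capped at 1.0
    return min(1.0, matched / len(keywords) * 1.5)

# All four keywords match -> 4/4 * 1.5 = 1.5, capped at 1.0
print(keyword_score("machine learning in healthcare applications",
                    "Machine learning methods for healthcare applications"))
```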
2. Concept Match Score (default weight: 40%)
Identifies multi-word phrases and concepts, capturing domain-specific terminology.
Example:
- Topic: “deep learning models for clinical decision support”
- Extracted concepts: [“deep learning”, “learning models”, “clinical decision”, “decision support”]
- Searches for complete phrases in the manuscript
- Catches technical terms that have specific meaning when words appear together
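The concept list in the example corresponds to adjacent word pairs (bigrams) that contain no stop words. The sketch below reproduces that list, but whether the real extractor works exactly this way is an assumption:

```python
# Illustrative bigram-based concept extraction; the real phrase-detection
# strategy may differ. Bigrams containing stop words are dropped.
STOP_WORDS = {"in", "the", "and", "for", "of", "a", "an", "to"}

def extract_concepts(topic: str) -> list[str]:
    words = topic.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS]

def concept_score(topic: str, manuscript_text: str) -> float:
    # Assumed formula: plain fraction of concepts found in the text
    concepts = extract_concepts(topic)
    if not concepts:
        return 0.0
    text = manuscript_text.lower()
    return sum(1 for c in concepts if c in text) / len(concepts)

print(extract_concepts("deep learning models for clinical decision support"))
# ['deep learning', 'learning models', 'clinical decision', 'decision support']
```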
3. Field Relevance Score (default weight: 20%)
Examines journal name, research field, and subject categories to determine domain alignment.
Example:
- Papers from “Journal of Medical Artificial Intelligence” get high scores for AI healthcare topics
- Papers from “Agricultural Science Quarterly” get low scores
- Helps filter out papers that coincidentally use similar words but come from unrelated fields
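The journal examples above suggest domain-level matching rather than literal word overlap ("Medical" should count toward a healthcare topic). A purely illustrative sketch using an assumed domain vocabulary:

```python
# Purely illustrative field-relevance heuristic; the domain vocabulary
# and the journal-to-domain mapping are assumptions.
DOMAIN_TERMS = {
    "healthcare": {"medical", "health", "clinical", "medicine", "radiology"},
    "ai": {"artificial intelligence", "machine learning", "neural"},
}

def field_score(journal: str, domains: list[str]) -> float:
    meta = journal.lower()
    # Count how many requested domains the journal metadata covers
    hits = sum(any(term in meta for term in DOMAIN_TERMS[d]) for d in domains)
    return hits / len(domains) if domains else 0.0

print(field_score("Journal of Medical Artificial Intelligence",
                  ["healthcare", "ai"]))  # 1.0 - both domains covered
print(field_score("Agricultural Science Quarterly",
                  ["healthcare", "ai"]))  # 0.0 - neither domain covered
```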
Score Calculation
overall_score = (keyword_score × keyword_match) + (concept_score × concept_match) + (field_score × field_relevance)
where the three weights are the score_weights values from your configuration (defaults: 0.4, 0.4, 0.2).
Example Scoring:
Manuscript: “A deep learning approach for medical diagnosis using neural networks”
- Keyword matches: “deep”, “learning”, “medical”, “diagnosis” → Score: 0.8
- Concept matches: “deep learning”, “medical diagnosis” → Score: 0.7
- Field: Published in “IEEE Transactions on Medical Imaging” → Score: 0.9
- Overall score = (0.8 × 0.4) + (0.7 × 0.4) + (0.9 × 0.2) = 0.78
With min_score = 0.5, this manuscript would be included.
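A short sketch that reproduces the worked example, with weight names matching the score_weights config keys:

```python
def overall_score(keyword: float, concept: float, field: float,
                  weights: dict[str, float]) -> float:
    # Weighted sum of the three component scores
    return (keyword * weights["keyword_match"]
            + concept * weights["concept_match"]
            + field * weights["field_relevance"])

weights = {"keyword_match": 0.4, "concept_match": 0.4, "field_relevance": 0.2}
score = overall_score(0.8, 0.7, 0.9, weights)
print(round(score, 2))   # 0.78
print(score >= 0.5)      # True -> included with min_score = 0.5
```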
Configuration
Basic Configuration
[filters.topic_relevance]
enabled = true
use_ai = false # Use rule-based scoring
min_score = 0.5 # Minimum score threshold (0.0-1.0)
# Define your research topics
topics = [
"machine learning applications in healthcare",
"artificial intelligence for medical diagnosis",
"deep learning models for clinical decision support",
"natural language processing for electronic health records"
]
# Scoring component weights
[filters.topic_relevance.score_weights]
keyword_match = 0.4 # Weight for keyword matching
concept_match = 0.4 # Weight for concept matching
field_relevance = 0.2 # Weight for field/journal relevance
Configuration Parameters
Parameter | Type | Default | Description |
---|---|---|---|
enabled | boolean | false | Enable/disable the filter |
use_ai | boolean | false | Use AI for semantic understanding (requires LLM config) |
topics | array | [] | List of topic descriptions |
min_score | float | 0.5 | Minimum relevance score (0.0-1.0) |
score_weights.keyword_match | float | 0.4 | Weight for keyword matching |
score_weights.concept_match | float | 0.4 | Weight for concept matching |
score_weights.field_relevance | float | 0.2 | Weight for field relevance |
Weight Adjustment Strategies
Technical/Specific Search
Increase concept_match weight for technical phrase emphasis:
[filters.topic_relevance.score_weights]
keyword_match = 0.3
concept_match = 0.5 # Higher weight for technical phrases
field_relevance = 0.2
Broad Topic Search
Increase keyword_match weight for individual term focus:
[filters.topic_relevance.score_weights]
keyword_match = 0.5 # Focus on individual terms
concept_match = 0.3
field_relevance = 0.2
Domain-Specific Search
Increase field_relevance weight for journal/field emphasis:
[filters.topic_relevance.score_weights]
keyword_match = 0.3
concept_match = 0.3
field_relevance = 0.4 # Emphasize relevant journals
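To see how the presets shift results, here are the component scores from the earlier worked example (0.8 keyword, 0.7 concept, 0.9 field) recombined under each profile:

```python
# Recomputing the worked example's overall score under each preset
profiles = {
    "technical": {"keyword_match": 0.3, "concept_match": 0.5, "field_relevance": 0.2},
    "broad":     {"keyword_match": 0.5, "concept_match": 0.3, "field_relevance": 0.2},
    "domain":    {"keyword_match": 0.3, "concept_match": 0.3, "field_relevance": 0.4},
}
for name, w in profiles.items():
    s = (0.8 * w["keyword_match"] + 0.7 * w["concept_match"]
         + 0.9 * w["field_relevance"])
    print(f"{name}: {s:.2f}")
# technical: 0.77, broad: 0.79, domain: 0.81
```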
Writing Effective Topics
Good Topics - Detailed and Specific
topics = [
"machine learning algorithms for predicting patient readmission rates",
"deep learning models for medical image segmentation in radiology",
"natural language processing for extracting clinical information from doctor notes"
]
Poor Topics - Too Generic
topics = [
"AI",
"machine learning",
"healthcare"
]
Output Fields
The filter adds these fields to each manuscript record:
Field | Type | Description |
---|---|---|
topic_relevance_score | float | Overall relevance score (0.0-1.0) |
topic_relevance_confidence | float | Confidence in the assessment |
matched_keywords | array | List of keywords that matched |
matched_concepts | array | List of concepts that matched |
exclusion_reason | string | Explanation if manuscript is excluded |
component_scores | object | Individual scores for keyword, concept, and field relevance (AI mode) |
reasoning | string | AI’s explanation of relevance decision (AI mode only) |
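For illustration, a record that passed rule-based filtering might look like this (field values are hypothetical, not output from a real run):

```python
# Hypothetical manuscript record after rule-based filtering
record = {
    "title": "A deep learning approach for medical diagnosis using neural networks",
    "topic_relevance_score": 0.78,
    "topic_relevance_confidence": 0.9,
    "matched_keywords": ["deep", "learning", "medical", "diagnosis"],
    "matched_concepts": ["deep learning", "medical diagnosis"],
    "exclusion_reason": None,  # set only when the manuscript is excluded
}
```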
Score Interpretation
Score Range | Interpretation | Typical Action |
---|---|---|
0.0 - 0.3 | Low relevance | Usually exclude |
0.3 - 0.5 | Moderate relevance | Review threshold setting |
0.5 - 0.7 | Good relevance | Usually include |
0.7 - 1.0 | High relevance | Definitely include |
Example Use Cases
Focused Literature Review
[filters.topic_relevance]
enabled = true
use_ai = false
topics = [
"machine learning for depression detection from social media",
"natural language processing for suicide risk assessment",
"AI-powered chatbots for mental health support",
"predictive models for psychiatric treatment outcomes"
]
min_score = 0.6 # Require strong relevance
Broad Technology Survey
[filters.topic_relevance]
enabled = true
use_ai = false
topics = [
"artificial intelligence in medical imaging and diagnostics",
"machine learning for drug discovery and development",
"AI-assisted surgical planning and navigation",
"predictive analytics for population health management",
"deep learning for genomics and precision medicine"
]
min_score = 0.4 # Lower threshold for broader coverage
Methodology-Focused Search
[filters.topic_relevance]
enabled = true
use_ai = false
topics = [
"transformer models for clinical text analysis",
"graph neural networks for drug-drug interaction prediction",
"reinforcement learning for treatment recommendation systems",
"federated learning for privacy-preserving medical AI"
]
min_score = 0.5
[filters.topic_relevance.score_weights]
keyword_match = 0.5 # Higher weight on technical terms
concept_match = 0.3
field_relevance = 0.2
AI-Enhanced Mode (Optional)
When use_ai = true, the filter uses Large Language Models for deep semantic understanding of topic relevance, going beyond keyword matching to understand conceptual relationships and research context.
AI Configuration
[filters.topic_relevance]
enabled = true
use_ai = true
min_score = 0.6 # Often higher threshold with AI
# LLM configuration required
[[filters.llm]]
provider = "OpenAI"
api_key = "" # Uses environment variable if empty
model = "gpt-4o-mini"
temperature = 0.01
AI Capabilities
The AI-powered mode provides:
- Semantic Understanding: Recognizes conceptual relationships beyond literal word matches
- Context Awareness: Understands research methodology alignment with topics
- Synonym Recognition: Identifies related terms and domain-specific vocabulary
- Interdisciplinary Connections: Finds relevance across different fields
- Research Question Alignment: Evaluates if research objectives match topic interests
AI Prompt Details
The prompt asks the AI to evaluate each manuscript against your topics on several dimensions:
Evaluation Criteria:
- Direct keyword matches with the topics
- Conceptual alignment with the research areas
- Field/domain relevance
- Methodological relevance
- Research questions and objectives alignment
Structured Response: The AI returns detailed scoring with explanations:
{
"overall_score": 0.75, // Relevance score from 0.0 to 1.0
"component_scores": {
"keyword_match": 0.8, // Direct keyword alignment
"concept_match": 0.7, // Conceptual relationship strength
"field_relevance": 0.75 // Domain/field alignment
},
"matched_keywords": ["machine learning", "healthcare"],
"matched_concepts": ["predictive modeling", "clinical decision support"],
"confidence": 0.85, // AI's confidence in assessment
"is_relevant": true, // Boolean relevance decision
"reasoning": "Strong alignment with AI healthcare topics through predictive modeling approach"
}
Graceful Fallback
The system automatically falls back to rule-based scoring when:
- No LLM models are configured
- API calls fail or timeout
- Response parsing errors occur
- Invalid JSON response from AI
This ensures the filter always produces results, never blocking the screening pipeline.
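A minimal sketch of this fallback pattern, assuming hypothetical ai_score and rule_based_score helpers (stubbed here so the example runs):

```python
import json

def ai_score(manuscript: dict, topics: list[str]) -> str:
    """Stub for the LLM call; simulates an API failure."""
    raise TimeoutError("simulated API timeout")

def rule_based_score(manuscript: dict, topics: list[str]) -> float:
    """Stub for the rule-based scorer described earlier."""
    return 0.5

def relevance_score(manuscript: dict, topics: list[str]) -> float:
    """Try AI scoring first; fall back to rule-based scoring on any failure."""
    try:
        raw = ai_score(manuscript, topics)
        return json.loads(raw)["overall_score"]  # raises on invalid JSON
    except Exception:
        # Covers: no model configured, API errors/timeouts, parse failures
        return rule_based_score(manuscript, topics)

print(relevance_score({"title": "..."}, ["machine learning in healthcare"]))  # 0.5
```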
AI Mode Examples
Interdisciplinary Research Detection
AI mode excels at identifying relevant interdisciplinary work:
topics = [
"machine learning for climate change mitigation",
"AI-driven renewable energy optimization"
]
A paper on “Neural Network Control Systems for Wind Turbine Efficiency” would be recognized as relevant through conceptual understanding, even without exact keyword matches.
Methodology-Based Relevance
AI understands methodological alignment:
topics = [
"deep learning approaches for medical imaging",
"convolutional neural networks in radiology"
]
Papers applying CNNs to any medical imaging task would score high, regardless of the specific medical terminology used.
Benefits of AI Mode
- Reduced False Negatives: Catches relevant papers that keyword matching might miss
- Contextual Understanding: Evaluates research questions and objectives, not just terms
- Flexible Topic Interpretation: Natural language topic descriptions work effectively
- Confidence Scoring: Provides reliability metric for each assessment
- Detailed Reasoning: Explains why manuscripts are considered relevant or not
Performance Considerations
Rule-Based Mode
- Speed: Very fast, processes hundreds of manuscripts per second
- Accuracy: Good for clear topic matches with direct keyword alignment
- Cost: Free, no API calls required
- Best for: Well-defined topics with clear terminology
AI-Enhanced Mode
- Speed: Limited by API rate limits (typically 10-60 requests per minute)
- Accuracy: Superior semantic understanding and context awareness
- Cost: API costs apply (approximately $0.001-0.005 per manuscript)
- Best for:
- Interdisciplinary research topics
- Conceptual or theoretical topics
- Emerging research areas with evolving terminology
- High-precision screening requirements
Optimization Tips
- Use AI Selectively: Apply AI mode after initial filtering to reduce costs
- Batch Processing: The filter supports efficient batch processing
- Rate Limit Configuration: Adjust TPM/RPM limits based on your API tier
- Hybrid Approach: Use rule-based for initial screening, AI for borderline cases
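A sketch of the hybrid approach in tip 4. The borderline band and both scorers are illustrative assumptions:

```python
def rule_based_score(manuscript: dict, topics: list[str]) -> float:
    """Stub for the fast rule-based scorer."""
    return 0.55  # pretend this manuscript lands in the borderline band

def ai_relevance_score(manuscript: dict, topics: list[str]) -> float:
    """Stub for the slower, costlier AI scorer."""
    return 0.8

def hybrid_score(manuscript: dict, topics: list[str],
                 low: float = 0.3, high: float = 0.7) -> float:
    """Rule-based first; spend an AI call only on borderline scores."""
    score = rule_based_score(manuscript, topics)
    if low <= score <= high:  # illustrative borderline band
        score = ai_relevance_score(manuscript, topics)
    return score

print(hybrid_score({"title": "..."}, ["machine learning in healthcare"]))  # 0.8
```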
Filter Order
Topic relevance is applied after:
- Deduplication
- Language detection
- Article type classification
This ensures efficient processing by removing duplicates and non-target language papers first.