Language Detection Filter
Overview
The language detection filter identifies the primary language of each manuscript and excludes manuscripts whose language is not in the accepted list. It operates in one of two modes: rule-based pattern matching or AI-assisted semantic detection.
Configuration
Basic Configuration
[filters.language]
enabled = true
accepted_languages = ["en", "es", "fr"]
use_ai = false
Configuration Parameters
Parameter | Type | Default | Description |
---|---|---|---|
enabled | boolean | false | Enable/disable the filter |
accepted_languages | array | ["en"] | ISO 639-1 language codes to accept |
use_ai | boolean | false | Use AI for detection (requires LLM config) |
How It Works
Processing Order
- Language detection runs after deduplication (skips already excluded duplicates)
- Analyzes each manuscript’s title, abstract, and journal fields
- Determines primary language
- Excludes manuscripts not in the accepted languages list
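The processing order above can be sketched roughly as follows. This is a minimal illustration, not the filter's actual code: the function name `apply_language_filter`, the `detect` callback, and the assumption that deduplication marks duplicates by setting `exclusion_reason` are all hypothetical. The output field names and the "empty list accepts all" behaviour follow this page.

```python
from typing import Callable, Optional


def apply_language_filter(
    records: list[dict],
    accepted_languages: list[str],
    detect: Callable[[dict], Optional[str]],
) -> list[dict]:
    """Tag each record with a language and exclude unaccepted ones (sketch)."""
    for record in records:
        # Assumption: earlier filters (e.g. deduplication) mark excluded
        # records by setting "exclusion_reason"; those are skipped here.
        if record.get("exclusion_reason"):
            continue

        language = detect(record)
        record["tag_detected_language"] = language

        # An empty accepted list means "accept all" (detection only).
        if accepted_languages and language not in accepted_languages:
            record["exclusion_reason"] = f"Language not accepted: {language}"
    return records
```

Passing the detector in as a callback keeps the ordering logic independent of whether rule-based or AI-assisted detection is used.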
Field Priority
- Title language has priority over abstract language, because many scientific databases translate abstracts to English while keeping the original titles
- Journal names can indicate regional publications (e.g., “Revista Española”, “Deutsche Zeitschrift”)
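One way to picture the priority rule is a simple ordered lookup. The sketch below is illustrative only: `resolve_primary_language` is a hypothetical helper, and using the journal-derived language as a last resort is an assumption, since this page only states that the title outranks the abstract.

```python
from typing import Optional


def resolve_primary_language(
    title_lang: Optional[str],
    abstract_lang: Optional[str],
    journal_lang: Optional[str] = None,
) -> Optional[str]:
    """Return the first available language, trying the title first (sketch)."""
    # Title before abstract follows the documented priority; the journal
    # hint in last position is an assumption made for this example.
    for candidate in (title_lang, abstract_lang, journal_lang):
        if candidate:
            return candidate
    return None
```

For a record with a Spanish title and an English (translated) abstract, `resolve_primary_language("es", "en")` returns `"es"`.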
Detection Methods
Rule-Based Detection (use_ai = false)
When use_ai = false, the filter uses pattern matching:
Detection Method:
- Analyzes character scripts (Latin, Cyrillic, CJK, Arabic, Hebrew, Greek)
- Checks for common words in major languages (English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Arabic)
- Fast and privacy-preserving (no external API calls)
- Works offline without dependencies
Limitations:
- May struggle with short titles or mixed-language content
- Limited to major languages with predefined word lists
- Less accurate for technical/scientific text with many Latin terms
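To make the two rule-based steps concrete, here is a minimal sketch that first looks at the dominant character script and then counts common words for Latin-script text. It is not the filter's actual implementation: the word lists are deliberately tiny, the script-to-language mapping is a crude assumption (one script can serve many languages), and the names `dominant_script` and `detect_language` are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical, deliberately small word lists for illustration only.
COMMON_WORDS = {
    "en": {"the", "of", "and", "in", "for", "with"},
    "es": {"el", "la", "de", "los", "en", "del", "para"},
    "fr": {"le", "la", "les", "des", "et", "du", "pour"},
    "de": {"der", "die", "das", "und", "von", "mit"},
}

# Assumed mapping from a Unicode script prefix to a single language code.
SCRIPT_LANGUAGE = {"CYRILLIC": "ru", "ARABIC": "ar", "CJK": "zh"}


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script prefix among letters, e.g. 'LATIN'."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def detect_language(text: str) -> Optional[str]:
    """Guess a language code from script and common words (sketch)."""
    script = dominant_script(text)
    if script is None:
        return None
    if script != "LATIN":
        return SCRIPT_LANGUAGE.get(script)

    # Latin script: score each language by how many common words appear.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The sketch also shows where the limitations come from: a short title such as "Diabetes mellitus" matches no word list and yields `None`, and any language without a predefined list cannot be recognized at all.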
AI-Assisted Detection (use_ai = true)
When use_ai = true and an LLM is configured, the filter uses semantic understanding:
Detection Method:
- Sends the title, abstract, and journal fields to the configured LLM
- Uses a specialized prompt that understands scientific manuscript conventions
- Recognizes that abstracts are often translated while titles remain in the original language
- Handles character encoding variations (é→e, ü→u, ñ→n)
- Identifies primary language even in mixed-language documents
Graceful Fallback:
- If no LLM is configured → falls back to rule-based detection
- If API call fails → falls back to rule-based detection
- If response parsing fails → falls back to rule-based detection
- Always provides a result, never fails completely
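The fallback chain can be pictured as a thin wrapper around the two detectors. The sketch below is an assumption-laden illustration: `detect_with_fallback`, the `ai_detect` and `rule_based` callbacks, and the check that a valid AI answer is a two-letter code are all hypothetical and not taken from the filter's code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def detect_with_fallback(
    record: dict,
    rule_based: Callable[[dict], Optional[str]],
    ai_detect: Optional[Callable[[dict], str]] = None,
) -> Optional[str]:
    """Try AI detection first; fall back to rule-based on any problem (sketch)."""
    # No LLM configured: go straight to rule-based detection.
    if ai_detect is None:
        return rule_based(record)

    try:
        language = ai_detect(record).strip().lower()
        # Assumed parsing rule: a usable answer is a two-letter ISO 639-1 code.
        if len(language) != 2 or not language.isalpha():
            raise ValueError(f"unparseable language code: {language!r}")
        return language
    except Exception as exc:  # API failure or parsing failure
        logger.warning("AI detection failed (%s); using rule-based detection", exc)
        return rule_based(record)
```

Catching every exception is a deliberate choice here, matching the stated goal that the filter always provides a result. A production version would likely want narrower exception types and retry handling.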
Output Fields
The filter adds these fields to each manuscript record:
Field | Type | Description |
---|---|---|
tag_detected_language | string | Final detected language (prioritizes title) |
tag_title_language | string | Language detected in title field |
tag_abstract_language | string | Language detected in abstract field |
tag_ai_detected_language | string | Language detected by AI (when use_ai=true) |
exclusion_reason | string | "Language not accepted: [language]" if excluded |
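As an illustration of how these fields might be consumed downstream, the following sketch tallies detected languages and language-based exclusions. It assumes records are plain dictionaries carrying the fields listed above. The helper `summarize_languages` and the `"unknown"` bucket are hypothetical.

```python
from collections import Counter

# Prefix of the exclusion message documented for this filter.
LANGUAGE_EXCLUSION_PREFIX = "Language not accepted"


def summarize_languages(records: list[dict]) -> tuple[Counter, Counter]:
    """Count detected languages and language-based exclusions (sketch)."""
    detected: Counter = Counter()
    excluded: Counter = Counter()
    for record in records:
        # Records without a tag are grouped under "unknown" (an assumption).
        language = record.get("tag_detected_language") or "unknown"
        detected[language] += 1

        reason = record.get("exclusion_reason") or ""
        if reason.startswith(LANGUAGE_EXCLUSION_PREFIX):
            excluded[language] += 1
    return detected, excluded
```

A summary like this can help decide whether the accepted languages list is too narrow for a given collection.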
Supported Language Codes
Common ISO 639-1 language codes:
- en - English
- es - Spanish
- fr - French
- de - German
- it - Italian
- pt - Portuguese
- nl - Dutch
- ru - Russian
- zh - Chinese
- ja - Japanese
- ko - Korean
- ar - Arabic
Example Configurations
English-Only Screening
[filters.language]
enabled = true
accepted_languages = ["en"]
use_ai = false
Multi-Language European Collection
[filters.language]
enabled = true
accepted_languages = ["en", "es", "fr", "de", "it", "pt"]
use_ai = false
Multi-Language with AI Detection
[filters.language]
enabled = true
accepted_languages = ["en", "es", "fr", "de"]
use_ai = true
[[filters.llm]]
provider = "OpenAI"
api_key = "" # Uses environment variable
model = "gpt-4o-mini"
temperature = 0.01
Accept All Languages (Detection Only)
[filters.language]
enabled = true
accepted_languages = [] # Empty means accept all
use_ai = false # Still detects and tags language
Performance Considerations
Rule-Based Mode
- Speed: Very fast (milliseconds per manuscript)
- Accuracy: Good for major languages with distinct patterns
- Cost: Free, no API calls
AI-Assisted Mode
- Speed: Depends on API latency
- Accuracy: Better for edge cases and mixed languages
- Cost: API costs apply
Best Practices
- Title Priority: Trust title language over abstract due to translation practices
- Journal Context: Consider journal names as language indicators
- AI for Edge Cases: Use AI mode when dealing with regional publications, mixed-language collections, or manuscripts with technical Latin terms
- Fallback Strategy: AI mode always falls back to rule-based on errors
Filter Order
Language detection is applied second in the screening pipeline, after deduplication but before article type classification. This ensures:
- Duplicates that were already excluded are not processed again
- Language tags are available to downstream filters
- Papers in non-target languages are excluded early, before later filters run