prismAId

Deduplication Filter

Overview

The deduplication filter identifies and removes duplicate manuscripts from your dataset by comparing selected metadata fields, optionally with AI assistance for semantic matching.

Configuration

Basic Configuration

[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["title", "authors", "abstract", "doi"]

Configuration Parameters

Parameter       Type     Default                Description
enabled         boolean  false                  Enable/disable the filter
use_ai          boolean  false                  Use AI for semantic duplicate detection
compare_fields  array    ["title", "abstract"]  Fields to compare for duplication
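
As an illustration of how the defaults in the table combine with user settings, here is a minimal Python sketch. It assumes that any key omitted from [filters.deduplication] falls back to its documented default; `DEFAULTS` and `resolve_config` are hypothetical names, not part of prismAId.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "enabled": False,
    "use_ai": False,
    "compare_fields": ["title", "abstract"],
}


def resolve_config(user_settings):
    """Merge user settings over the documented defaults (assumed behavior)."""
    merged = {**DEFAULTS, **user_settings}
    # Copy the list so callers cannot mutate the shared default.
    merged["compare_fields"] = list(merged["compare_fields"])
    return merged


config = resolve_config({"enabled": True})
print(config["compare_fields"])
```

With only `enabled` set, the resolved configuration keeps `use_ai = false` and compares title and abstract.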

How It Works

Simple Matching (Non-AI)

When use_ai = false, the filter compares the fields listed in compare_fields directly, without calling an LLM:

Priority Matching Rules:

Best for: Fast processing when records have consistent metadata or minor variations
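
The priority rules are not listed here, so the following is only a minimal Python sketch of what non-AI matching could look like. It assumes (hypothetically) that a DOI present on both records decides the outcome, and that otherwise every other configured field present on both records must match after normalization. The `normalize` and `is_duplicate` helpers are illustrative names, not prismAId functions.

```python
import re
import unicodedata


def normalize(value):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # ASCII-oriented: anything outside a-z and 0-9 becomes a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def is_duplicate(a, b, compare_fields):
    """Assumed rule: a shared DOI decides; otherwise all shared fields must match."""
    if "doi" in compare_fields and a.get("doi") and b.get("doi"):
        return normalize(a["doi"]) == normalize(b["doi"])
    shared = [f for f in compare_fields if f != "doi" and a.get(f) and b.get(f)]
    # With no field present on both records there is nothing to compare.
    return bool(shared) and all(normalize(a[f]) == normalize(b[f]) for f in shared)
```

Normalizing before comparison is what lets minor variations such as case, accents, or trailing punctuation still count as a match.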

AI-Assisted Matching

When use_ai = true and an LLM is configured, the filter uses semantic understanding:

AI Capabilities:

AI Prompt Used: The AI compares manuscripts considering:

Output Fields

The filter adds these fields to each manuscript record:

Field             Type     Description
tag_is_duplicate  boolean  true for duplicates, false for originals
tag_duplicate_of  string   ID of the original record (empty for non-duplicates)
include           boolean  Set to false for duplicates
exclusion_reason  string   "Duplicate of [ID]" for duplicates
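
To show how these fields fit together, here is a minimal Python sketch that tags a record as a duplicate and then selects the records that remain. It assumes records are plain dictionaries and that "[ID]" in the exclusion reason is a placeholder for the original record's ID; `mark_duplicate` and `kept_records` are hypothetical helpers, not prismAId functions.

```python
def mark_duplicate(record, original_id):
    """Return a copy of the record carrying the four output fields for a duplicate."""
    tagged = dict(record)
    tagged["tag_is_duplicate"] = True
    tagged["tag_duplicate_of"] = original_id
    tagged["include"] = False
    tagged["exclusion_reason"] = f"Duplicate of {original_id}"
    return tagged


def kept_records(records):
    """Select records that were not tagged as duplicates."""
    return [r for r in records if not r.get("tag_is_duplicate", False)]


records = [
    {"id": "rec1", "tag_is_duplicate": False, "tag_duplicate_of": ""},
    mark_duplicate({"id": "rec2"}, "rec1"),
]
print([r["id"] for r in kept_records(records)])
```

Downstream steps can rely on either `tag_is_duplicate` or `include` to skip duplicates, while `tag_duplicate_of` keeps the link back to the original.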

Example Configurations

Basic Deduplication

[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["doi", "title"]

Comprehensive Deduplication

[filters.deduplication]
enabled = true
use_ai = false
compare_fields = ["title", "authors", "abstract", "doi", "year"]

AI-Enhanced Deduplication

[filters.deduplication]
enabled = true
use_ai = true
compare_fields = ["title", "authors", "abstract"]

[[filters.llm]]
provider = "OpenAI"
api_key = ""  # Leave empty to read the key from the environment variable
model = "gpt-4o-mini"
temperature = 0.01

Best Practices

  1. Field Selection: Include multiple fields for better accuracy
  2. DOI Priority: Always include DOI if available for exact matching
  3. Author Fields: Include author names so that papers with the same title but different authors are not merged as duplicates
  4. AI Usage: Use AI mode when dealing with:
    • Multiple database sources with different formatting
    • International datasets with character encoding variations
    • Historical data with inconsistent metadata

Performance Considerations

Filter Order

Deduplication runs first in the screening pipeline, so duplicates are removed before any other filter spends time processing them.
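
The efficiency argument can be sketched in Python as follows. This is an illustration only: it assumes each filter is a function that flags records in place by setting `include` to false, and that later filters skip flagged records. `run_pipeline` and the two example filters are hypothetical, not prismAId's implementation.

```python
def run_pipeline(records, filters):
    """Apply filters in order; each filter only sees records still included."""
    for apply_filter in filters:
        active = [r for r in records if r.get("include", True)]
        apply_filter(active)
    return records


def dedup_filter(active):
    """Toy deduplication: flag any record whose title was already seen."""
    seen = set()
    for record in active:
        key = record["title"].lower()
        if key in seen:
            record["include"] = False
        seen.add(key)


def expensive_filter(active):
    """Stand-in for a costly later filter; it just counts what it receives."""
    expensive_filter.processed = len(active)


records = [{"title": "A"}, {"title": "a"}, {"title": "B"}]
run_pipeline(records, [dedup_filter, expensive_filter])
print(expensive_filter.processed)
```

Because the duplicate is flagged first, the later filter handles two records instead of three.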