Convert Tool
Page Contents
- Purpose and Capabilities: what the Convert tool does and why it's necessary
- Usage Methods: how to use the tool across different platforms and programming languages
- Supported File Formats: details on each file format the tool can process
- Conversion Process: how the tool works and what to expect
- Best Practices: recommendations for effective conversions
- Limitations and Considerations: important factors to be aware of
- Troubleshooting: solutions to common conversion issues
- Workflow Integration: how the Convert tool fits into your systematic review process
Purpose and Capabilities
The prismAId Convert tool transforms documents from their native formats into plain text files that can be processed by Large Language Models (LLMs). This critical step bridges the literature acquisition phase and the systematic review analysis by:
- Standardizing formats: Converting various document types into a consistent plain text format
- Extracting content: Pulling textual content from complex formatted documents
- Preparing for analysis: Creating files that are optimized for LLM processing
The tool currently supports three main document formats: PDF, DOCX, and HTML, making it versatile for handling various sources of scientific literature.
Usage Methods
The Convert tool can be accessed through multiple interfaces to accommodate different workflows:
Binary (Command Line)
# Convert all PDFs in a directory
./prismaid -convert-pdf ./papers
# Convert all DOCX files in a directory
./prismaid -convert-docx ./papers
# Convert all HTML files in a directory
./prismaid -convert-html ./papers
Go Package
import "github.com/open-and-sustainable/prismaid"
// Convert files of specified formats in a directory
err := prismaid.Convert("./papers", "pdf,docx,html")
Python Package
import prismaid
# Convert files of specified formats in a directory
prismaid.convert("./papers", "pdf,docx,html")
R Package
library(prismaid)
# Convert files of specified formats in a directory
Convert("./papers", "pdf,docx,html")
Julia Package
using PrismAId
# Convert files of specified formats in a directory
PrismAId.convert("./papers", "pdf,docx,html")
Supported File Formats
PDF (.pdf)
PDF (Portable Document Format) is the most common format for published scientific papers. The Convert tool extracts text from PDFs, including those with complex layouts, and covers the following:
- Text Elements: Extracts main body text, headings, and captions
- Text Flow: Attempts to maintain proper reading order
- Multi-Column Handling: Processes papers with multiple column layouts
- Basic Table Detection: Attempts to preserve tabular data
Limitations: Due to the nature of PDFs, which are essentially digital printouts, text extraction can be imperfect. Some formatting, mathematical equations, and specialized symbols may not convert accurately.
DOCX (.docx)
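Microsoft Word documents (.docx) are common for manuscripts in development and for preprints. For these files, the tool provides:
- Text Extraction: Extracts body text and preserves most of the document's structure; visual formatting such as bold or italic is not carried into plain text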
- List Handling: Maintains numbered and bulleted lists
- Table Support: Extracts content from tables
- Document Structure: Preserves headings and document organization
Limitations: Some complex formatting elements like text boxes or embedded objects may not convert perfectly.
HTML (.html)
HTML files are often used for web-published articles or open-access content:
- Text Content: Extracts main article content
- Structural Elements: Preserves headings and sectioning
- List Elements: Maintains ordered and unordered lists
- Basic Table Support: Extracts tabular data
Limitations: Dynamic content, JavaScript-generated text, or complex layouts may not be fully captured.
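The limitation on JavaScript-generated text can be illustrated with a generic sketch. This is not prismAId's implementation; it uses Python's standard `html.parser` module, and the `VisibleTextExtractor` class and `extract_text` function are hypothetical names chosen for this example. It shows that a parser reading static markup only sees text already present in the file, so content that a script would insert at load time never appears.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text from static markup, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the static text of an HTML string, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """<html><body><h1>Title</h1><p>Static paragraph.</p>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Loaded later";</script>
</body></html>"""
print(extract_text(page))
```

The output contains "Title" and "Static paragraph." but not "Loaded later", because that text only exists after a browser runs the script. If an article is rendered this way, saving the fully rendered page from a browser before conversion is one possible workaround.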
Conversion Process
The Convert tool follows a standardized process:
- File Discovery: The tool scans the specified directory for files of the requested format(s)
- Content Extraction: For each file, the appropriate extraction method is applied based on file type
- Text Processing: Extracted text is processed to remove unnecessary elements and normalize formatting
- Output Generation: A plain text (.txt) file is created for each input document, maintaining the same filename but with a .txt extension
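The discovery and output-naming steps above can be previewed before running a conversion. The following is a minimal sketch using only the Python standard library; it does not call prismAId. The helper names `plan_outputs` and `find_name_collisions` are hypothetical, and the sketch assumes that each .txt file is written next to its input, which the description above does not state explicitly.

```python
from pathlib import Path


def plan_outputs(directory: str, formats: str) -> dict[Path, Path]:
    """Map each matching input file to the .txt path expected after conversion."""
    # Turn "pdf,docx,html" into {".pdf", ".docx", ".html"}.
    wanted = {"." + f.strip().lower() for f in formats.split(",") if f.strip()}
    plan = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            # Same filename with a .txt extension (output location is an assumption).
            plan[path] = path.with_suffix(".txt")
    return plan


def find_name_collisions(plan: dict[Path, Path]) -> list[Path]:
    """Return output paths that more than one input file would map to."""
    seen, clashes = set(), []
    for out in plan.values():
        if out in seen and out not in clashes:
            clashes.append(out)
        seen.add(out)
    return clashes
```

Under that assumption, two inputs that share a base name, such as `paper.pdf` and `paper.docx`, would both map to `paper.txt`. Checking for such collisions beforehand makes it easier to trace each text file back to its source.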
Best Practices
To achieve optimal conversion results:
- Pre-conversion check:
  - Ensure PDF files are text-based, not scanned images
  - Verify that documents are not password-protected or damaged
  - Check that files are complete and correctly formatted
- Directory organization:
  - Keep original files and converted text files organized in separate directories
  - Use consistent file naming conventions to maintain traceability
- Post-conversion verification:
  - IMPORTANT: Always manually check a sample of converted documents to ensure quality
  - Pay special attention to papers with complex formatting, equations, or non-standard characters
  - Consider spot-checking longer documents to verify that all content was properly extracted
- Handling special cases:
  - For papers with significant mathematical content, consider additional manual editing
  - For papers with important tables or figures, supplementary notes may be needed
  - Non-English papers may require special attention to character encoding
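Part of the pre-conversion check can be automated. The sketch below is a rough heuristic, not a validator and not part of prismAId: the function name `precheck_pdf` is hypothetical, and it only looks for three byte markers that a well-formed PDF normally has or lacks. It does not try to detect scanned, image-only PDFs, which generally requires a PDF library or opening the file and trying to select text.

```python
from pathlib import Path


def precheck_pdf(path: str) -> list[str]:
    """Return warnings for one PDF file; an empty list means no obvious problem."""
    data = Path(path).read_bytes()
    warnings = []
    # A PDF file normally begins with a "%PDF-" version header.
    if not data.startswith(b"%PDF-"):
        warnings.append("missing %PDF- header (damaged or not a PDF)")
    # A complete PDF normally ends with an "%%EOF" marker.
    if b"%%EOF" not in data[-1024:]:
        warnings.append("no %%EOF marker near the end (possibly truncated)")
    # Encrypted PDFs reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        warnings.append("contains an /Encrypt entry (possibly password-protected)")
    return warnings
```

A file that produces warnings is worth opening by hand before conversion. A file that produces none can still convert poorly, so this check complements the manual sample review rather than replacing it.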
Limitations and Considerations
IMPORTANT: The conversion process has inherent limitations that users should be aware of:
- PDF Limitations:
  - PDFs store formatting rather than semantic structure, making perfect extraction challenging
  - Multi-column layouts may occasionally be extracted in incorrect order
  - Figures and their captions may be separated or misplaced
  - Mathematical equations often convert poorly to plain text
  - Headers, footers, and page numbers may appear in the middle of content
- Text Recognition Issues:
  - Non-standard fonts may cause character recognition problems
  - Ligatures and special characters might not be preserved correctly
  - Text in images cannot be extracted (including scanned PDF documents)
- Structural Information Loss:
  - Formatting that conveys meaning (bold, italic, etc.) is lost in plain text
  - Document hierarchy may not be perfectly preserved
  - References to figures or tables by location (“see Figure 2 below”) may lose context
- Special Content:
  - Tables are particularly challenging and may lose their structure
  - Citations and references may not maintain their formatting
  - Footnotes may be displaced from their reference points
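When ligatures survive in the output as single Unicode characters (for example "ﬁ" instead of "fi"), a post-processing step can expand them. The sketch below is an optional cleanup applied to already converted text, not something the Convert tool is documented to do; the function name `normalize_extracted_text` is hypothetical. It relies on Unicode NFKC normalization from Python's standard `unicodedata` module.

```python
import unicodedata


def normalize_extracted_text(text: str) -> str:
    """Apply Unicode NFKC normalization to text read from a converted .txt file."""
    return unicodedata.normalize("NFKC", text)


# \ufb01 is the "fi" ligature and \ufb02 is the "fl" ligature.
sample = "The \ufb01rst \ufb02ow coef\ufb01cient"
print(normalize_extracted_text(sample))  # prints: The first flow coefficient
```

NFKC also folds other compatibility characters, for instance a superscript two becomes a plain "2". For papers where superscripts carry meaning, it may be safer to apply this only after reviewing its effect on a sample.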
Troubleshooting
Common Issues and Solutions
- Empty or Very Short Output Files:
  - Issue: The conversion produced an empty or minimal text file
  - Possible causes:
    - The PDF is a scanned image without text layers
    - The document is corrupt or password-protected
    - The file contains primarily non-textual elements
  - Solution: Use OCR software to convert image-based PDFs, or manually type/transcribe critical content
- Garbled Text:
  - Issue: Output contains random characters or illegible text
  - Possible causes:
    - Non-standard encoding
    - Custom fonts without proper mapping
    - Copy protection mechanisms
  - Solution: Try opening the original in different applications and copying text manually, or contact the publisher for an accessible version
- Incomplete Conversion:
  - Issue: Only part of the document was converted
  - Possible causes:
    - File corruption
    - Complex document structure that confused the parser
  - Solution: Try alternative conversion tools or split large documents into smaller sections
- Character Encoding Issues:
  - Issue: Special characters appear incorrectly
  - Possible causes:
    - Mismatched character encoding
    - Non-standard character sets
  - Solution: Manually correct critical passages or try using different encoding options if your programming language interface allows it
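The first two issues above can be screened for automatically after a conversion run. The following sketch scans a directory of converted files and flags those that are very short or contain a high share of unreadable characters. It is independent of prismAId; the function name `flag_suspicious_outputs` and both thresholds (`min_chars`, `max_odd_ratio`) are assumptions to tune for your corpus, and it assumes the outputs are UTF-8 text.

```python
from pathlib import Path


def flag_suspicious_outputs(directory: str, min_chars: int = 500,
                            max_odd_ratio: float = 0.05) -> dict[str, str]:
    """Return {filename: reason} for .txt files that look empty, short, or garbled."""
    flagged = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" turns undecodable bytes into U+FFFD so they can be counted.
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(text) < min_chars:
            flagged[path.name] = f"only {len(text)} characters"
            continue
        # Count replacement characters and non-printable, non-whitespace characters.
        odd = sum(1 for ch in text
                  if ch == "\ufffd" or not (ch.isprintable() or ch.isspace()))
        ratio = odd / len(text)
        if ratio > max_odd_ratio:
            flagged[path.name] = f"{ratio:.0%} unreadable characters"
    return flagged
```

Flagged files are candidates for the solutions listed above, such as OCR or requesting an accessible version. Files that pass can still contain subtler problems, for example text extracted in the wrong column order.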
Workflow Integration
The Convert tool is a critical bridge in the systematic review workflow:
- Literature Identification:
  - Search databases and identify relevant papers
- Literature Acquisition (Download Tool):
  - Download papers from Zotero collections or URL lists
- Format Conversion (Convert Tool):
  - Convert downloaded papers to text format for analysis
  - Verify conversion quality before proceeding
- Review Configuration:
  - Set up your review project configuration
- Systematic Review (Review Tool):
  - Process the converted text files to extract structured information
The Convert tool’s output directly feeds into the Review tool, making the quality of conversion a critical factor in the success of your systematic review. Always allocate sufficient time for post-conversion verification to ensure your review is based on accurately extracted text.
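To make post-conversion verification repeatable, the sample of files to check by hand can be drawn with a fixed random seed. This is a small sketch using only the standard library; `pick_verification_sample` and its default values (10% of files, at least 5, seed 42) are hypothetical choices, not recommendations from the prismAId project.

```python
import random
from pathlib import Path


def pick_verification_sample(directory: str, fraction: float = 0.1,
                             minimum: int = 5, seed: int = 42) -> list[Path]:
    """Pick a reproducible random sample of converted .txt files for manual review."""
    files = sorted(Path(directory).glob("*.txt"))
    # Review at least `minimum` files, but never more than exist.
    size = min(len(files), max(minimum, round(len(files) * fraction)))
    # A seeded generator returns the same sample on every run.
    return sorted(random.Random(seed).sample(files, size))
```

Recording the seed alongside the review notes lets a collaborator reproduce exactly which files were inspected before the text was passed to the Review tool.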