Development
alembica is an open-source tool written in pure Go, designed to bridge the gap between unstructured text and structured data through LLM-powered extraction.
Architecture
The project follows a modular architecture with clear separation of concerns:
- definitions/: Core data structures and input/output schemas (supports v1 and v2 schema versions)
- validation/: Schema validation helpers ensuring data integrity
- extraction/: Prompt sequencing engine and model execution orchestration
- llm/: Provider integrations (OpenAI, Anthropic, Google AI, Cohere, DeepSeek, Perplexity, AWS Bedrock, Azure AI, Vertex AI, Self-Hosted)
- pricing/: Token-based cost estimation for cloud providers
- utils/: Logging utilities and shared library exports for cross-language interoperability
The architecture enables flexible extraction workflows where users define prompts, specify models, and chain sequential operations to transform unstructured text into structured JSON datasets.
User Approach
alembica can be used in multiple ways to fit different workflows:
- As a Go Package: Import directly into Go applications for native integration
- As a C-Shared Library: Use from Python, R, C#, or other languages via FFI bindings
- Via MCP Server: Integrate with AI agents and autonomous systems through the Model Context Protocol
Users define extraction tasks through JSON input files that specify:
- Schema version and metadata
- Model configurations (provider, model ID, temperature, optional endpoints)
- Prompt sequences with content and ordering
- Optional output validation schemas
MCP Server Integration
The optional alembica-mcp server exposes core functionality as tools for AI agents:
- alembica_validate_input: Validates input schema before processing
- alembica_validate_output: Ensures extracted data matches expected schema
- alembica_extract: Executes the full extraction pipeline
- alembica_compute_costs: Estimates token costs for planned operations
- alembica_list_schemas: Lists available schema versions
The MCP server communicates over stdio using the JSON-RPC 2.0 protocol and supports only schema version v2. This lets agents autonomously perform semantic extraction tasks as part of larger workflows.
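For orientation, a tool invocation over this transport is an ordinary JSON-RPC 2.0 request using MCP's tools/call method. The shape below follows the Model Context Protocol specification; the argument key (`input`) and its placeholder value are illustrative assumptions, not alembica-mcp's documented parameter names.

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "alembica_extract",
    "arguments": {
      "input": "<full alembica v2 input document as JSON>"
    }
  }
}
```

The server replies with a JSON-RPC response carrying the tool result, which the agent can feed into the next step of its workflow.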
Install with: `go install github.com/open-and-sustainable/alembica/cmd/alembica-mcp@latest`
Possible Development Directions
Future enhancements being considered:
- Enhanced Provider Support: Adding more LLM providers and keeping up with new model releases
- Streaming Support: Real-time extraction for large documents with progressive output
- Batch Processing: Optimized handling of multiple documents in parallel
- Schema Evolution: Tools for migrating between schema versions and managing backwards compatibility
- Caching Layer: Reduce redundant API calls by caching intermediate results
- Advanced Validation: Richer output schema validation with custom rules and constraints
- Observability: Enhanced logging, metrics, and tracing for production deployments
- Template Library: Pre-built extraction templates for common use cases (citations, entities, summaries)
- Multi-modal Support: Extending extraction capabilities to images and PDFs
- Fine-tuning Integration: Tools to generate training data from extraction results for model improvement
Contributions addressing these or other improvements are welcome!
Tests
Run the full test suite from the repository root with `go test ./...`
Project Layout
- definitions/: input/output schema and core structures
- validation/: schema validation helpers
- extraction/: prompt sequencing and model execution
- llm/: provider integrations and token checks
- pricing/: cost estimation
- utils/: logging and shared library exports