Development

alembica is an open-source tool written in pure Go, designed to bridge the gap between unstructured text and structured data through LLM-powered extraction.

Architecture

The project follows a modular architecture with a clear separation of concerns. This design enables flexible extraction workflows in which users define prompts, specify models, and chain sequential operations to transform unstructured text into structured JSON datasets.

User Approach

alembica can be used in multiple ways to fit different workflows:

  1. As a Go Package: Import directly into Go applications for native integration
  2. As a C-Shared Library: Use from Python, R, C#, or other languages via FFI bindings
  3. Via MCP Server: Integrate with AI agents and autonomous systems through the Model Context Protocol

Users define extraction tasks through JSON input files that specify the prompts to run, the models to use, and the sequence of operations to chain.

MCP Server Integration

The optional alembica-mcp server exposes core functionality as tools for AI agents. It communicates over stdio transport, follows the JSON-RPC 2.0 protocol, and supports only schema version v2. This lets agents perform semantic extraction tasks autonomously as part of larger workflows.
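To make the transport concrete, here is a minimal sketch of how a client might frame a JSON-RPC 2.0 request for an MCP server speaking over stdio. It is not taken from the alembica source. The rpcRequest type and encodeRequest helper are hypothetical names chosen for illustration. The tools/list method and the newline-delimited framing come from the Model Context Protocol specification. The specific tools that alembica-mcp advertises are not shown, since they are not listed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// rpcRequest is a JSON-RPC 2.0 request object.
// Hypothetical helper type for illustration, not part of alembica.
type rpcRequest struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  any    `json:"params,omitempty"`
}

// encodeRequest serializes one request as a single line of JSON.
// MCP's stdio transport delimits messages with newlines, so the
// encoded message is terminated with '\n'.
func encodeRequest(id int, method string, params any) ([]byte, error) {
	b, err := json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  method,
		Params:  params,
	})
	if err != nil {
		return nil, err
	}
	return append(b, '\n'), nil
}

func main() {
	// "tools/list" is the standard MCP method for discovering
	// the tools a server exposes.
	line, err := encodeRequest(1, "tools/list", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real client would write this to the server's stdin;
	// here it is printed to stdout for inspection.
	os.Stdout.Write(line)
}
```

A real session begins with MCP's initialize handshake before any tools/list or tools/call request. That step is omitted here to keep the sketch short.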

Install with: go install github.com/open-and-sustainable/alembica/cmd/alembica-mcp@latest

Possible Development Directions

Future enhancements being considered:

  1. Enhanced Provider Support: Adding more LLM providers and keeping up with new model releases
  2. Streaming Support: Real-time extraction for large documents with progressive output
  3. Batch Processing: Optimized handling of multiple documents in parallel
  4. Schema Evolution: Tools for migrating between schema versions and managing backwards compatibility
  5. Caching Layer: Reduce redundant API calls by caching intermediate results
  6. Advanced Validation: Richer output schema validation with custom rules and constraints
  7. Observability: Enhanced logging, metrics, and tracing for production deployments
  8. Template Library: Pre-built extraction templates for common use cases (citations, entities, summaries)
  9. Multi-modal Support: Extending extraction capabilities to images and PDFs
  10. Fine-tuning Integration: Tools to generate training data from extraction results for model improvement

Contributions addressing these or other improvements are welcome!

Tests

go test ./...

Project Layout