Transformation Pipelines Across Domains: Patterns in Circuit Compilation and Data Processing

August 2025 • 12 min read
Exploring how architectural patterns learned from building a zero-knowledge circuit compiler directly applied to implementing a structured data converter, revealing universal principles in transformation pipeline design.

I began building Camellia, a zero-knowledge circuit compiler, as a way to explore how circuit descriptions could be compiled to R1CS constraint systems. This involved designing a domain-specific language for writing cryptographic circuits, implementing constraint generation algorithms, and building a complete compilation pipeline from high-level circuit descriptions to mathematical proof systems.

Later, as part of my LFX internship application, I had to complete a coding challenge that required building a structured data converter - sx, which transforms JSON, YAML, and TOML into S-expressions. On the surface, this seemed like a straightforward data munging task compared to the mathematical complexity of zero-knowledge circuits.

But as I implemented sx, something felt eerily familiar despite the significant difference in domain complexity. Strip away the domain specifics - the field arithmetic, the hash functions, the JSON parsing quirks - and you're left with very similar architectural challenges.

The Common Pattern

Both projects follow the same basic transformation pipeline:

Camellia:

Circuit DSL → Parse → AST → Compiler → R1CS → Output (JSON/Circom/etc)

sx:

JSON/YAML/TOML → Parse → AST → Transform → S-expressions → Output

At the architectural level, they're both implementing compilers with similar phases: lexical analysis, parsing to an intermediate representation, transformation, and code generation. The difference is just what they're compiling - circuit descriptions vs data formats.
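
Seen as types, the shared shape is three staged functions composed with early-exit error handling. A schematic sketch, not either project's actual interface:

(* Schematic only: both tools factor into the same three stages. *)
type ('src, 'ast, 'out) pipeline = {
  parse     : 'src -> ('ast, string) result;   (* lexical analysis + parsing *)
  transform : 'ast -> ('ast, string) result;   (* semantic analysis / lowering *)
  generate  : 'ast -> ('out, string) result;   (* code generation *)
}

let run p src =
  Result.bind (p.parse src) (fun ast ->
    Result.bind (p.transform ast) p.generate)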

Architectural Patterns Across Complexity Levels

1. AST Design as the Core Abstraction

The most instructive comparison lies in how both projects approach intermediate representation design, albeit at very different complexity levels.

Camellia's AST design was genuinely complex - it needed to capture circuit semantics, support mathematical operations over finite fields, and enable efficient constraint generation:

type expr =
  | Var of string
  | Const of string  
  | Add of expr * expr
  | Mul of expr * expr
  | Equal of expr * expr
  | Poseidon of expr list

type stmt =
  | Constraint of expr
  | Assign of string * expr

The design choices here directly impact compilation complexity. For instance, having Equal as an expression rather than just in constraints allows for more flexible constraint generation, but requires careful handling during R1CS conversion. Each variant maps to specific constraint generation patterns in the finite field.
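
For example, one standard way an Equal expression can lower to R1CS is the (a - b) × 1 = 0 encoding. A minimal sketch, reusing the add_term/create_constraint helpers shown later; Field.neg is an assumed helper, and wire 0 is the constant 1:

(* Hedged sketch: enforce a = b as (a - b) × 1 = 0. *)
let create_equality_constraint a_wire b_wire =
  let a_lc =
    add_term a_wire Field.one
      (add_term b_wire (Field.neg Field.one) empty_linear_combination) in
  let b_lc = add_term 0 Field.one empty_linear_combination in  (* the constant 1 *)
  let c_lc = empty_linear_combination in                       (* C = 0 *)
  create_constraint a_lc b_lc c_lc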

sx's AST requirements were simpler but followed the same fundamental principle - design an intermediate representation that unifies disparate inputs while preserving semantic information:

type value =
  | Null of Position.t
  | Bool of bool * Position.t
  | Int of int * Position.t
  | Float of float * Position.t
  | String of string * Position.t
  | List of value list * Position.t
  | Assoc of (string * value) list * Position.t

While much simpler than Camellia's mathematical constraints, sx's AST demonstrates the same core principle: a well-designed intermediate representation enables clean separation between parsing and generation phases. The position tracking serves the same fundamental purpose as Camellia's constraint metadata - preserving source context through transformation.

2. Error Handling and Position Tracking

Both projects needed comprehensive error handling systems, but with different requirements.

Camellia's error handling requirements were complex due to the multi-stage compilation process. The system needed to handle syntax errors in the circuit DSL, semantic errors in variable binding and type checking, and mathematical errors during constraint generation:

type error_kind =
  | ParseError of string | UnexpectedToken of string * string
  | TypeError of string | UnboundVariable of string
  | CryptoError of string | InvalidFieldElement of string
  | R1CSError of string | ConstraintUnsatisfiable of string
  | CircuitError of string | WitnessGenerationFailed of string
  (* ... 15+ total error variants for comprehensive coverage *)

type error = {
  kind: error_kind;
  position: position option;
  message: string;
}

The challenge lies in mapping constraint generation errors back to source locations - when a mathematical constraint fails, you need to trace it back through the compilation pipeline to the original circuit description.
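
Concretely, given the record above, a constraint-stage failure can carry the position of the DSL expression that produced the constraint. A small illustrative helper (the name is hypothetical):

(* Hypothetical helper: keep the DSL-level position of the expression
   whose constraint failed, so errors point back at the source. *)
let constraint_error ~position msg =
  { kind = ConstraintUnsatisfiable msg;
    position = Some position;
    message = msg }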

sx's error handling was more straightforward but applied the same architectural principles. The main challenge was providing consistent error reporting across different input formats:

type error_kind =
  | ParseError of string | TypeError of string | IOError of string
  | UnsupportedFeature of string | FormatDetectionError of string

type error = {
  kind : error_kind;
  position : Position.t;
  source_context : string option;
}

type validation_result = {
  filename : string;
  success : bool;
  error : error option;
  warning_count : int;
}

The streaming mode adds another layer of complexity - errors need to be reported without stopping the entire stream, which requires careful error recovery strategies.

3. Performance vs Generality Trade-offs

The projects took different approaches to the performance/flexibility trade-off, which became clear as I worked on both.

With Camellia, I optimized heavily for constraint system efficiency.

This specialization pays off - a hash preimage circuit compiles to just 5 constraints with 6 variables. But it's completely domain-specific.

For sx, the coding challenge requirements pushed me in a different direction - towards operational flexibility.

The streaming implementation is particularly interesting. Instead of loading entire files into memory, it processes JSON arrays and JSON Lines incrementally:

type 'a stream_result = 
  | StreamItem of 'a
  | StreamEnd  
  | StreamError of string

This enables processing multi-gigabyte files with constant memory usage (using 8KB configurable buffers), but adds significant implementation complexity compared to batch processing.
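
A consumer can then drain the stream in constant space, reporting item-level failures without aborting. A hedged sketch (next_item and the handlers are assumptions, not sx's actual API):

let rec drain next_item ~on_item ~on_error =
  match next_item () with
  | StreamItem v -> on_item v; drain next_item ~on_item ~on_error
  | StreamError msg -> on_error msg; drain next_item ~on_item ~on_error  (* recover and continue *)
  | StreamEnd -> ()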

4. Compilation Pipeline Implementation Details

Both projects implement multi-stage compilation pipelines, but with very different complexity levels in each stage.

Camellia's compilation stages:

  1. Lexing/Parsing: Uses Menhir parser generator with custom lexer. The grammar handles circuit syntax including field operations, variable bindings, and constraint assertions:
    circuit HashPreimage {
      inputs: expected_hash
      private inputs: preimage
      computed_hash = poseidon(preimage)
      assert computed_hash == expected_hash
    }
  2. Semantic Analysis: Implements scope checking with variable binding validation. The compiler maintains a compilation context:
    type compilation_context = {
      debug_ctx: debug_context;
      wire_manager: R1cs.wire_manager;
      r1cs: R1cs.r1cs_system;
      variable_bindings: (string, R1cs.wire_id) Hashtbl.t;
      scope_stack: (string, R1cs.wire_id) Hashtbl.t list;
    }
  3. Constraint Generation: The core complexity lies here - converting high-level operations to quadratic constraints. For example, addition becomes:
    let create_addition_constraint a_wire b_wire sum_wire =
      let a_lc = add_term a_wire Field.one (add_term b_wire Field.one empty_linear_combination) in
      let b_lc = add_term 0 Field.one empty_linear_combination in
      let c_lc = add_term sum_wire Field.one empty_linear_combination in
      create_constraint a_lc b_lc c_lc
  4. Wire Management: Implements sophisticated allocation and reuse:
    type wire_manager = {
      mutable next_id: wire_id;
      mutable wires: (wire_id, wire_info) Hashtbl.t;
      mutable name_to_id: (string, wire_id) Hashtbl.t;
    }
    This enables constant deduplication - when the same constant appears multiple times, it reuses the same wire, reducing the constraint count from potentially hundreds to just 5 constraints for a hash preimage circuit (see the sketch after this list).
  5. Backend Generation: Supports multiple output formats with different constraint representations for Circom, Bellman, and raw JSON.
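
To make stage 4's constant deduplication concrete, here is a minimal get-or-create sketch against name_to_id, assuming wire_id is an int and constants are keyed by their canonical string form (bookkeeping in the wires table omitted):

let constant_wire manager value =
  let key = "const_" ^ Field.to_string value in
  match Hashtbl.find_opt manager.name_to_id key with
  | Some id -> id                          (* constant already has a wire: reuse it *)
  | None ->
      let id = manager.next_id in          (* otherwise allocate a fresh one *)
      manager.next_id <- id + 1;
      Hashtbl.replace manager.name_to_id key id;
      id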

sx's transformation stages:

  1. Format Detection: Implements heuristic-based detection combining file extensions and content analysis:
    let detect_by_content content =
      let trimmed = String.trim content in
      if trimmed = "" then None
      else
        let first_char = trimmed.[0] in
        match first_char with
        | '{' | '"' -> Some Ast.JSON
        | '[' -> 
            if String.contains content '=' then Some Ast.TOML else Some Ast.JSON
        | '-' when String.length trimmed > 2 && trimmed.[1] = '-' && trimmed.[2] = '-' -> 
            Some Ast.YAML
        | _ ->
            if String.contains trimmed '=' && String.contains trimmed ':' then
              Some Ast.TOML
            else if String.contains trimmed ':' && not (String.contains trimmed '{') then
              Some Ast.YAML
            else None
  2. Streaming Architecture: The most complex part of sx is the streaming JSON parser that handles large files with constant memory:
    let parse_json_array_stream ~filename channel =
      let state = ref `ExpectOpenBracket in
      let current_item = Buffer.create 256 in
      let brace_depth = ref 0 in
      let in_string = ref false in
      (* incremental parsing logic *)
  3. S-expression Generation: Different generators for Common Lisp vs Scheme with distinct association list representations:
    (* Common Lisp: simple association lists *)
    | Ast.Assoc (assoc, _) ->
        let pairs = List.map (fun (key, value) ->
          Ast.List [Ast.Atom (escape_string key); ast_to_sexp value]
        ) assoc in
        Ast.List pairs
    
    (* Scheme: explicit constructors with cons *)
    | Ast.Assoc (assoc, _) ->
        let pairs = List.map (fun (key, value) ->
          Ast.List [Ast.Atom "cons"; Ast.Atom (escape_string key); ast_to_sexp value]
        ) assoc in
        Ast.List (Ast.Atom "list" :: pairs)

    The key difference: Common Lisp output generates nested lists (("name" "Lua") ("age" 4)) while Scheme output uses explicit constructors (list (cons "name" "Lua") (cons "age" 4)).
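
Rendering either dialect's tree to text is then a single recursive walk. A minimal printer sketch over the Atom/List output type used above:

let rec render = function
  | Ast.Atom s -> s
  | Ast.List items -> "(" ^ String.concat " " (List.map render items) ^ ")"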

The key insight is that while sx's individual stages are simpler, the architectural patterns remain consistent: lexical analysis, intermediate representation, transformation, and generation.

Lessons from Complex to Simple

What struck me was how architectural principles learned from building Camellia's complex compilation pipeline directly informed the simpler sx implementation. The mathematical complexity of zero-knowledge circuits forced me to think carefully about compilation architecture, and those same patterns proved valuable even for straightforward data conversion.

Some specific insights that transferred from the complex domain to the practical one:

Error Recovery Strategies: The comprehensive error handling I developed for Camellia informed how I approached sx's multi-format parsing challenges. Both needed to provide actionable feedback without halting the entire process.

AST Design Principles: The position tracking patterns I learned from Camellia's constraint-to-source mapping proved directly applicable to sx's error reporting across different input formats.

Performance Profiling: Having built performance analysis into Camellia's constraint generation, I knew to include similar pipeline timing/memory tracking in sx from the start.

Universal Patterns in Transformation Pipelines

Building a complex DSL like Camellia and then applying those patterns to a practical tool like sx reinforced several architectural principles that seem to hold regardless of domain complexity:

  1. Intermediate representations matter more than you think - The AST design constrains everything downstream
  2. Error handling is architectural, not tactical - Plan for failure modes from the start
  3. Performance optimization requires choosing a target - General-purpose flexibility and domain-specific performance are fundamentally in tension
  4. Streaming is a different paradigm - Memory-bounded processing requires rethinking the entire pipeline

Deep Technical Implementation Details

Camellia's Mathematical Foundation:

The core complexity lies in R1CS (Rank-1 Constraint System) generation. Each constraint has the form A × B = C where A, B, C are linear combinations of wires:

type constraint_triple = {
  a: linear_combination;  (* Σ(coeff_i × wire_i) *)
  b: linear_combination;
  c: linear_combination;
}

For a simple multiplication x * y = z, this maps directly: A = x, B = y, C = z, with each linear combination containing a single term.

But addition requires encoding as multiplication: (x + y) * 1 = sum, demonstrating why R1CS is a challenging compilation target.
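
The multiplication case maps one-to-one. A minimal sketch using the same helpers as create_addition_constraint below:

(* x * y = z: A = x, B = y, C = z, each a single-term linear combination. *)
let create_multiplication_constraint x_wire y_wire z_wire =
  let a_lc = add_term x_wire Field.one empty_linear_combination in
  let b_lc = add_term y_wire Field.one empty_linear_combination in
  let c_lc = add_term z_wire Field.one empty_linear_combination in
  create_constraint a_lc b_lc c_lc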

The constraint generation uses field arithmetic over BN254 (modulus: 21888242871839275222246405745257275088696311157297823662689037894645226208583):

let compile_add ctx e1 e2 =
  (* compile both operands first, threading the compilation context *)
  let* (result1, ctx1) = compile_expr ctx e1 in
  let* (result2, ctx2) = compile_expr ctx1 e2 in
  (* allocate an intermediate wire for the sum, named after its operands *)
  let wire_name = Printf.sprintf "add_%d_%d" result1.wire_id result2.wire_id in
  let* (sum_wire, ctx3) = allocate_wire ctx2 wire_name R1cs.Intermediate () in
  let addition_constraint = R1cs.create_addition_constraint result1.wire_id result2.wire_id sum_wire in
  let final_ctx = add_constraint_to_context ctx3 addition_constraint in
  let all_constraints = result1.constraints @ result2.constraints @ [addition_constraint] in
  Ok ({ wire_id = sum_wire; constraints = all_constraints }, final_ctx)

The actual create_addition_constraint implements the R1CS encoding:

let create_addition_constraint a_wire b_wire sum_wire =
  let a_lc = add_term a_wire Field.one (add_term b_wire Field.one empty_linear_combination) in
  let b_lc = add_term 0 Field.one empty_linear_combination in
  let c_lc = add_term sum_wire Field.one empty_linear_combination in
  create_constraint a_lc b_lc c_lc

This creates the constraint (a_wire + b_wire) × 1 = sum_wire, where wire 0 represents the constant 1.

The field arithmetic implementation validates all values against the BN254 modulus:

let validate_field_string s =
  if String.length s = 0 then
    Error (invalid_field_element s ~context:"empty string" ())
  else if String.for_all (function '0'..'9' -> true | _ -> false) s then
    (* coarse length-based bound: values with the same digit count as the
       modulus pass this check *)
    if String.length s > String.length bn254_modulus then
      Error (invalid_field_element s ~context:"exceeds field modulus" ())
    else Ok s
  else Error (invalid_field_element s ~context:"non-numeric characters" ())

Constraint validation ensures mathematical soundness by checking A × B = C for each constraint:

let validate_constraint constraint_triple witness =
  let* a_val = evaluate_linear_combination constraint_triple.a witness in
  let* b_val = evaluate_linear_combination constraint_triple.b witness in
  let* c_val = evaluate_linear_combination constraint_triple.c witness in
  let* ab_product = Field.mul a_val b_val in
  let* equal_result = field_equal_evaluated (Field.to_string ab_product) (Field.to_string c_val) in
  if equal_result then Ok () else
    Result.Error (constraint_unsatisfiable 
      (Printf.sprintf "Constraint not satisfied: (%s) * (%s) != (%s)" 
        (Field.to_string a_val) (Field.to_string b_val) (Field.to_string c_val)) ())
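
The evaluation step this relies on reduces each linear combination to a single field element. A minimal sketch, assuming a (wire_id * coefficient) list representation and hypothetical witness_value, Field.add, and Field.zero helpers:

let ( let* ) = Result.bind

let evaluate_linear_combination lc witness =
  List.fold_left
    (fun acc (wire_id, coeff) ->
      let* sum = acc in                             (* running total so far *)
      let* value = witness_value witness wire_id in (* hypothetical lookup *)
      let* term = Field.mul coeff value in
      Field.add sum term)
    (Ok Field.zero) lc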

sx's Multi-Format Processing Architecture:

The complexity of sx lies in unifying three distinct data formats with different parsing paradigms, type systems, and structural conventions into a coherent transformation pipeline.

Format Detection and Unified Processing:

The system implements heuristics combining file extensions and content analysis to automatically detect input formats:

let detect_format ~filename content =
  let ext_hint = match Filename.extension filename with
    | ".json" | ".jsonl" -> Some JSON | ".yaml" | ".yml" -> Some YAML  
    | ".toml" -> Some TOML | _ -> None in
  let content_hint = match String.trim content with
    | s when String.length s > 0 -> (match s.[0] with
        | '{' | '[' | '"' -> Some JSON
        | _ when String.contains content ':' && not (String.contains content '=') -> Some YAML
        | _ when String.contains content '=' -> Some TOML
        | _ -> None)
    | _ -> None in
  match ext_hint, content_hint with
  | Some fmt, _ -> fmt | None, Some fmt -> fmt | None, None -> JSON
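
The behavior these heuristics imply: the extension wins when present, content breaks ties, and JSON is the fallback.

let () =
  assert (detect_format ~filename:"data.yaml" "{\"k\": 1}" = YAML);  (* extension wins *)
  assert (detect_format ~filename:"input" "key = \"v\"" = TOML);     (* content heuristic *)
  assert (detect_format ~filename:"input" "" = JSON)                 (* default fallback *)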

Multi-Format Parsing Challenges:

Each format presents unique architectural challenges that must be unified into a common AST.

The key insight is that despite different surface syntax, all three map to the same core data structures: primitives, lists, and association lists.
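
For illustration, the same logical record in all three surface syntaxes parses to one unified value (pos stands in for whatever Position.t the parser records):

(* JSON:  {"name": "Lua", "age": 4}
   YAML:  name: Lua
          age: 4
   TOML:  name = "Lua"
          age = 4                      *)
let unified pos =
  Ast.Assoc ([ ("name", Ast.String ("Lua", pos));
               ("age",  Ast.Int (4, pos)) ], pos)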

Unified Error Handling Architecture:

The system standardizes error reporting across formats with format-specific recovery suggestions:

let standardize_error ~filename ~format = function
  | JSON_Error (line, col, msg) -> { filename; format = JSON; position = Some { line; col };
      message = Printf.sprintf "JSON: %s" msg; recovery = "Check for missing commas or quotes" }
  | YAML_Error (line, col, msg) -> { filename; format = YAML; position = Some { line; col };
      message = Printf.sprintf "YAML: %s" msg; recovery = "Verify consistent indentation" }
  | TOML_Error (line, col, msg) -> { filename; format = TOML; position = Some { line; col };
      message = Printf.sprintf "TOML: %s" msg; recovery = "Check section headers and value types" }

S-Expression Generation Variants:

The output stage demonstrates how the same AST can generate different S-expression dialects:

(* Common Lisp: simple association lists *)
| Ast.Assoc assoc -> 
    let pairs = List.map (fun (k,v) -> Ast.List [Ast.Atom k; to_lisp v]) assoc in
    Ast.List pairs

(* Scheme: explicit constructors *)
| Ast.Assoc assoc -> 
    let pairs = List.map (fun (k,v) -> Ast.List [Ast.Atom "cons"; Ast.Atom k; to_scheme v]) assoc in
    Ast.List (Ast.Atom "list" :: pairs)

Performance and Memory Patterns:

Different formats exhibit distinct performance characteristics, and sx adapts its processing mode accordingly.

The automatic streaming threshold of 1MB reflects a practical balance between simplicity for small files and memory efficiency for large datasets.
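
A minimal sketch of how such a size gate might choose between the two code paths (the 1MB figure is from the text; Unix.stat and the mode names are assumptions):

let mode_for_file filename =
  let size = (Unix.stat filename).Unix.st_size in
  if size > 1024 * 1024 then `Streaming  (* constant-memory incremental path *)
  else `Batch                            (* read, parse, and emit in one pass *)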