Parser Pipeline

markymark uses two markdown parsers for different purposes, plus dedicated parsers for structured formats.

The primary path uses md4c, a fast CommonMark parser accessed through Zig bindings in markymark-kernels. This runs on every keystroke in your editor.

The LSP flow:

  1. Editor sends textDocument/didChange with updated text
  2. The Document Engine (a stateful Zig object per document) re-parses via md4c
  3. The ExtractionRenderer callback extracts symbols: headings, links, tags, block IDs, code spans, tasks, and embeds
  4. Results are serialized into a flat binary blob
  5. Rust calls DocumentIndex::from_blob() to build the per-document index
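
Steps 4 and 5 can be pictured with a minimal sketch of decoding a flat blob into an index. The layout here is purely an assumption for illustration (a little-endian u32 record count, then one kind byte and two u32 offsets per record), and SketchIndex, Symbol, and SymbolKind are hypothetical names; the real format read by DocumentIndex::from_blob() is not described on this page.

```rust
// Hypothetical blob layout (an assumption, not markymark's real format):
//   u32 LE record count, then per record: u8 kind, u32 LE start, u32 LE end.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Heading,
    Link,
    Tag,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Symbol {
    pub kind: SymbolKind,
    pub start: u32,
    pub end: u32,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct SketchIndex {
    pub symbols: Vec<Symbol>,
}

/// Size of one record: 1 kind byte + two 4-byte offsets.
const RECORD_LEN: usize = 9;

/// Reads a little-endian u32 at `at`, or None if the blob is too short.
fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let chunk = bytes.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}

impl SketchIndex {
    /// Decodes the blob; returns None on truncation or an unknown kind byte.
    pub fn from_blob(blob: &[u8]) -> Option<Self> {
        let count = read_u32(blob, 0)? as usize;
        let mut symbols = Vec::new();
        for i in 0..count {
            let base = 4 + i * RECORD_LEN;
            let kind = match *blob.get(base)? {
                0 => SymbolKind::Heading,
                1 => SymbolKind::Link,
                2 => SymbolKind::Tag,
                _ => return None,
            };
            symbols.push(Symbol {
                kind,
                start: read_u32(blob, base + 1)?,
                end: read_u32(blob, base + 5)?,
            });
        }
        Some(Self { symbols })
    }
}
```

A flat layout like this one can be read with plain offset arithmetic, so nothing but bytes has to cross the language boundary.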

Tree-sitter (secondary — MCP batch, hover)

Tree-sitter provides a full syntax tree for:

  • MCP batch indexing: from_ast() extracts frontmatter via tree-sitter, then delegates all content extraction to the Zig scan backend
  • Hover and go-to-definition: precise AST node positions

The Parser struct in markymark-parser/src/lib.rs wraps tree-sitter and produces an Ast — markymark’s internal representation.

Both parsers extract the same symbol types into DocumentIndex:

Symbol             Example
Headings           ## Title (with slug, level, position)
Wiki links         [[target]], [[target|alias]]
Markdown links     [text](url#anchor)
Tags               #tag
XML tags           Custom XML elements
Block IDs          ^block-id
Code spans         `symbol`
Tasks              - [ ] / - [x]
Embeds             ![[target]]
Frontmatter        YAML/TOML key-value pairs
Properties         key:: value (Logseq)
Callouts           > [!type] Title (Obsidian)
Query blocks       {{query ...}} (Logseq)
Link definitions   [label]: url "title"
Block references   ((uuid)) (Logseq)
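
To make two of the rows concrete, here is a minimal sketch of pulling wiki links and embeds out of raw text. It is not how md4c or the Zig scan backend does it; WikiLink and scan_wiki_links are hypothetical names, and the scan deliberately ignores code spans, code blocks, and nested brackets.

```rust
/// One [[target]] or [[target|alias]] occurrence (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct WikiLink {
    pub target: String,
    pub alias: Option<String>,
    /// True when the link is written as an embed: ![[target]]
    pub is_embed: bool,
}

/// Naive left-to-right scan for wiki links and embeds.
pub fn scan_wiki_links(text: &str) -> Vec<WikiLink> {
    let mut links = Vec::new();
    let mut pos = 0;
    while let Some(rel_open) = text[pos..].find("[[") {
        let open = pos + rel_open;
        let body_start = open + 2;
        // Stop if the opening brackets are never closed.
        let close = match text[body_start..].find("]]") {
            Some(rel) => body_start + rel,
            None => break,
        };
        let body = &text[body_start..close];
        // A '!' directly before the brackets marks an embed.
        let is_embed = open > 0 && text.as_bytes()[open - 1] == b'!';
        // Everything after the first '|' is the alias.
        let (target, alias) = match body.split_once('|') {
            Some((t, a)) => (t.trim(), Some(a.trim().to_string())),
            None => (body.trim(), None),
        };
        if !target.is_empty() {
            links.push(WikiLink {
                target: target.to_string(),
                alias,
                is_embed,
            });
        }
        pos = close + 2;
    }
    links
}
```

A real extractor also has to record positions and skip links inside code, which is part of why extraction hangs off a parser rather than a plain text search.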

XML tags are a special case — md4c treats HTML as pass-through, so XML extraction runs as a separate scan step after parsing.

Non-markdown files take a separate path through markymark-parser/src/structured/:

Format        Parser
JSON, JSONC   tree-sitter-json
JSON5         json5 crate
JSONL         Line-by-line JSON
YAML          tree-sitter-yaml
TOML          tree-sitter-toml-ng
.env, INI     Custom flat-format parser

These produce a StructuredDocumentIndex that shares the RealmIndex infrastructure but exposes key paths (e.g., server.port) instead of markdown-specific symbols.
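
As a rough illustration of what a key path is, the sketch below flattens a nested value into dotted paths such as server.port. The Value enum and key_paths function are hypothetical stand-ins; the actual StructuredDocumentIndex types and its handling of arrays are not shown on this page.

```rust
use std::collections::BTreeMap;

/// Simplified structured value (hypothetical; scalars and tables only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Scalar(String),
    Table(BTreeMap<String, Value>),
}

/// Flattens nested tables into (dotted key path, scalar value) pairs.
pub fn key_paths(root: &Value) -> Vec<(String, String)> {
    let mut out = Vec::new();
    walk(root, String::new(), &mut out);
    out
}

fn walk(value: &Value, path: String, out: &mut Vec<(String, String)>) {
    match value {
        // A scalar ends the path: record it with its value.
        Value::Scalar(s) => out.push((path, s.clone())),
        // A table extends the path by one dotted segment per key.
        Value::Table(map) => {
            for (key, child) in map {
                let next = if path.is_empty() {
                    key.clone()
                } else {
                    format!("{}.{}", path, key)
                };
                walk(child, next, out);
            }
        }
    }
}
```

Because the sketch uses a BTreeMap, paths come out sorted by key, so a config with a server table and a top-level debug flag yields debug, server.host, server.port in that order.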

markymark-core/src/scanner.rs defines ScanBackend — an abstraction over symbol scanning with two implementations:

  • Md4cScanBackend — md4c FFI via markymark-kernels
  • ZigScanBackend — Zig SIMD scan backend

Both return the same result types, keeping the indexing layer parser-agnostic.
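
The shape of that abstraction can be sketched as a trait with interchangeable implementations. The method signature, the ScanResult fields, and the two stand-in backends below (LineScanBackend and ByteScanBackend) are assumptions for illustration only; they are not the definitions in scanner.rs, and neither stand-in calls md4c or the Zig kernels.

```rust
/// Result type shared by every backend (hypothetical; headings only).
#[derive(Debug, PartialEq, Eq)]
pub struct ScanResult {
    /// (level, text) for each ATX heading found.
    pub headings: Vec<(usize, String)>,
}

/// Assumed shape of the abstraction: one scan method, one result type.
pub trait ScanBackend {
    fn scan(&self, source: &str) -> ScanResult;
}

/// Stand-in backend that works on string slices.
pub struct LineScanBackend;
/// Stand-in backend that works on raw bytes.
pub struct ByteScanBackend;

impl ScanBackend for LineScanBackend {
    fn scan(&self, source: &str) -> ScanResult {
        let mut headings = Vec::new();
        for line in source.lines() {
            let rest = line.trim_start_matches('#');
            let level = line.len() - rest.len();
            // 1 to 6 '#' characters followed by a space.
            if (1..=6).contains(&level) && rest.starts_with(' ') {
                headings.push((level, rest.trim().to_string()));
            }
        }
        ScanResult { headings }
    }
}

impl ScanBackend for ByteScanBackend {
    fn scan(&self, source: &str) -> ScanResult {
        let mut headings = Vec::new();
        for line in source.split('\n') {
            let level = line.bytes().take_while(|&b| b == b'#').count();
            if (1..=6).contains(&level) && line.as_bytes().get(level) == Some(&b' ') {
                headings.push((level, line[level..].trim().to_string()));
            }
        }
        ScanResult { headings }
    }
}

/// Indexing-side code depends only on the trait, not on a concrete backend.
pub fn heading_count(backend: &dyn ScanBackend, source: &str) -> usize {
    backend.scan(source).headings.len()
}
```

Since callers such as heading_count only see the trait, swapping one backend for the other requires no change on the indexing side.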