Parser Pipeline
markymark uses two markdown parsers for different purposes, plus dedicated parsers for structured formats.
Two markdown parsers
md4c (primary — LSP)
The primary path uses md4c, a fast CommonMark
parser accessed through Zig bindings in markymark-kernels. This runs on every
keystroke in your editor.
The LSP flow:
- Editor sends textDocument/didChange with updated text
- The Document Engine (a stateful Zig object per document) re-parses via md4c
- The ExtractionRenderer callback extracts symbols: headings, links, tags, block IDs, code spans, tasks, and embeds
- Results are serialized into a flat binary blob
- Rust calls DocumentIndex::from_blob() to build the per-document index
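The last two steps of this flow (serialize to a flat blob, then rebuild an index on the Rust side) can be sketched as follows. This is only an illustration: the record layout, the `Symbol` struct, and the `encode`/`decode` functions are assumptions made up for this example, not markymark's actual blob format or the real `DocumentIndex::from_blob()` implementation.

```rust
// Hypothetical blob layout, for illustration only (not markymark's real format):
//   [u32 count] then `count` records of [u8 kind][u32 offset][u16 name_len][name bytes]
// All integers are little-endian.

#[derive(Debug, PartialEq)]
pub struct Symbol {
    pub kind: u8,     // made-up codes, e.g. 0 = heading, 1 = tag
    pub offset: u32,  // byte offset of the symbol in the document
    pub name: String, // assumed shorter than 65536 bytes in this sketch
}

/// Serialize symbols into one flat byte buffer.
pub fn encode(symbols: &[Symbol]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(symbols.len() as u32).to_le_bytes());
    for s in symbols {
        out.push(s.kind);
        out.extend_from_slice(&s.offset.to_le_bytes());
        out.extend_from_slice(&(s.name.len() as u16).to_le_bytes());
        out.extend_from_slice(s.name.as_bytes());
    }
    out
}

/// Take the next `n` bytes, advancing `pos`; None if the blob is too short.
fn take<'a>(blob: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(n)?;
    let slice = blob.get(*pos..end)?;
    *pos = end;
    Some(slice)
}

fn read_u32(blob: &[u8], pos: &mut usize) -> Option<u32> {
    let b = take(blob, pos, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn read_u16(blob: &[u8], pos: &mut usize) -> Option<u16> {
    let b = take(blob, pos, 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

/// Decode a blob back into symbols; None on truncated or non-UTF-8 input.
pub fn decode(blob: &[u8]) -> Option<Vec<Symbol>> {
    let mut pos = 0;
    let count = read_u32(blob, &mut pos)?;
    let mut symbols = Vec::new();
    for _ in 0..count {
        let kind = take(blob, &mut pos, 1)?[0];
        let offset = read_u32(blob, &mut pos)?;
        let len = read_u16(blob, &mut pos)? as usize;
        let name = std::str::from_utf8(take(blob, &mut pos, len)?).ok()?.to_owned();
        symbols.push(Symbol { kind, offset, name });
    }
    Some(symbols)
}
```

A flat, length-prefixed layout like this needs no pointer fix-ups when it crosses an FFI boundary, which is one plausible reason to prefer a blob over passing nested structures.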
Tree-sitter (secondary — MCP batch, hover)
Tree-sitter provides a full syntax tree for:
- MCP batch indexing: from_ast() extracts frontmatter via tree-sitter, then delegates all content extraction to the Zig scan backend
- Hover and go-to-definition: precise AST node positions
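To make the "frontmatter first, then delegate the body" split concrete, here is a minimal sketch that separates a `---`-fenced frontmatter block from the rest of the document. It uses plain string matching rather than tree-sitter, and `split_frontmatter` is a hypothetical helper, not part of markymark; the real from_ast() works on the syntax tree.

```rust
/// Split a document into (frontmatter, body).
/// Assumptions of this sketch: the opening `---` is the very first line, and
/// the closing `---` is on its own line followed by a newline.
pub fn split_frontmatter(doc: &str) -> (Option<&str>, &str) {
    let rest = match doc.strip_prefix("---\n") {
        Some(r) => r,
        None => return (None, doc), // no opening fence: everything is body
    };
    // Empty frontmatter: the closing fence follows immediately.
    if let Some(body) = rest.strip_prefix("---\n") {
        return (Some(""), body);
    }
    match rest.find("\n---\n") {
        // Keep the trailing newline on the frontmatter, skip the fence itself.
        Some(end) => (Some(&rest[..end + 1]), &rest[end + 5..]),
        None => (None, doc), // unterminated fence: treat as plain body
    }
}
```

In this sketch the body slice is what would be handed to a scan backend for content extraction.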
The Parser struct in markymark-parser/src/lib.rs wraps tree-sitter and
produces an Ast — markymark’s internal representation.
Symbol extraction
Both parsers extract the same symbol types into DocumentIndex:
| Symbol | Example |
|---|---|
| Headings | ## Title with slug, level, position |
| Wiki links | [[target]], [[target\|alias]] |
| Markdown links | [text](url#anchor) |
| Tags | #tag |
| XML tags | Custom XML elements |
| Block IDs | ^block-id |
| Code spans | `symbol` |
| Tasks | - [ ] / - [x] |
| Embeds | ![[target]] |
| Frontmatter | YAML/TOML key-value pairs |
| Properties | key:: value (Logseq) |
| Callouts | > [!type] Title (Obsidian) |
| Query blocks | {{query ...}} (Logseq) |
| Link definitions | [label]: url "title" |
| Block references | ((uuid)) (Logseq) |
XML tags are a special case — md4c treats HTML as pass-through, so XML extraction runs as a separate scan step after parsing.
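A separate pass of this kind might look like the sketch below, which walks the raw text and records the name and byte offset of each opening tag. The function `scan_xml_tags` is hypothetical and deliberately simplified: it does not skip code spans or fenced blocks, does not parse attributes, and ignores closing tags.

```rust
/// Return (byte offset of '<', tag name) for each opening tag in `text`.
/// A tag name here is an ASCII letter followed by letters, digits, '-' or '_'.
pub fn scan_xml_tags(text: &str) -> Vec<(usize, String)> {
    let bytes = text.as_bytes();
    let mut tags = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] != b'<' {
            i += 1;
            continue;
        }
        let start = i + 1;
        let mut end = start;
        while end < bytes.len()
            && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'-' || bytes[end] == b'_')
        {
            end += 1;
        }
        // Reject "</close>", "< 2", and names that start with a digit or dash.
        if end > start && bytes[start].is_ascii_alphabetic() {
            tags.push((i, text[start..end].to_string()));
        }
        i = end; // `end` is always at least i + 1, so the loop advances
    }
    tags
}
```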
Structured format parsing
Non-markdown files take a separate path through markymark-parser/src/structured/:
| Format | Parser |
|---|---|
| JSON, JSONC | tree-sitter-json |
| JSON5 | json5 crate |
| JSONL | Line-by-line JSON |
| YAML | tree-sitter-yaml |
| TOML | tree-sitter-toml-ng |
| .env, INI | Custom flat-format parser |
These produce a StructuredDocumentIndex that shares the RealmIndex
infrastructure but exposes key paths (e.g., server.port) instead of
markdown-specific symbols.
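As an illustration of what a key path is, the following sketch flattens INI-style text into dotted paths such as server.port. The `key_paths` function and its rules for comments and sections are assumptions for this example, not the behavior of markymark's custom flat-format parser.

```rust
use std::collections::BTreeMap;

/// Parse INI-like text into dotted key paths: `[server]` followed by
/// `port = 8080` yields "server.port" -> "8080". Keys that appear before any
/// section header stay bare, which also covers `.env`-style `KEY=value` lines.
pub fn key_paths(text: &str) -> BTreeMap<String, String> {
    let mut section = String::new();
    let mut out = BTreeMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') || line.starts_with(';') {
            continue; // blank line or comment
        }
        if line.starts_with('[') && line.ends_with(']') {
            section = line[1..line.len() - 1].trim().to_string();
            continue;
        }
        if let Some(eq) = line.find('=') {
            let key = line[..eq].trim();
            let value = line[eq + 1..].trim();
            let path = if section.is_empty() {
                key.to_string()
            } else {
                format!("{}.{}", section, key)
            };
            out.insert(path, value.to_string());
        }
        // Lines without '=' are ignored in this sketch.
    }
    out
}
```

Looking up `"server.port"` in the returned map then works the same way regardless of which structured format the value originally came from.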
The ScanBackend trait
markymark-core/src/scanner.rs defines ScanBackend — an abstraction over
symbol scanning with two implementations:
- Md4cScanBackend: md4c FFI via markymark-kernels
- ZigScanBackend: Zig SIMD scan backend
Both return the same result types, keeping the indexing layer parser-agnostic.
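The shape of such an abstraction can be sketched with a trait and two interchangeable implementations. Everything below is illustrative: the method `scan_headings`, the `Heading` type, and the two stand-in backends (`LineScanBackend`, `ByteScanBackend`) are assumptions, and the real ScanBackend trait in markymark-core may look quite different.

```rust
#[derive(Debug, PartialEq)]
pub struct Heading {
    pub level: u8,
    pub text: String,
}

/// Hypothetical, reduced version of a scan abstraction: headings only.
pub trait ScanBackend {
    fn scan_headings(&self, text: &str) -> Vec<Heading>;
}

/// Recognize simple ATX headings ("# Title" through "###### Title").
fn parse_heading(line: &str) -> Option<Heading> {
    let level = line.bytes().take_while(|&b| b == b'#').count();
    if level == 0 || level > 6 || !line[level..].starts_with(' ') {
        return None; // not a heading (this also rejects "#tag")
    }
    Some(Heading { level: level as u8, text: line[level..].trim().to_string() })
}

/// Stand-in backend that iterates over `str::lines`.
pub struct LineScanBackend;

/// Stand-in backend that splits on raw newline bytes instead.
pub struct ByteScanBackend;

impl ScanBackend for LineScanBackend {
    fn scan_headings(&self, text: &str) -> Vec<Heading> {
        text.lines().filter_map(parse_heading).collect()
    }
}

impl ScanBackend for ByteScanBackend {
    fn scan_headings(&self, text: &str) -> Vec<Heading> {
        text.as_bytes()
            .split(|&b| b == b'\n')
            .filter_map(|raw| std::str::from_utf8(raw).ok())
            .filter_map(parse_heading)
            .collect()
    }
}

/// The indexing layer depends only on the trait, not on a concrete backend.
pub fn heading_count(backend: &dyn ScanBackend, text: &str) -> usize {
    backend.scan_headings(text).len()
}
```

Because `heading_count` takes a `&dyn ScanBackend`, either backend can be swapped in without touching the caller.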