skills/treesitter-grammar-dev/SKILL.md
Tree-sitter Grammar Development Skill
Overview
This skill provides patterns and practices for working with tree-sitter grammars: writing grammar.js files, compiling parsers, running tests, and debugging parse trees.
Grammar Lifecycle
1. Grammar Definition (grammar.js)
Tree-sitter grammars are written in a JavaScript DSL:
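module.exports = grammar({
  name: 'my_language',

  // Tokens produced by the external scanner (see "Custom Scanner")
  externals: $ => [$.indent, $.dedent, $.newline],

  rules: {
    source_file: $ => repeat($._statement),

    _statement: $ => choice(
      $.function_declaration,
      $.variable_declaration,
      $.expression_statement
    ),

    function_declaration: $ => seq(
      'def',
      field('name', $.identifier),
      field('parameters', $.parameter_list),
      ':',
      field('body', $.block)
    ),

    identifier: $ => /[a-zA-Z_][a-zA-Z0-9_]*/,

    // Indentation is context-sensitive, so indent/dedent are external
    // tokens rather than the string literals 'indent' and 'dedent'
    block: $ => seq(
      $.indent,
      repeat1($._statement),
      $.dedent
    )

    // variable_declaration, expression_statement, and parameter_list
    // are omitted here for brevity
  }
});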
Key concepts:
- •$ refers to the grammar rules object
- •seq() = sequence of items in order
- •choice() = one of several alternatives
- •repeat() = zero or more
- •repeat1() = one or more
- •optional() = zero or one
- •field() = named field in AST
- •_-prefixed rules = hidden nodes (not in AST)
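As an illustration of how these combinators compose, here is a minimal sketch of a comma-separated list helper, a pattern many published grammars define near the bottom of grammar.js. The names commaSep, commaSep1, and parameterList are hypothetical and not part of the tree-sitter DSL; seq, repeat, and optional are assumed to be the globals that the tree-sitter CLI provides when it evaluates grammar.js.

```javascript
// Hypothetical helpers (not part of the tree-sitter DSL).
// seq, repeat, and optional are globals supplied by the tree-sitter CLI
// when it evaluates grammar.js; they are only looked up when a rule runs.

// One or more `rule`, separated by commas: rule (',' rule)*
function commaSep1(rule) {
  return seq(rule, repeat(seq(',', rule)));
}

// Zero or more `rule`, separated by commas
function commaSep(rule) {
  return optional(commaSep1(rule));
}

// Example rule body, assuming the grammar defines $.identifier:
//   parameter_list: $ => seq('(', commaSep($.identifier), ')')
const parameterList = $ => seq('(', commaSep($.identifier), ')');
```

Inside the rules object this would be written as parameter_list: parameterList, or inlined as shown in the comment.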
2. Parser Generation
tree-sitter generate
What this does:
- •Reads grammar.js
- •Generates src/parser.c (the LR parse tables)
- •Generates src/grammar.json and the src/tree_sitter/parser.h header
- •Creates src/node-types.json (AST node schema)
Common errors:
- •Empty rules: A rule other than the start rule can match the empty string (not allowed). Left recursion itself is fine, because the generated parser is LR-based
- •Ambiguity: Multiple parse trees possible (resolve with prec())
- •Conflicts: Shift/reduce or reduce/reduce conflicts
3. Custom Scanner (Optional)
For context-sensitive lexing (indentation, heredocs, string interpolation):
C scanner (src/scanner.c):
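#include <tree_sitter/parser.h>

// Order must match the externals array in grammar.js
enum TokenType {
  INDENT,
  DEDENT,
  NEWLINE
};

// Allocate scanner state; this stub is stateless
void *tree_sitter_my_language_external_scanner_create(void) {
  return NULL;
}

// Free whatever create() allocated
void tree_sitter_my_language_external_scanner_destroy(void *payload) {}

// Copy scanner state into buffer; return the number of bytes written
unsigned tree_sitter_my_language_external_scanner_serialize(
  void *payload,
  char *buffer
) {
  return 0;
}

// Restore scanner state from a buffer written by serialize()
void tree_sitter_my_language_external_scanner_deserialize(
  void *payload,
  const char *buffer,
  unsigned length
) {}

// Return true after setting lexer->result_symbol when a token is recognized
bool tree_sitter_my_language_external_scanner_scan(
  void *payload,
  TSLexer *lexer,
  const bool *valid_symbols
) {
  // Custom lexing logic here
  if (valid_symbols[INDENT]) {
    // Check if we should emit INDENT token
  }
  return false;
}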
C++ scanner (src/scanner.cc):
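Same API, but the scanner can use C++ features (vectors, strings, etc.). The five scanner functions must be declared extern "C" so that the generated C parser can link against them.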
4. Compilation
Linux:
gcc -shared -fPIC -O2 -I./src src/parser.c -o parser.so
# With scanner:
gcc -shared -fPIC -O2 -I./src src/parser.c src/scanner.c -o parser.so
macOS:
gcc -dynamiclib -fPIC -O2 -I./src src/parser.c -o parser.so
With C++ scanner:
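# g++ treats .c files as C++, so compile parser.c as C and link with g++
gcc -c -fPIC -O2 -I./src src/parser.c -o parser.o
g++ -c -fPIC -O2 -I./src src/scanner.cc -o scanner.o
g++ -shared parser.o scanner.o -o parser.so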
5. Testing
Corpus tests (test/corpus/*.txt):
==================
Function declaration
==================

def greet(name):
    print(name)

---

(source_file
  (function_declaration
    name: (identifier)
    parameters: (parameter_list
      (identifier))
    body: (block
      (expression_statement
        (call
          function: (identifier)
          arguments: (argument_list
            (identifier)))))))
Format:
- •================== separates test cases
- •Test name on line after first separator
- •Source code to parse
- •--- separator
- •Expected AST in S-expression format
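To make the layout above concrete, here is a minimal sketch that splits a corpus file into its test cases. It is an illustration of the format only, not the parser the tree-sitter CLI uses; the function name parseCorpus is hypothetical, and the sketch assumes that header lines consist of three or more = characters, that the divider is three or more - characters, and that the source code itself contains no such lines.

```javascript
// Hypothetical sketch of the corpus format; not the tree-sitter CLI's own parser.
// Assumes the source under test never contains a bare "===" or "---" line.
function parseCorpus(text) {
  const lines = text.split(/\r?\n/);
  const isHeader = line => /^={3,}\s*$/.test(line);
  const isDivider = line => /^-{3,}\s*$/.test(line);
  const stripBlankEdges = chunk => chunk.join('\n').replace(/^\n+|\n+$/g, '');
  const cases = [];
  let i = 0;

  while (i < lines.length) {
    if (!isHeader(lines[i])) { i++; continue; }
    i++; // skip the opening ===== line

    const name = [];
    while (i < lines.length && !isHeader(lines[i])) name.push(lines[i++]);
    i++; // skip the closing ===== line

    const source = [];
    while (i < lines.length && !isDivider(lines[i])) source.push(lines[i++]);
    i++; // skip the --- divider

    const expected = [];
    while (i < lines.length && !isHeader(lines[i])) expected.push(lines[i++]);

    cases.push({
      name: stripBlankEdges(name).trim(),
      source: stripBlankEdges(source),
      expected: stripBlankEdges(expected),
    });
  }
  return cases;
}
```

A script like this can be handy for listing test names or for checking that every case has a non-empty expected tree before running tree-sitter test.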
Run tests:
tree-sitter test
# Or a specific test, selected by name:
tree-sitter test -f "Function declaration"
6. Interactive Parsing
# Parse file and show tree
tree-sitter parse example.txt
# Parse to JSON
tree-sitter parse example.txt --json
# Use custom grammar
tree-sitter parse example.txt --language ./parser.so
7. Highlighting (Optional)
Query files for syntax highlighting in queries/highlights.scm:
; Highlighting queries
(function_declaration name: (identifier) @function)
(call function: (identifier) @function.call)
(string) @string
(number) @number
(comment) @comment
["def" "class" "return"] @keyword
Common Patterns
Precedence and Associativity
Left associative operators:
expression: $ => choice(
  $.binary_expression,
  $.primary_expression
),

binary_expression: $ => prec.left(choice(
  prec.left(10, seq($.expression, '+', $.expression)),
  prec.left(10, seq($.expression, '-', $.expression)),
  prec.left(20, seq($.expression, '*', $.expression)),
  prec.left(20, seq($.expression, '/', $.expression))
))
Right associative:
assignment: $ => prec.right(seq(
  $.identifier,
  '=',
  $.expression
))
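When the operator list grows, repeating prec.left(...) for every operator becomes error-prone. A minimal sketch of one alternative is to drive the rule from a precedence table. The names BINARY_OPERATORS and binaryExpression are hypothetical, the left/operator/right field names are an addition for illustration, and choice, seq, field, and prec are assumed to be the globals provided by the tree-sitter CLI.

```javascript
// Hypothetical precedence table: [operator, precedence].
// Higher numbers bind tighter, matching the values used above.
const BINARY_OPERATORS = [
  ['+', 10],
  ['-', 10],
  ['*', 20],
  ['/', 20],
];

// Builds one left-associative alternative per operator.
// choice, seq, field, and prec are globals supplied by the tree-sitter CLI;
// $.expression is assumed to be defined elsewhere in the grammar.
const binaryExpression = $ => choice(
  ...BINARY_OPERATORS.map(([operator, precedence]) =>
    prec.left(precedence, seq(
      field('left', $.expression),
      field('operator', operator),
      field('right', $.expression)
    ))
  )
);
```

In grammar.js this would be wired in as binary_expression: binaryExpression. Adding an operator then means adding one row to the table rather than another hand-written alternative.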
Field Names for AST Navigation
function_call: $ => seq(
  field('function', $.identifier),
  '(',
  field('arguments', optional($.argument_list)),
  ')'
)
Benefits:
- •Semantic clarity in AST
- •Easier tree-sitter query writing
- •Better error messages
External Tokens for Context
When lexing depends on parser state:
module.exports = grammar({
  name: 'my_language',
  externals: $ => [
    $.indent,
    $.dedent,
    $.newline
  ],
  // ... rest of grammar
});
The external scanner produces these tokens. Their order in externals must match the order of the scanner's TokenType enum.
Hiding Intermediate Nodes
Prefix a rule name with _ to hide it from the final AST:
_statement: $ => choice(
  $.return_statement,
  $.expression_statement,
  $.declaration
)
Result: AST has return_statement, not _statement nodes.
Debugging Strategies
Parse Tree Inspection
# View full parse tree
tree-sitter parse file.txt
# Look for ERROR nodes
tree-sitter parse file.txt | grep ERROR
Common issues:
- •ERROR nodes indicate failed parse
- •Check surrounding context for grammar gaps
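Beyond grep, it can help to pull the positions of problem nodes out of the parse output so they can be mapped back to the source. The sketch below assumes the S-expression output prints each node with a range of the form [row, column] - [row, column] and that rows and columns are zero-based; the function name findErrorNodes is hypothetical, and the input is the captured text of tree-sitter parse, not a tree-sitter API object.

```javascript
// Hypothetical helper: scan captured `tree-sitter parse` output for ERROR and
// MISSING nodes. Assumes each node is printed as
//   (node_name [startRow, startCol] - [endRow, endCol]
function findErrorNodes(parseOutput) {
  const pattern = /\((ERROR|MISSING)([^\[\n]*)\[(\d+), (\d+)\] - \[(\d+), (\d+)\]/g;
  const problems = [];
  for (const match of parseOutput.matchAll(pattern)) {
    problems.push({
      kind: match[1],              // 'ERROR' or 'MISSING'
      detail: match[2].trim(),     // e.g. the name of the missing token, if printed
      start: { row: Number(match[3]), column: Number(match[4]) },
      end: { row: Number(match[5]), column: Number(match[6]) },
    });
  }
  return problems;
}
```

The start positions point at where the parser first gave up, which is usually the place to look for a missing or overly strict grammar rule.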
Conflict Resolution
tree-sitter generate --log
Read conflict messages:
- •Shift/reduce: Parser unsure whether to shift token or reduce rule
- •Reduce/reduce: Multiple rules could apply
Solutions:
- •Add prec() to disambiguate
- •Rewrite grammar to avoid ambiguity
- •Use external scanner for context
Test-Driven Development
- •Write failing test in test/corpus/
- •Run tree-sitter test to see mismatch
- •Adjust grammar
- •Run tree-sitter generate
- •Repeat until test passes
Binary Search for Regressions
If tests suddenly fail:
- •Binary search through recent commits
- •Compare src/node-types.json between versions
- •Check for breaking changes in tree-sitter CLI version
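Comparing two versions of the node types file by eye is tedious, so a small script can help. The sketch below assumes the file is a JSON array whose entries each carry a type string and a named boolean; the function name diffNodeTypes is hypothetical, and it only reports node kinds that appeared or disappeared, not changes to fields or children.

```javascript
// Hypothetical helper: report node kinds added or removed between two parsed
// node-types.json arrays. Each entry is assumed to have `type` and `named`.
function diffNodeTypes(oldTypes, newTypes) {
  // Named and anonymous nodes can share a type string, so include both in the key
  const key = entry => `${entry.named ? 'named' : 'anonymous'}:${entry.type}`;
  const oldKeys = new Set(oldTypes.map(key));
  const newKeys = new Set(newTypes.map(key));
  return {
    added: [...newKeys].filter(k => !oldKeys.has(k)).sort(),
    removed: [...oldKeys].filter(k => !newKeys.has(k)).sort(),
  };
}
```

Each input would typically come from JSON.parse applied to the contents of the file at a given commit. A removed entry is a likely cause of corpus tests or queries that suddenly stop matching.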
Performance Considerations
Parse Speed
- •Avoid exponential blowup in repeat(choice(...)) patterns
- •Use an external scanner for complex lexing (usually faster than expressing it in grammar rules)
- •Profile with tree-sitter parse --time file.txt
Memory Usage
- •Large grammars produce large parser tables (src/parser.c size)
- •Consider splitting very complex grammars
- •Test with realistic file sizes, not just toy examples
Integration with Grammatic
Expected workflow:
- •Grammar lives in grammars/<name>/ as git submodule
- •Run just generate <name> to create parser.c
- •Run just build <name> to compile .so
- •Run just test-grammar <name> for corpus tests
- •Run just parse <name> file.txt for ad-hoc testing
Build script handles:
- •Scanner detection (C vs C++)
- •Compiler selection (gcc vs g++)
- •Platform-specific flags
- •Output to build/<name>.so
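The decisions listed above can be sketched as a small planning function. This is not Grammatic's actual build script: the function name planBuild, its parameters, and the intermediate object file names are all hypothetical, the platform strings are assumed to follow Node's process.platform values, and the compiler flags are the ones from the Compilation section.

```javascript
// Hypothetical sketch of the build decisions; not Grammatic's real script.
// `srcFiles` lists the file names found in the grammar's src/ directory;
// `platform` is assumed to follow process.platform ('linux', 'darwin', ...).
function planBuild(name, srcFiles, platform) {
  const hasCppScanner = srcFiles.includes('scanner.cc');
  const hasCScanner = srcFiles.includes('scanner.c');
  const sharedFlag = platform === 'darwin' ? '-dynamiclib' : '-shared';
  const common = '-fPIC -O2 -I./src';
  const output = `build/${name}.so`;

  if (hasCppScanner) {
    // Compile parser.c as C and scanner.cc as C++, then link with g++
    return [
      `gcc -c ${common} src/parser.c -o build/${name}-parser.o`,
      `g++ -c ${common} src/scanner.cc -o build/${name}-scanner.o`,
      `g++ ${sharedFlag} build/${name}-parser.o build/${name}-scanner.o -o ${output}`,
    ];
  }

  // Pure C: a single gcc invocation is enough
  const sources = hasCScanner ? 'src/parser.c src/scanner.c' : 'src/parser.c';
  return [`gcc ${sharedFlag} ${common} ${sources} -o ${output}`];
}
```

Returning the commands as strings keeps the decision logic easy to inspect and test separately from actually running the compiler.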
Log writer captures:
- •Grammar commit hash (version tracking)
- •Build success/failure
- •Build time
- •Parse node count
- •Parse errors
References
- •Tree-sitter docs: https://tree-sitter.github.io/tree-sitter/
- •Creating parsers guide: https://tree-sitter.github.io/tree-sitter/creating-parsers
- •Example grammars: https://github.com/tree-sitter
- •Query syntax: https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries