CANN Operator Generation
Generate optimized NPU kernels by writing solution.json.
Workflow
Step 1: Research APIs
code
cann_get_knowledge() # List all API categories
cann_search_api("DataCopyPad") # Get API signature → returns header_file path
cann_search_operator("avg_pool") # Find similar implementations → returns file paths
IMPORTANT: These tools return file paths, not code. Use the Read tool to view the actual content.
Example workflow:
code
1. cann_search_operator("avg_pool") → {"primary": {"kernel_files": ["/path/to/kernel.h"]}}
2. Read("/path/to/kernel.h") → see actual implementation
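The path-then-read pattern above can be sketched in a few lines. This is a minimal illustration, assuming the search result is a dict shaped exactly like the example (`primary` → `kernel_files`); the helper name `collect_kernel_files` is hypothetical and not part of the CANN tooling.

```python
def collect_kernel_files(search_result):
    """Return the kernel file paths from a cann_search_operator-style result.

    Assumes the dict layout shown in the example workflow:
    {"primary": {"kernel_files": [...]}}. Missing keys yield an empty list.
    """
    primary = search_result.get("primary") or {}
    return list(primary.get("kernel_files") or [])


# Sample result copied from the example workflow above.
result = {"primary": {"kernel_files": ["/path/to/kernel.h"]}}
for path in collect_kernel_files(result):
    print(f"Read this file next: {path}")
```

Each path printed here still has to be opened with the Read tool; the search result itself never contains the code.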
Step 2: Read Input Files
- constraints.md - Code template structure & JSON format requirements
- vector.md or cube.md - Hardware specs & critical rules
- signature.json - Operator interface (inputs, outputs, params)
- python_reference.py - Reference implementation
- solution_template.json - Example JSON format
Step 3: Pre-flight Checks
Before writing code, verify:
Output dimensions alignment:
- If the output width or height is NOT a multiple of 8 elements (for float32, 8 elements fill one 32-byte block), you MUST use DataCopyPad instead of DataCopy
- Example: outW=46 → use DataCopyPad to write exactly 46 elements
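The alignment rule above reduces to a single modulo test. The sketch below is a host-side illustration only, not kernel code; the function name `choose_copy_api` is hypothetical, and the default multiple of 8 is taken directly from the rule as stated.

```python
def choose_copy_api(element_count, multiple=8):
    """Pick the copy API for a row of `element_count` output elements.

    `multiple=8` mirrors the pre-flight rule above; pass a different value
    only if your hardware notes state a different alignment unit.
    """
    if element_count <= 0:
        raise ValueError("element_count must be positive")
    return "DataCopy" if element_count % multiple == 0 else "DataCopyPad"


print(choose_copy_api(46))  # DataCopyPad (46 % 8 == 6)
print(choose_copy_api(48))  # DataCopy    (48 % 8 == 0)
```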
Unfamiliar APIs:
- If you are unsure about an API, call cann_search_api("ApiName") first
- Then Read the returned header_file to see the exact signature
Step 4: Write solution.json
CRITICAL: Use the Write tool to create solution.json
All 6 fields are required:
- kernel_impl - Kernel class definition
- kernel_entry_body - Instantiates and calls the kernel
- tiling_fields - JSON array: [{"type": "T", "name": "N"}, ...]
- tiling_func_body - Host-side tiling calculation
- infer_shape_body - Output shape inference
- output_alloc_code - C++ code: at::Tensor result = ...;
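Before writing the file, it can help to check a draft against the field list above. The following is a minimal sketch, assuming solution.json is a flat JSON object keyed by those six names; `check_solution` is a hypothetical helper, and the draft contents are made up to trigger the checks.

```python
import json

REQUIRED_FIELDS = (
    "kernel_impl", "kernel_entry_body", "tiling_fields",
    "tiling_func_body", "infer_shape_body", "output_alloc_code",
)


def check_solution(solution):
    """Return a list of problems found in a parsed solution.json dict."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in solution]
    if "tiling_fields" in solution:
        fields = solution["tiling_fields"]
        if not isinstance(fields, list):
            # A string here is the mistake behind the
            # "TILING_DATA_FIELD_DEF requires 2 arguments" error.
            problems.append("tiling_fields must be a JSON array, not "
                            + type(fields).__name__)
        else:
            for i, entry in enumerate(fields):
                ok = isinstance(entry, dict) and all(
                    key in entry for key in ("type", "name"))
                if not ok:
                    problems.append(f"tiling_fields[{i}] needs 'type' and 'name'")
    return problems


# Made-up draft: four fields missing and tiling_fields given as a string.
draft = json.loads('{"kernel_impl": "", "tiling_fields": "uint32_t totalLength"}')
for problem in check_solution(draft):
    print(problem)
```

An empty list from `check_solution` only means the structure looks right; it says nothing about whether the kernel code inside the fields compiles.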
Quick Reference
Operator Naming
For operator foo_bar:
- Kernel class: KernelFooBar
- Tiling class: FooBarCustomTilingData
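The naming rule can be expressed as a small conversion. This sketch generalizes from the single foo_bar example, assuming each underscore-separated part is capitalized and joined; `operator_class_names` is a hypothetical helper, and operator names containing digits or existing capitals may follow a different convention in the real templates.

```python
def operator_class_names(op_name):
    """Derive (kernel class, tiling class) from a snake_case operator name."""
    parts = [part for part in op_name.split("_") if part]
    if not parts:
        raise ValueError("operator name is empty")
    # Capitalize only the first letter of each part; leave the rest untouched.
    camel = "".join(part[0].upper() + part[1:] for part in parts)
    return f"Kernel{camel}", f"{camel}CustomTilingData"


print(operator_class_names("foo_bar"))   # ('KernelFooBar', 'FooBarCustomTilingData')
print(operator_class_names("avg_pool"))  # ('KernelAvgPool', 'AvgPoolCustomTilingData')
```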
Buffer Initialization
cpp
// TQue: 3 arguments (que, BUFFER_NUM, size)
TQue<QuePosition::VECIN, 2> inQue;
pipe.InitBuffer(inQue, 2, bufferSize);

// TBuf: 2 arguments only (buf, size) - NO BUFFER_NUM!
TBuf<QuePosition::VECCALC> tmpBuf;
pipe.InitBuffer(tmpBuf, scratchSize);
Data Transfer
cpp
// ✅ Always use DataCopy for GM↔UB (uses DMA)
DataCopy(localTensor, xGm[offset], count);

// ⚠️ Avoid GetPhyAddr() for data transfer - very slow
// float val = *((__gm__ float*)xGm.GetPhyAddr() + idx); // ~100x slower!
Error Troubleshooting
| Error | Likely Cause |
|---|---|
| TILING_DATA_FIELD_DEF requires 2 arguments | tiling_fields is a string; it should be a JSON array |
| OpCustomTilingData not declared | Wrong tiling class name; use {OpName}CustomTilingData |
| InitBuffer not supports T as TBuf | TBuf takes 2 args: pipe.InitBuffer(buf, size) |
| Output value mismatch | Non-aligned output → use DataCopyPad |
| 507035 vector core exception | Vector operation count < 8 |