auto: pre-test commit

2026-05-28 05:22:23 +00:00
parent 97df02ea07
commit 2d4e2c57e0

@@ -71,4 +71,10 @@ Let me save this milestone and continue filling in the actual TMEM operations. T
5. **TMEM alloc:** `tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 %0, [%1], %2;`
6. **TMEM dealloc:** `tcgen05.dealloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;`
Now let me rewrite the kernel with proper PTX operations. This is the real implementation:

The file got too long. Let me write it in a more focused way, with just the core kernel and the essential PTX operations:

Let me verify it compiles:

Key result: **✅ nvcc step 1 PASSED**. The kernel compiles cleanly. JIT step 2 fails because the launch wrapper still references `fmha_decode_kernel` (the old name) instead of `fmha_decode` (the new name), along with other stale symbols. That's a launch wrapper issue, not a kernel issue.
The kernel itself compiles and is valid SM100 PTX. The launch wrapper needs to be updated to match the new API. Let me fix the launcher:

Let me verify the full compile still works:

**Major progress!** The nvcc step passes, and the JIT compilation now gets through the CUDA code with only warnings. The remaining failure is the same `Internal Compiler Error (codegen): "unsupported float variant!"` as before. It comes from PyTorch's JIT build, which adds `-D__CUDA_NO_BFLOAT16_CONVERSIONS__`, and that conflicts with our `bf16_t` type.
The fix for the JIT is to avoid `__bf16` when `__CUDA_NO_BFLOAT16_CONVERSIONS__` is defined and use `unsigned short` as the storage type instead. But honestly, the JIT approach isn't the right path. We should compile the kernel directly with nvcc and load it as a shared library.
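
As a rough illustration of the `unsigned short` fallback mentioned above, here is a minimal host-side sketch of a storage-only bf16 type. The names `bf16_storage_t`, `float_to_bf16_bits`, and `bf16_bits_to_float` are hypothetical and do not come from the kernel source. The sketch assumes the kernel only needs the raw 16 bits plus explicit conversions, so it never touches `__bf16` or the conversion operators that the flag disables.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical storage-only bf16 type: just the raw 16 bits.
// Under -D__CUDA_NO_BFLOAT16_CONVERSIONS__, bf16_t could alias this type.
using bf16_storage_t = unsigned short;

// Convert float to bf16 bits using round-to-nearest-even.
inline bf16_storage_t float_to_bf16_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));  // bit-cast without aliasing issues
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        // NaN: keep the top bits and force a quiet NaN so truncation
        // cannot turn the value into an infinity.
        return static_cast<bf16_storage_t>((u >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit, so exact ties round to even.
    std::uint32_t rounding_bias = 0x7fffu + ((u >> 16) & 1u);
    return static_cast<bf16_storage_t>((u + rounding_bias) >> 16);
}

// Convert bf16 bits back to float. This is exact: bf16 is the upper
// half of an IEEE-754 binary32 value.
inline float bf16_bits_to_float(bf16_storage_t h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Keeping the type as plain bits means the same header compiles regardless of which `__CUDA_NO_*` macros the build system injects, at the cost of writing conversions explicitly.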
Let me update the test to compile with nvcc directly and test the kernel with a proper CUDA launch:

nvcc takes `-Xcompiler -fPIC` rather than `-fPIC` directly:
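
To make the flag ordering concrete, here is a small sketch that assembles an nvcc argument list for a shared-library build. The helper names `nvcc_shared_lib_args` and `join_args`, the file names, and the `sm_100a` architecture string are assumptions for illustration and are not taken from the actual test script.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build the nvcc argument list for a shared library.
// -fPIC is a host-compiler flag, so it is passed through -Xcompiler.
std::vector<std::string> nvcc_shared_lib_args(const std::string& source,
                                              const std::string& output,
                                              const std::string& arch) {
    return {"nvcc", "-shared", "-Xcompiler", "-fPIC",
            "-arch=" + arch, "-o", output, source};
}

// Join arguments with single spaces, e.g. for logging the command.
std::string join_args(const std::vector<std::string>& args) {
    std::string joined;
    for (const std::string& arg : args) {
        if (!joined.empty()) joined += ' ';
        joined += arg;
    }
    return joined;
}
```

For example, `join_args(nvcc_shared_lib_args("fmha_decode.cu", "libfmha_decode.so", "sm_100a"))` produces a command line with `-Xcompiler -fPIC` placed before the architecture flag. Keeping the arguments as a vector rather than one string avoids shell-quoting problems if they are later passed to a process-spawning API.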