diff --git a/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md b/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md
index 3be5adb6..d4c0bf22 100644
--- a/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md
+++ b/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md
@@ -71,4 +71,10 @@ Let me save this milestone and continue filling in the actual TMEM operations. T
 5. **TMEM alloc:** `tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 %0, [%1], %2;`
 6. **TMEM dealloc:** `tcgen05.dealloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;`
 
-Now let me rewrite the kernel with proper PTX operations. This is the real implementation:The file got too long. Let me write it
\ No newline at end of file
+Now let me rewrite the kernel with proper PTX operations. This is the real implementation: The file got too long. Let me write it in a more focused way, just the core kernel with the essential PTX operations: Let me verify it compiles: Key result: **✅ nvcc step 1 PASSED**: the kernel compiles cleanly. The JIT step 2 fails because the launch wrapper references `fmha_decode_kernel` (old name) instead of `fmha_decode` (new name), along with other stale symbols. That's a launch wrapper issue, not a kernel issue.
+
+The kernel itself compiles and is valid SM100 PTX. The launch wrapper needs to be updated to match the new API. Let me fix the launcher: Let me verify the full compile still works: **Major progress!** The nvcc step passes, and the JIT compilation actually gets through the CUDA code (only warnings!). The failure is the same `Internal Compiler Error (codegen): "unsupported float variant!"` from PyTorch's JIT, which adds `-D__CUDA_NO_BFLOAT16_CONVERSIONS__` and conflicts with our `bf16_t` type.
+
+The fix for the JIT is to not use `__bf16` when `__CUDA_NO_BFLOAT16_CONVERSIONS__` is defined, and use `unsigned short` instead. But honestly, the JIT approach isn't the right path; we should compile the kernel directly with nvcc and load it as a shared library.
+
+Let me update the test to compile with nvcc directly and test the kernel with a proper CUDA launch: nvcc uses `-Xcompiler -fPIC` instead of `-fPIC` directly:
\ No newline at end of file
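
The `bf16_t` fallback described in the notes (use `unsigned short` rather than `__bf16` when `__CUDA_NO_BFLOAT16_CONVERSIONS__` is defined) could look roughly like the sketch below. Only `bf16_t`, `__bf16`, and `__CUDA_NO_BFLOAT16_CONVERSIONS__` come from the notes; the `FMHA_NATIVE_BF16` opt-in macro, `FMHA_BF16_IS_RAW`, `FMHA_HD`, and the two conversion helpers are hypothetical names introduced for illustration, and the actual kernel may handle conversions differently.

```cpp
#include <stdint.h>
#include <string.h>

// Expand to CUDA qualifiers only under nvcc, so the header also builds as plain C++.
#if defined(__CUDACC__)
#define FMHA_HD __host__ __device__
#else
#define FMHA_HD
#endif

// Assumption: FMHA_NATIVE_BF16 is a hypothetical opt-in for the native type.
// Whenever PyTorch's JIT defines __CUDA_NO_BFLOAT16_CONVERSIONS__, fall back
// to raw 16-bit storage so the two definitions cannot conflict.
#if defined(__CUDA_NO_BFLOAT16_CONVERSIONS__) || !defined(FMHA_NATIVE_BF16)
typedef unsigned short bf16_t;
#define FMHA_BF16_IS_RAW 1
#else
typedef __bf16 bf16_t;
#define FMHA_BF16_IS_RAW 0
#endif

static_assert(sizeof(bf16_t) == 2, "bf16_t must be 16 bits wide");

#if FMHA_BF16_IS_RAW
// float -> bf16 bits, rounding the dropped 16 mantissa bits to nearest-even.
FMHA_HD inline bf16_t float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate, then force a mantissa bit so it stays a NaN.
        return static_cast<bf16_t>((bits >> 16) | 0x0040u);
    }
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<bf16_t>((bits + rounding) >> 16);
}

// bf16 bits -> float: bf16 is the upper half of an IEEE-754 binary32.
FMHA_HD inline float bf16_to_float(bf16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
#endif
```

With raw storage the kernel only moves 16-bit values around, so the same header should compile under both the JIT's flags and a direct nvcc build with `-Xcompiler -fPIC`; any arithmetic on the values would go through explicit conversions such as the helpers above.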