auto: pre-test commit

2026-05-28 05:22:23 +00:00
parent 97df02ea07
commit 2d4e2c57e0
1 changed files with 7 additions and 1 deletions
--- a/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md
+++ b/NVFP4-1.1_RAW_CUDA_CPP_APPROACH.md
@@ -71,4 +71,10 @@ Let me save this milestone and continue filling in the actual TMEM operations. T
 5. **TMEM alloc:** `tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 %0, [%1], %2;`
 6. **TMEM dealloc:** `tcgen05.dealloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;`

-Now let me rewrite the kernel with proper PTX operations. This is the real implementation:The file got too long. Let me write it
+Now let me rewrite the kernel with proper PTX operations. This is the real implementation: The file got too long. Let me write it in a more focused way, just the core kernel with the essential PTX operations: Let me verify it compiles: Key result: **✅ nvcc step 1 PASSED**: the kernel compiles cleanly. JIT step 2 fails because the launch wrapper references `fmha_decode_kernel` (old name) instead of `fmha_decode` (new name), along with other stale symbols. That's a launch wrapper issue, not a kernel issue.
+
+The kernel itself compiles and is valid SM100 PTX. The launch wrapper needs to be updated to match the new API. Let me fix the launcher: Let me verify the full compile still works: **Major progress!** The nvcc step passes, and the JIT compilation actually gets through the CUDA code (only warnings!). The failure is the same `Internal Compiler Error (codegen): "unsupported float variant!"` from PyTorch's JIT, which adds `-D__CUDA_NO_BFLOAT16_CONVERSIONS__` and conflicts with our `bf16_t` type.
+
+The fix for the JIT is to not use `__bf16` when `__CUDA_NO_BFLOAT16_CONVERSIONS__` is defined, and use `unsigned short` instead. But honestly, the JIT approach isn't the right path — we should compile the kernel directly with nvcc and load it as a shared library.
+
+Let me update the test to compile with nvcc directly and test the kernel with a proper CUDA launch: nvcc does not accept `-fPIC` directly; the flag has to be forwarded to the host compiler as `-Xcompiler -fPIC`: