auto: pre-test commit

2026-05-28 03:49:03 +00:00
parent 6f94925491
commit 0ecb98daee
1 changed files with 79 additions and 1 deletions
--- a/NVFP4-1.1_APPROACH_PART_2.md
+++ b/NVFP4-1.1_APPROACH_PART_2.md
@@ -114,4 +114,82 @@ Let me design this more carefully and start writing the kernel modifications. Bu

 Let me look at how the existing codebase writes scalar values to GMEM from within a kernel. The answer is: they don't. All writes go through `cute.copy` with TMA or regular copy atoms. For the FP4 output, I need to construct proper CuTe tensors and use `cute.copy`.

-Let me redesign the test kernel to use `cute.copy` with a `CopyUniversalOp` atom for the output writes:
+Let me redesign the test kernel to use `cute.copy` with a `CopyUniversalOp` atom for the output writes:
+
+The error is at `vals_f32[i] = ...`: CuTeDSL can't use a dynamic loop variable `i` as a Python list index. The `i` from `cutlass.range` is an `ArithValue` (an MLIR SSA value), not a Python int, so I can't use it to index into a Python list.
+
+This is a fundamental issue with my approach. The `fp4_quant.py` functions use Python lists like `vals_f32 = [cutlass.Float32(0.0)] * 16` and index them with `vals_f32[i]`. But CuTeDSL's `i` is an MLIR value, not a Python int.
+
+I need to use CuTe register tensors instead of Python lists, for example `cute.make_rmem_tensor` for the accumulator buffer.
+
+Actually, looking at the existing kernel code, register tensors are used with `cute.make_rmem_tensor`. The SwiGLU epilogue uses `silu_gate_buf = cute.make_rmem_tensor(tiled_copy_r2s.retile(tTR_rAcc).shape, self.c_dtype)`. I should use the same pattern for the SwiGLU accumulation buffer.
+
+But for the test kernel, I need a simpler approach. Instead of accumulating in a Python list, I should process the 16 elements one at a time in the loop, computing amax, then doing a second pass for E2M1 quantization.
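
The two-pass idea can be sketched in plain Python (this is a reference model, not CuTeDSL). The names `quantize_block`, `load`, and `E2M1_MAGNITUDES` are hypothetical; the scale is kept as a Python float rather than rounded to FP8 E4M3, and ties between two E2M1 values go to the smaller magnitude here, which may differ from hardware rounding.

```python
# Plain-Python reference model (not CuTeDSL). No list is indexed by a traced
# loop variable: each element is read through `load(i)` when it is needed.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitude codes 0..7


def quantize_block(load, n=16):
    """Quantize one block in two passes; `load(i)` stands in for a register read."""
    # Pass 1: running absolute maximum, held in a single scalar.
    amax = 0.0
    for i in range(n):
        amax = max(amax, abs(load(i)))

    # Scale so the largest magnitude maps to 6.0, the largest E2M1 value.
    # Assumption: an all-zero block uses scale 1.0 to avoid dividing by zero.
    scale = amax / 6.0 if amax > 0.0 else 1.0

    # Pass 2: quantize each element as it is read.
    codes = []
    for i in range(n):
        x = load(i) / scale
        mag = min(range(8), key=lambda c: abs(E2M1_MAGNITUDES[c] - abs(x)))
        codes.append(mag | (0x8 if x < 0.0 else 0x0))  # bit 3 is the sign
    return scale, codes
```

In a kernel, `load(i)` would presumably be a read from a register tensor and the codes would be packed as they are produced instead of being appended to a list.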
+
+Let me rewrite the test kernel to avoid Python list indexing with CuTeDSL variables:
+
+OK, I'm going in circles on the test kernel's I/O. Let me step back and think about this properly.
+
+The real question is: **how does the actual fused_swiglu kernel write to GMEM?** It uses `cute.copy(tma_atom_c, bSG_sC, bSG_gC)` with a TMA store atom. The BF16 data goes SMEM → GMEM via TMA.
+
+For the FP4 output, I need a similar path: registers → SMEM → GMEM. The FP4 data is 9 bytes per 16-element block (8 FP4 + 1 SF). In the actual kernel, I'd write to SMEM and then do a TMA store or a regular GMEM store.
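
The 9-bytes-per-block figure can be illustrated with a small plain-Python sketch. `pack_block` and `unpack_block` are hypothetical helpers, and placing the even-indexed element in the low nibble is an assumption about the packing order, not something these notes confirm. The single contiguous 9-byte buffer is for illustration only; the notes describe separate FP4 and scale-factor tensors.

```python
def pack_block(codes, sf_byte):
    """Pack 16 four-bit codes plus one scale-factor byte into 9 bytes."""
    if len(codes) != 16:
        raise ValueError("expected 16 codes per block")
    out = bytearray()
    for k in range(0, 16, 2):
        lo = codes[k] & 0xF
        hi = codes[k + 1] & 0xF
        # Assumption: even-indexed element goes in the low nibble.
        out.append(lo | (hi << 4))
    out.append(sf_byte & 0xFF)
    return bytes(out)


def unpack_block(buf):
    """Inverse of pack_block: return (codes, sf_byte)."""
    codes = []
    for b in buf[:8]:
        codes += [b & 0xF, b >> 4]
    return codes, buf[8]
```

A round trip through `unpack_block(pack_block(codes, sf))` gives a quick CPU-side check that the nibble order used by a kernel matches what the host expects.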
+
+But for the test, I don't need TMA. I just need to write 9 Int32 values (8 FP4 bytes + 1 SF byte) to GMEM. The simplest approach that WORKS in CuTeDSL:
+
+1. Create a register tensor with the 8 FP4 bytes and 1 SF byte
+2. Copy it to GMEM using `cute.copy` with a simple copy atom
+
+Let me look at how `cute.copy` is used with `CopyUniversalOp` in the codebase:
+
+OK, so the pattern is: `cute.copy(copy_atom, rmem_tensor, gmem_tensor)`. The copy atom uses `CopyUniversalOp()` with `num_bits_per_copy=128` for 4 × Int32 (128 bits). For my test, I need 8 × Int32 for FP4 + 1 × Int32 for SF.
+
+But actually, I realize there's a much simpler approach. Since the FP4 data will be written to GMEM in the actual kernel, I should design the GMEM write pattern to match how the L2 GEMM reads it. The L2 GEMM reads `padded_activated_fp4` which is `(tokens, intermediate//2)` in `float4_e2m1fn_x2` format with scales `(tokens, intermediate//16)` in `float8_e4m3fn`.
+
+For the test, I just need to verify the math works on GPU. Let me use a completely different approach: **compute the FP4 quantization in the kernel, write the result back as BF16 (just the dequantized values), and compare with the original BF16 after round-trip.** This avoids the FP4 byte write entirely.
+
+Actually no, that doesn't verify the bit-exact FP4 output. Let me try yet another approach: write the FP4 bytes into a BF16 output tensor (packing 2 bytes per BF16 value) and unpack on the CPU side.
+
+OK, I'm overcomplicating this. Let me just check if `cute.arch.store` exists by trying a different way. The existing codebase uses `cute.arch.load(ptr, dtype)` for scalar loads. Maybe there's an equivalent `cute.arch.store` or `cute.arch.global_store` that I can discover on B200.
+
+Actually, looking at the CUTLASS Python source more carefully, there should be a `cute.arch.store`, the counterpart of `cute.arch.load`. Let me check the CUTLASS Python documentation or source:
+
+The docs don't show `cute.arch.store`; they only show thread_idx, barriers, SMEM/TMEM management, etc. The `cute.arch.load` used in the codebase must be from a different module or a custom wrapper. Let me check:
+
+So `cute.arch.load(ptr, dtype)` loads from a GMEM pointer. There must be a corresponding `cute.arch.store`. Let me search the CUTLASS GitHub:
+
+There's a GitHub issue about `cute.arch.load` with BFloat16. And from issue #3266, `cute.arch.load` exists but has issues with BFloat16. Let me check if `cute.arch.store` exists:
+
+Let me just skip the CuTeDSL kernel test for now and instead verify the math works by running it within the actual fused SwiGLU kernel. The FP4 quantization math is already verified in Python (0/500 failures). The CuTeDSL kernel test is just to verify that the CuTeDSL implementation compiles and produces the same results on GPU.
+
+Instead of fighting with `cute.arch.store`, let me move directly to integrating the FP4 quantization into the actual SwiGLU epilogue. I'll write the FP4 bytes to SMEM and then to GMEM using the same patterns the MoE kernel already uses.
+
+Let me focus on the kernel modification now:
+
+I see: the kernel signature doesn't have `fp4_out`, `sf_out`, `l2_global_scale` yet. These were added to `__call__` but never wired to `kernel()`. I need to add them.
+
+This is a significant change to the kernel. Let me plan it carefully:
+
+**Changes needed to `fused_swiglu.py`:**
+
+1. Add `fp4_out`, `sf_out`, `l2_global_scale` parameters to `kernel()`
+2. In the epilogue warp section, when `fp4_mode=True`:
+   - For gate subtiles: compute SiLU(gate), save to `silu_gate_buf`. **Skip SMEM write and TMA store.**
+   - For up subtiles: compute SwiGLU result. **Save to a `swiglu_buf` register buffer** (2 consecutive up subtiles = 16 values).
+   - After every 2 up subtiles: compute amax + FP8 E4M3 scale + E2M1 pack. **Write FP4 bytes to SMEM, then to GMEM** (via TMA store to the FP4 output tensor).
+3. Write FP8 scale factors similarly.
+4. When `fp4_mode=False`: keep existing BF16 path unchanged.
+
+The biggest challenge is the GMEM write for FP4. Let me look at how TMA stores work for the C tensor:
+
+The TMA store for C is:
+1. `tma_atom_c, tma_tensor_c = cpasync.make_tiled_tma_atom(c_tma_op, c_gemm, epi_smem_layout, epi_tile)` — create TMA atom for C
+2. `bSG_sC, bSG_gC = cpasync.tma_partition(...)` — partition for TMA store
+3. `cute.copy(tma_atom_c, bSG_sC, bSG_gC, tma_desc_ptr=desc_ptr_c)` — do the TMA store
+
+For the FP4 output, I'd need a similar TMA setup. But the FP4 output tensor has a different shape and dtype than the BF16 C tensor. This is complex and requires careful layout design.
+
+Let me take a more pragmatic (but still correct) approach: **write the FP4 bytes directly from registers to GMEM using `cute.copy` with a non-TMA copy atom.** The FP4 output doesn't need TMA because:
+1. The write is small (9 bytes per 16-element block)
+2. The FP4 output tensor is contiguous and simple
+3. TMA setup for FP4 is complex and may not be needed for correctness
+
+The simplest write pattern: each thread computes its FP4 bytes and writes them to GMEM via a `CopyUniversalOp` copy atom. But I need to figure out how to construct the GMEM tensor for the FP4 output.
+
+Actually, let me look at how the `cute.arch.load` call works — it uses raw pointer arithmetic. For the FP4 output, I can do the same: compute the GMEM address from the work tile info and write using a similar pattern.
+
+But `cute.arch.store` doesn't exist. Let me check if there's a `cute.copy` that works with raw pointers:
+
+Wait — actually, the `cute.arch.load` in the codebase is used for scalar loads from GMEM. If there's a `cute.arch.load`, there should be a `cute.arch.store` — they're symmetric operations. The fact that it's not in the docs doesn't mean it doesn't exist. It might just be undocumented.
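
Whether an undocumented function exists can be checked by introspection rather than inferred from the docs. The sketch below is generic Python; `has_api` is a hypothetical helper, and the module path `cutlass.cute` mentioned afterwards is an assumption about how the package is imported on the target machine.

```python
import importlib


def has_api(module_name, attr_path):
    """Return True if `module_name` imports and exposes the dotted `attr_path`."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk the dotted path one attribute at a time.
    for name in attr_path.split("."):
        if not hasattr(obj, name):
            return False
        obj = getattr(obj, name)
    return True
```

On the GPU host, a call such as `has_api("cutlass.cute", "arch.store")` would answer the question directly, under the import-path assumption above.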
+
+Let me just try it on B200. If `cute.arch.store` doesn't exist, I'll adapt. But I need to first get the test kernel to compile. The previous attempt failed because `cute.arch.store` was used. Let me try a different approach for the test: write the results to the output tensor via `cute.copy` with a simple copy atom, using proper CuTe tensor construction.
+
+Actually, I just realized I can look at the B200's CUTLASS Python installation to check:
+
+**`cute.arch.store` EXISTS on the B200!** The test just failed because of the Python list indexing issue, not because `cute.arch.store` doesn't exist. Let me fix the test kernel properly and rerun:
+
+Two errors found:
+1. `cute.math.fmax` doesn't exist. The correct APIs are `cute.arch.fmax` for max and `cute.math.fmin` for min (based on the README's CuTeDSL constraint #4)
+2. The `fp4_quant.py` functions also use `cute.math.fmax` which won't work
+
+Looking at the README:
+> `cute.arch.fmax` is impure for the vectorizer. Use it inside plain `range()`, never inside `vectorize=True`.
+
+So it's `cute.arch.fmax`, not `cute.math.fmax`! And for `fmin`, it's `cute.math.fmin`. Let me check the codebase for the exact APIs:
+
+So `cute.arch.fmax` exists, and `cute.math.fmin` exists. But `cute.math.fmax` does NOT exist. Let me fix `fp4_quant.py` to use `cute.arch.fmax` instead of `cute.math.fmax`:
+
+Good, all `cute.math.fmax` replaced with `cute.arch.fmax`. Now fix the test kernel too: