5012703bad
fix: add SwiGLU clamping to fused kernel (paper §4.2.3, CG-1)
...
The fused SwiGLU kernel stored swiglu_limit but never applied it.
Paper §4.2.3: gate capped at swiglu_limit, linear clamped to [-limit, +limit].
Non-fused reference path already applies clamping correctly.
Fix: add fmin/fmax clamping in FP32 before BF16 conversion.
2026-05-23 06:32:54 +00:00
3fb3c925af
Restructure: cutedsl/ -> dsv4/ with proper layering
...
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00