- Add is_causal flag to FmhaKernel constructor
- Mask positions where k_coord > m_coord to -inf (causal attention)
- Combine with the D3 SWA mask using OR logic: a position is masked if either condition holds
- Same tTMEM_LOADcS coordinate mapping as D3
- Guarded by const_expr: zero overhead when is_causal=False
- New test: test_d4_causal_mask.py covering causal-only and combined (causal + SWA) masking
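
For reference, the masking rule described above can be sketched in NumPy. This is not the kernel code or the contents of test_d4_causal_mask.py. The names `reference_mask` and `apply_mask` are hypothetical, and the sliding-window rule (`k < m - window`) is an assumption, since the D3 window definition is not restated here.

```python
import numpy as np

def reference_mask(seq_q, seq_k, is_causal=False, window=None):
    """Return a boolean (seq_q, seq_k) array; True means 'set score to -inf'."""
    m = np.arange(seq_q)[:, None]  # query (row) coordinate, m_coord
    k = np.arange(seq_k)[None, :]  # key (column) coordinate, k_coord
    masked = np.zeros((seq_q, seq_k), dtype=bool)
    if is_causal:
        # Causal rule from this change: a query may not attend to later keys.
        masked |= k > m
    if window is not None:
        # ASSUMED sliding-window rule (the D3 definition may differ):
        # keys more than `window` positions behind the query are masked.
        masked |= k < m - window
    return masked

def apply_mask(scores, masked):
    """Replace masked scores with -inf so they vanish after softmax."""
    return np.where(masked, -np.inf, scores)
```

Because the two conditions are OR-ed into one boolean array, enabling both flags masks the union of the two regions, and leaving both off returns the scores unchanged.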