
Add MXFP8 attention #2719

Draft

cyanguwa wants to merge 50 commits into NVIDIA:main from cyanguwa:add_mxfp8

Conversation

@cyanguwa (Collaborator) commented Mar 1, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
This reverts commit d9ff566.

@cyanguwa cyanguwa marked this pull request as ready for review March 1, 2026 21:43
@cyanguwa cyanguwa marked this pull request as draft March 1, 2026 21:43
@cyanguwa cyanguwa closed this Mar 1, 2026
@cyanguwa cyanguwa deleted the add_mxfp8 branch March 1, 2026 21:44
@cyanguwa cyanguwa restored the add_mxfp8 branch March 1, 2026 21:50
greptile-apps bot (Contributor) commented Mar 1, 2026

Greptile Summary

This PR adds MXFP8 (Microscaling FP8) attention support to TransformerEngine, introducing a new quantization scheme with block-level scaling for attention operations.

Key Changes

  • Added MXFP8 scaling mode with FP8_E8M0 tensor reordering and block-level quantization
  • Introduced new BHSD_BHSD_BHSD layout for MXFP8 format with proper stride calculations
  • Extended fused attention APIs to support separate output formats (o_format, d_out_format, dqkv_layout)
  • Updated cuDNN frontend submodule for MXFP8 support (requires cuDNN >= 9.21.0, SM >= 10.0)
  • Added MXFP8BlockScaling recipe alongside existing DelayedScaling and CurrentScaling
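
To make the block-level scaling concrete, here is a minimal NumPy sketch of MXFP8-style quantization, assuming the OCP MX convention of one shared power-of-two (E8M0) scale per block of 32 values with FP8 E4M3 elements; `mxfp8_quantize_block` is an illustrative helper for this summary, not TransformerEngine's API.

```python
import numpy as np

def mxfp8_quantize_block(x, block=32, e4m3_max=448.0, emax=8):
    """Illustrative MXFP8-style quantization: each block of 32 values
    shares one power-of-two (E8M0) scale; a real kernel would then cast
    the scaled elements to FP8 E4M3 (here we only clip to its range)."""
    blocks = x.reshape(-1, block)
    amax = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 2.0**-126)
    exp = np.floor(np.log2(amax)) - emax   # E8M0 stores only this exponent
    scale = 2.0**exp
    q = np.clip(blocks / scale, -e4m3_max, e4m3_max)
    return q, scale

x = np.random.randn(2, 64).astype(np.float32)
q, scale = mxfp8_quantize_block(x)
recon = (q * scale).reshape(x.shape)  # dequantize: each block times its scale
```

The constants assume FP8 E4M3 (max normal 448, largest exponent 8); only the 8-bit exponent of each scale would be stored per block.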

Critical Issues Found

  • Multiple debug print statements left in production code across context_parallel.py, utils.py, grouped_tensor.py, and test files - will spam logs
  • Commented-out test assertions in test_attention.py (lines 2241-2251) - backward pass validation disabled, tests won't catch regressions
  • Commented-out environment variable NVTE_ALLOW_NONDETERMINISTIC_ALGO in tests - may cause test failures or unexpected behavior

Architecture

The implementation splits QK and V dimensions (d_qk vs d_v) to support Multi-Latent Attention (MLA), adds GroupedTensor storage for MXFP8 quantization with columnwise scaling, and extends backend selection logic to route MXFP8 workloads to appropriate kernels based on hardware capabilities.
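
As a rough illustration of that routing (a hypothetical helper; the real gating lives in TransformerEngine's backend-selection utilities), the version and architecture checks summarized above might look like:

```python
def select_attention_backend(recipe, cudnn_version, sm_arch):
    """Hypothetical sketch of backend routing: MXFP8 fused attention is
    gated on cuDNN >= 9.21.0 and SM >= 10.0 per the requirements above;
    other recipes fall through to the existing fused path."""
    if recipe == "mxfp8":
        if cudnn_version >= (9, 21, 0) and sm_arch >= (10, 0):
            return "fused_attn_mxfp8"
        return "unfused_fallback"  # unsupported cuDNN/GPU: fall back
    return "fused_attn"

backend = select_attention_backend("mxfp8", (9, 21, 0), (10, 0))
```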

Confidence Score: 2/5

  • Not safe to merge - contains debug code and disabled test assertions
  • Score reflects critical issues: production code has debug print statements that will spam logs, and backward pass test assertions are commented out meaning the tests won't catch regressions in MXFP8 backward pass correctness
  • Pay close attention to tests/pytorch/attention/test_attention.py (disabled assertions), transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py (3 print statements), and transformer_engine/pytorch/attention/dot_product_attention/utils.py (print + hard assertions)

Important Files Changed

  • tests/pytorch/attention/test_attention.py: test file with debug print statements and commented-out backward pass assertions for MXFP8 testing
  • transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py: major changes for MXFP8 support, with multiple debug print statements left in the code
  • transformer_engine/pytorch/attention/dot_product_attention/utils.py: MXFP8 quantizer setup and format conversion, with a debug print statement in combine_and_quantize
  • transformer_engine/common/fused_attn/fused_attn_fp8.cu: core CUDA implementation for MXFP8 attention with FP8_E8M0 tensor reordering
  • transformer_engine/common/fused_attn/utils.h: added BHSD_BHSD_BHSD layout support with proper stride calculations
  • transformer_engine/pytorch/tensor/storage/grouped_tensor.py: GroupedTensor storage with debug print statements during quantization
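
For the utils.h entry above, the "proper stride calculations" for a BHSD layout amount to row-major strides over [batch, heads, seq, dim]; a small sketch (illustrative helper, not the utils.h API):

```python
def bhsd_strides(b, h, s, d):
    """Row-major element strides for a [batch, heads, seq, dim] (BHSD)
    tensor; in the BHSD_BHSD_BHSD layout, Q, K, and V each use this same
    ordering, with their own seq and dim sizes (e.g. d_qk vs d_v)."""
    return (h * s * d, s * d, d, 1)

strides = bhsd_strides(2, 16, 128, 64)
```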

Last reviewed commit: d6ecadc

greptile-apps bot (Contributor) left a comment

33 files reviewed, 9 comments

Comment on lines 2241 to 2251

    print(f"fused_attn_bwd_fp8[{i}].max(): {fused_attn_bwd_fp8[i].max()}, fused_attn_bwd_f16[{i}].max(): {fused_attn_bwd_f16[i].max()}")
    print(f"fused_attn_bwd_fp8[{i}].min(): {fused_attn_bwd_fp8[i].min()}, fused_attn_bwd_f16[{i}].min(): {fused_attn_bwd_f16[i].min()}")
    # compare_and_assert(
    #     fused_attn_bwd_fp8[i],
    #     fused_attn_bwd_f16[i],
    #     f"fused_attn_bwd_fp8[{i}]",
    #     f"fused_attn_bwd_f16[{i}]",
    #     atol,
    #     rtol,
    #     rmse_tol,
    #     True,

Backward pass assertions commented out and replaced with debug prints - tests won't catch regressions.

Suggested change: drop the prints and restore the assertion:

    compare_and_assert(
        fused_attn_bwd_fp8[i],
        fused_attn_bwd_f16[i],
        f"fused_attn_bwd_fp8[{i}]",
        f"fused_attn_bwd_f16[{i}]",
        atol,
        rtol,
        rmse_tol,
        True,
    )

Context (context_parallel.py):

    q_fp8, k_fp8, v_fp8 = (None, None, None)
    # communicate for the 'a2a' part of 'a2a+p2p'
    if cp_size_a2a > 1:
        print(f">>>>>>======================>>>>>> {torch.cuda.current_device()}: fp8: {fp8}, is_input_fp8: {is_input_fp8}, fp8_recipe.mxfp8(): {fp8_recipe.mxfp8()}")

Debug print statement left in production code.

Suggested change: remove the print statement.

Context (context_parallel.py):

    # fp8 attention: q, k, v: torch.Tensor, dtype=torch.uint8
    out = None
    o_format = qkv_format
    for i in range(cp_size + 1):

Debug print statement left in production code.

Suggested change: remove the debug print, keeping only

    for i in range(cp_size + 1):

Context (context_parallel.py):

        softmax_lse_per_step[0],
        seq_dim,
    )
    print(f"====o/v===== {torch.cuda.current_device()}: i: {i}, {enable_mla}, out.shape: {out.shape} {out_per_step[0].shape} {v_shape} {o_shape}")

Debug print statement left in production code.

Suggested change: remove the print statement.

Context (utils.py):

    dO_quantizer.set_usage(rowwise=True, columnwise=False)
    dO_quantizer.internal = True

    dP_quantizer = quantizers["scaling_bwd"][META_DP]

Debug print statement in combine_and_quantize will spam logs in production.

Suggested change: remove the debug print, keeping only

    dP_quantizer = quantizers["scaling_bwd"][META_DP]


Context (test file):

    os.environ["NVTE_FP8_DPA_BWD"] = "1" if fp8_dpa_bwd else "0"
    # os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "1"

NVTE_ALLOW_NONDETERMINISTIC_ALGO commented out - could affect test behavior or cause failures.

Suggested change: restore the environment variable:

    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "1"

Comment on lines 2126 to +2129

    dP_quantizer.interal = True
    dP_quantizer.set_usage(rowwise=True, columnwise=False)

    dQKV_quantizer = quantizers["scaling_bwd"][META_DQKV]

Hard assertions on sequence/dimension alignment may fail in valid edge cases - consider providing clearer error messages.

greptile-apps bot (Contributor) commented Mar 1, 2026

Additional Comments (2)

transformer_engine/pytorch/tensor/storage/grouped_tensor.py
Debug print statement left in production code


tests/pytorch/attention/run_attention_with_cp.py
Debug print statement left in test script

@cyanguwa cyanguwa reopened this Mar 1, 2026
pre-commit-ci bot and others added 6 commits March 1, 2026 22:36
