fix: use env-only SM100 workaround for vLLM PDL/MMA path #4035

danielhanchen wants to merge 1 commit into main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf31695347
```python
os.environ.setdefault("VLLM_LORA_DISABLE_PDL", "1")
os.environ.setdefault("TRITON_DISABLE_PDL", "1")
os.environ.setdefault("VLLM_USE_FBGEMM", "0")
```
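Because these are `setdefault` calls, they only fill in values that are not already present, so a user who exports one of these variables before launch keeps their choice. A minimal sketch of that behavior:

```python
import os

# Simulate a user who explicitly re-enabled PDL before launching.
os.environ["VLLM_LORA_DISABLE_PDL"] = "0"
os.environ.pop("TRITON_DISABLE_PDL", None)  # ensure this one starts unset

# setdefault never overwrites an existing value...
os.environ.setdefault("VLLM_LORA_DISABLE_PDL", "1")  # no-op: key exists
# ...but it does fill in missing ones.
os.environ.setdefault("TRITON_DISABLE_PDL", "1")     # sets the key

print(os.environ["VLLM_LORA_DISABLE_PDL"])  # -> 0 (user override preserved)
print(os.environ["TRITON_DISABLE_PDL"])     # -> 1 (workaround applied)
```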
There was a problem hiding this comment.
Apply SM100 workaround before importing vLLM
This change relies only on environment variables (VLLM_LORA_DISABLE_PDL, TRITON_DISABLE_PDL, VLLM_USE_FBGEMM) but unsloth currently imports vLLM earlier via fix_vllm_guided_decoding_params() in the normal init flow (unsloth/__init__.py calls it before fix_vllm_pdl_blackwell), so these setdefault calls can run too late to affect vLLM’s initialization path on SM100. In that common import order, the previous post-import monkey-patch safety net is gone and users can still hit the Blackwell PDL/MMA crash the workaround is meant to prevent.
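One way to make the ordering hazard Codex describes observable is to check `sys.modules` before setting the variables; if vLLM is already imported, the env-only workaround may have arrived too late. A hedged sketch (the function name is hypothetical, not unsloth's actual API):

```python
import os
import sys

def apply_sm100_env_workaround() -> bool:
    # Illustrative guard: if vLLM is already in sys.modules, these env
    # vars may be read too late to affect its initialization path.
    applied_in_time = "vllm" not in sys.modules
    os.environ.setdefault("VLLM_LORA_DISABLE_PDL", "1")
    os.environ.setdefault("TRITON_DISABLE_PDL", "1")
    os.environ.setdefault("VLLM_USE_FBGEMM", "0")
    return applied_in_time  # True if set before vLLM was first imported

print(apply_sm100_env_workaround())
```

A real fix would instead move the call ahead of anything that imports vLLM (such as `fix_vllm_guided_decoding_params()`).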
Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the vLLM workaround for SM100 GPUs by transitioning from an intrusive monkey-patching approach to a more robust and less disruptive method using environment variables. This change ensures continued compatibility and stability for vLLM on Blackwell architectures while minimizing modifications to vLLM's internal code, providing a cleaner and more maintainable solution.

Highlights
Code Review
This pull request refactors the vLLM workaround for SM100 GPUs by replacing intrusive monkey-patching with setting environment variables. This is a significant improvement for maintainability and robustness, as it uses official toggles and allows users to override the settings. The changes are well-documented and the logic is sound. I have one minor suggestion to improve the readability of a log message.
```python
logger.info(
    f"Unsloth: Applied SM100 ({sm100_gpu_name}) vLLM workaround via env vars: "
    f"VLLM_LORA_DISABLE_PDL=1, TRITON_DISABLE_PDL=1, VLLM_USE_FBGEMM=0"
)
```
For better readability in logs, consider formatting this long log message across multiple lines. A single, very long line can be hard to read in terminal outputs and may get truncated.
```diff
 logger.info(
-    f"Unsloth: Applied SM100 ({sm100_gpu_name}) vLLM workaround via env vars: "
-    f"VLLM_LORA_DISABLE_PDL=1, TRITON_DISABLE_PDL=1, VLLM_USE_FBGEMM=0"
+    f"Unsloth: Applied SM100 ({sm100_gpu_name}) vLLM workaround via env vars:"
+    f"\n - VLLM_LORA_DISABLE_PDL=1\n - TRITON_DISABLE_PDL=1\n - VLLM_USE_FBGEMM=0"
 )
```
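The suggested multi-line format renders like this; a minimal standalone reproduction (using `print` in place of `logger.info`, with a hypothetical device name):

```python
sm100_gpu_name = "NVIDIA B200"  # illustrative device name, not a real query

# Same f-string concatenation as the suggested diff above.
msg = (
    f"Unsloth: Applied SM100 ({sm100_gpu_name}) vLLM workaround via env vars:"
    f"\n - VLLM_LORA_DISABLE_PDL=1\n - TRITON_DISABLE_PDL=1\n - VLLM_USE_FBGEMM=0"
)
print(msg)  # each env var lands on its own indented line
```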
Summary
On SM100 (B200/B100), switch the vLLM workaround in fix_vllm_pdl_blackwell() to env vars only and remove runtime monkey-patching of vLLM internals. This keeps vLLM enabled while avoiding intrusive patching behavior.
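SM100 corresponds to CUDA compute capability 10.0 (Blackwell B100/B200). A minimal sketch of such a gate (function name hypothetical; in practice the pair would come from `torch.cuda.get_device_capability()`):

```python
def is_sm100(major: int, minor: int) -> bool:
    # SM100 == compute capability 10.0 (Blackwell B100/B200).
    return (major, minor) == (10, 0)

print(is_sm100(10, 0))  # -> True  (B200/B100)
print(is_sm100(9, 0))   # -> False (Hopper H100)
```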
What changed
unsloth/import_fixes.py, fix_vllm_pdl_blackwell():
- No longer monkey-patches:
  - vllm.lora.ops.triton_ops.utils.supports_pdl
  - vllm.lora.ops.triton_ops.lora_expand_op.supports_pdl
  - vllm.lora.ops.triton_ops.lora_shrink_op.supports_pdl
  - vllm.lora.ops.triton_ops.fused_moe_lora_op.supports_pdl
- Instead sets env vars via setdefault:
  - VLLM_LORA_DISABLE_PDL=1
  - TRITON_DISABLE_PDL=1
  - VLLM_USE_FBGEMM=0
- Error avoided: "Arch conditional MMA instruction used without targeting appropriate compute capability"

Why
Validation
1) Import-time env probe on B200
Log: temp/envpr_clean/import_probe.log
- Before import unsloth: VLLM_LORA_DISABLE_PDL=None, TRITON_DISABLE_PDL=None, VLLM_USE_FBGEMM=None
- After import unsloth: VLLM_LORA_DISABLE_PDL='1', TRITON_DISABLE_PDL='1', VLLM_USE_FBGEMM='0'

Re-check after final comment cleanup: temp/envpr_clean/import_probe_post_comment_fix.log

2) Actual training runs with the patch
Script: temp/trunc_call_training_probe_forced.py
- transformers==5.0.0 (temp/envpr_clean/train_tf500.log): RESULT_JSON: train_runtime=6.1545, train_loss=1.863374924659729
- transformers==4.57.6 (temp/envpr_clean/train_tf4576.log): RESULT_JSON: train_runtime=5.9139, train_loss=1.863374924659729

3) Error-string scan
Searched in temp/envpr_clean/*.log for:
- "Arch conditional MMA"
- "CUTE_INVALID_CONTROL_PATH"
- "Trying to use tma"

Result: no matches.
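The scan in step 3 amounts to a substring search over the logs; a small helper reproducing it (patterns copied from above, helper name hypothetical):

```python
# Failure signatures searched for in step 3 (strings copied from the PR).
PATTERNS = (
    "Arch conditional MMA",
    "CUTE_INVALID_CONTROL_PATH",
    "Trying to use tma",
)

def scan_text(text: str) -> list:
    # Return the subset of known SM100 failure signatures present in text.
    return [p for p in PATTERNS if p in text]

print(scan_text("training completed, loss=1.86"))  # -> []
```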
4) Transformers initialization audit (other inits)
Requested check: whether other inits should be upcast to float32.
- transformers==5.0.0: transformers/initialization.py has the _is_hf_initialized guard; trunc_normal_ delegates to torch.nn.init.trunc_normal_ directly.
- transformers==4.57.6: no transformers.initialization module. transformers/models/phi4_multimodal/modeling_phi4_multimodal.py, transformers/models/siglip/modeling_siglip.py, and transformers/models/siglip2/modeling_siglip2.py run _trunc_normal_ and variance_scaling_ in the tensor dtype. transformers/models/vjepa2/modeling_vjepa2.py already includes an explicit float32 upcast helper (trunc_normal_f32_) before casting back.

Conclusion: no additional global overload needed for other inits in this change.
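The helpers audited above all wrap the same truncated-normal primitive: draw from a normal distribution and keep only values inside the truncation bounds. A scalar sketch of that pattern (the real helpers operate on tensors, upcasting to float32 first):

```python
import random

def trunc_normal(mean: float = 0.0, std: float = 1.0,
                 a: float = -2.0, b: float = 2.0) -> float:
    # Rejection-sample until the draw lands inside [a, b].
    while True:
        x = random.gauss(mean, std)
        if a <= x <= b:
            return x

# Typical init: small std, tight bounds; every sample stays in range.
samples = [trunc_normal(std=0.02, a=-0.04, b=0.04) for _ in range(1000)]
print(all(-0.04 <= s <= 0.04 for s in samples))  # -> True
```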
LoRA impact