High-efficiency floating-point neural network inference operators for mobile, server, and Web
Efficient Deep Learning Systems course materials (HSE, YSDA)
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
The Tensor Algebra SuperOptimizer for Deep Learning
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
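A minimal sketch of what batch-normalization fusion does, folding a BatchNorm2d's scale and shift into the weights and bias of the preceding Conv2d at inference time; this illustrates the general technique and is not the archived repository's actual code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d into the preceding Conv2d (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, conv.dilation,
                      conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```

Applied at export time, this removes the BatchNorm op entirely, so the fused convolution produces the same outputs with one fewer kernel launch per layer.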
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.
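As a rough illustration of the workflow such tools wrap, here is a minimal ONNX Runtime session that prefers the TensorRT execution provider and falls back to CUDA and CPU; the model path, input shape, and preprocessing are placeholders, not the repo's actual pipeline.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order: TensorRT first, then CUDA, then CPU.
session = ort.InferenceSession(
    "yolov3.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy input shaped like one 416x416 RGB image; real code would preprocess a frame.
dummy = np.random.rand(1, 3, 416, 416).astype(np.float32)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy})
```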
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Blog posts, reading reports, and code examples on AGI/LLM-related topics.
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware.
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
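A minimal sketch of the model-conversion step such a project typically starts from: building a TensorRT engine from an exported ONNX depth model using the TensorRT 8.x Python API. File names are placeholders, and the actual project may instead use trtexec or different builder settings.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX depth-estimation model (placeholder file name).
with open("depth_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 where the hardware supports it

# Serialize the optimized engine so it can be deserialized at inference time.
engine_bytes = builder.build_serialized_network(network, config)
with open("depth_model.plan", "wb") as f:
    f.write(engine_bytes)
```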
LightTTS is a lightweight TTS inference framework optimized for CosyVoice2 and CosyVoice3, enabling fast and scalable speech synthesis in Python, with support for stream and bistream modes.
Official code of Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX inference server.
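As a rough illustration of the KV-caching idea mentioned above (not the LoRAX implementation): during autoregressive decoding, the keys and values of past tokens are stored so each new token attends against the cache instead of re-encoding the whole prefix.

```python
import numpy as np

def attend_with_cache(q, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache.

    q, k_new, v_new: (d,) vectors for the current token.
    cache: dict of previously seen keys/values, each of shape (t, d).
    """
    cache["k"] = np.vstack([cache["k"], k_new[None, :]])  # append, never recompute
    cache["v"] = np.vstack([cache["v"], v_new[None, :]])
    scores = cache["k"] @ q / np.sqrt(q.shape[0])          # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"]                             # (d,) context vector

d = 64
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
for _ in range(4):  # pretend we decode 4 tokens
    q, k, v = (np.random.randn(d) for _ in range(3))
    out = attend_with_cache(q, k, v, cache)
```

The per-step cost stays linear in the current context length rather than quadratic in it, which is why serving stacks keep the cache resident in GPU memory.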
cross-platform modular neural network inference library, small and efficient
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead, without fine-tuning.
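A minimal sketch of how a per-head sparse attention mask is applied: disallowed positions are set to negative infinity before the softmax so they contribute nothing. The mask pattern below is an arbitrary placeholder, not DAM's adaptive mask-generation logic.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head attention with a boolean sparsity mask.

    q, k, v: (seq, d) arrays; mask: (seq, seq) boolean, True = keep.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # drop masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq, d = 8, 16
q, k, v = (np.random.randn(seq, d) for _ in range(3))
# Placeholder pattern: causal mask combined with a local window of 3 tokens.
causal = np.tril(np.ones((seq, seq), dtype=bool))
local = np.abs(np.arange(seq)[:, None] - np.arange(seq)[None, :]) < 3
out = masked_attention(q, k, v, causal & local)
```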