High-efficiency floating-point neural network inference operators for mobile, server, and Web
Efficient Deep Learning Systems course materials (HSE, YSDA)
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
The Tensor Algebra SuperOptimizer for Deep Learning
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
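A minimal sketch of what batch-normalization fusion does, folding a BatchNorm2d's scale and shift into the weights and bias of the preceding Conv2d at inference time; this illustrates the general technique and is not the archived repository's actual code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d into the preceding Conv2d (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, conv.dilation,
                      conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```

Applied at export time, this removes the BatchNorm op entirely, so the fused convolution produces the same outputs with one fewer kernel launch per layer.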
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.
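As a rough illustration of the workflow such tools wrap, here is a minimal ONNX Runtime session that prefers the TensorRT execution provider and falls back to CUDA and CPU; the model path, input shape, and preprocessing are placeholders, not the repo's actual pipeline.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order: TensorRT first, then CUDA, then CPU.
session = ort.InferenceSession(
    "yolov3.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy input shaped like one 416x416 RGB image; real code would preprocess a frame.
dummy = np.random.rand(1, 3, 416, 416).astype(np.float32)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy})
```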
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Blog posts, reading reports, and code examples on AGI/LLM-related topics.
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware.
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
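A minimal sketch of the model-conversion step such a project typically starts from: building a TensorRT engine from an exported ONNX depth model using the TensorRT 8.x Python API. File names are placeholders, and the actual project may instead use trtexec or different builder settings.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX depth-estimation model (placeholder file name).
with open("depth_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 where the hardware supports it

# Serialize the optimized engine so it can be deserialized at inference time.
engine_bytes = builder.build_serialized_network(network, config)
with open("depth_model.plan", "wb") as f:
    f.write(engine_bytes)
```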
LightTTS is a lightweight TTS inference framework optimized for CosyVoice2 and CosyVoice3, enabling fast and scalable speech synthesis in Python, with support for stream and bistream modes.
Official code of Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX inference server.
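As a rough illustration of the KV-caching idea mentioned above (not the LoRAX implementation): during autoregressive decoding, the keys and values of past tokens are stored so each new token attends against the cache instead of re-encoding the whole prefix.

```python
import numpy as np

def attend_with_cache(q, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache.

    q, k_new, v_new: (d,) vectors for the current token.
    cache: dict of previously seen keys/values, each of shape (t, d).
    """
    cache["k"] = np.vstack([cache["k"], k_new[None, :]])  # append, never recompute
    cache["v"] = np.vstack([cache["v"], v_new[None, :]])
    scores = cache["k"] @ q / np.sqrt(q.shape[0])          # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"]                             # (d,) context vector

d = 64
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
for _ in range(4):  # pretend we decode 4 tokens
    q, k, v = (np.random.randn(d) for _ in range(3))
    out = attend_with_cache(q, k, v, cache)
```

The per-step cost stays linear in the current context length rather than quadratic in it, which is why serving stacks keep the cache resident in GPU memory.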
cross-platform modular neural network inference library, small and efficient
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead, without fine-tuning.
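A minimal sketch of how a per-head sparse attention mask is applied: disallowed positions are set to negative infinity before the softmax so they contribute nothing. The mask pattern below is an arbitrary placeholder, not DAM's adaptive mask-generation logic.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head attention with a boolean sparsity mask.

    q, k, v: (seq, d) arrays; mask: (seq, seq) boolean, True = keep.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # drop masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq, d = 8, 16
q, k, v = (np.random.randn(seq, d) for _ in range(3))
# Placeholder pattern: causal mask combined with a local window of 3 tokens.
causal = np.tril(np.ones((seq, seq), dtype=bool))
local = np.abs(np.arange(seq)[:, None] - np.arange(seq)[None, :]) < 3
out = masked_attention(q, k, v, causal & local)
```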