The `langchain-nvidia-ai-endpoints` package provides LangChain integrations for chat, embeddings, reranking, and retrieval powered by NVIDIA AI, including Nemotron, NVIDIA’s open model family built for agentic AI, and hundreds of community models on the NVIDIA API Catalog.
Models run on NVIDIA NIM microservices: container images that expose a standard OpenAI-compatible API, optimized with TensorRT-LLM for peak throughput on NVIDIA hardware. They can be accessed via the hosted API Catalog or self-hosted on-premises.
Components
| Component | Class | Description |
|---|---|---|
| Chat | ChatNVIDIA | Chat completions with any NVIDIA-hosted model or local NIM |
| Chat (Dynamo) | ChatNVIDIADynamo | ChatNVIDIA with KV cache routing hints for Dynamo deployments |
| Embeddings | NVIDIAEmbeddings | Dense vector embeddings for semantic search and RAG |
| Reranking | NVIDIARerank | Document reranking by query relevance |
| Retrieval | NVIDIARetriever | Retrieval from an NVIDIA RAG Blueprint server |
Chat: ChatNVIDIA
ChatNVIDIA provides chat completions over NVIDIA-hosted models and local NIM deployments. It supports tool calling, structured output, image inputs, and streaming.
Install
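A typical installation with pip, using the package name from this page:

```shell
pip install -U langchain-nvidia-ai-endpoints
```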
Access the NVIDIA API Catalog
- Create a free account on the NVIDIA API Catalog and log in.
- Click your profile icon, then API Keys > Generate API Key.
- Copy and save the key as `NVIDIA_API_KEY`.
Nemotron: featured models for agentic AI
Nemotron is NVIDIA’s open model family designed for agentic AI. The models use a hybrid Mamba-Transformer mixture-of-experts architecture that delivers leading benchmark performance with high throughput and context windows of up to 1M tokens. Nemotron model weights, training data, and implementation recipes are published openly under the NVIDIA Open Model License.
See the ChatNVIDIA integration page for full documentation, including tool calling, multimodal inputs, and Nemotron-specific examples.
Chat: ChatNVIDIADynamo
ChatNVIDIADynamo is a drop-in replacement for ChatNVIDIA for use with NVIDIA Dynamo deployments. It automatically injects KV cache routing hints into every request, allowing the Dynamo scheduler to optimize memory allocation, load routing, and request priority.
See the ChatNVIDIA integration page for the full ChatNVIDIADynamo reference, including per-invocation overrides and streaming.
Embeddings: NVIDIAEmbeddings
NVIDIAEmbeddings generates dense vector embeddings for use in semantic search and RAG pipelines.
See the NVIDIAEmbeddings integration page for full documentation.
Reranking: NVIDIARerank
NVIDIARerank reranks a list of documents by relevance to a query using a NeMo Retriever reranking NIM.
Retrieval: NVIDIARetriever
NVIDIARetriever connects LangChain to a running NVIDIA RAG Blueprint server and retrieves relevant documents via the /v1/search endpoint. It supports reranking, query rewriting, and metadata filtering.
See the NVIDIARetriever integration page for full documentation.
Self-host with NVIDIA NIM Microservices
When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. For more information, refer to NVIDIA NIM Microservices.
Accelerate LangGraph with langchain-nvidia-langgraph
The `langchain-nvidia-langgraph` package provides NVIDIA-optimized execution strategies for LangGraph graphs. It offers two complementary optimizations applied at compile time:
- Parallel execution: independent nodes are automatically identified and run concurrently, eliminating unnecessary sequential bottlenecks.
- Speculative execution: both branches of a conditional edge run simultaneously; the wrong branch is discarded once the routing condition resolves.
Install
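A typical installation with pip, using the package name from this page:

```shell
pip install -U langchain-nvidia-langgraph
```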
Parallel execution
Replace `StateGraph` with `NvidiaStateGraph`. The rest of your graph definition stays the same.
Speculative execution
Enable speculation at compile time. The executor runs both sides of conditional branches in parallel and keeps the result that matches the routing decision.
Limitations: speculative execution does not support LangGraph checkpointing, streaming, interrupts, or human-in-the-loop workflows. Use `speculation=False` (the default) when those features are needed.
Control parallelism with decorators
Three decorators give explicit control over which nodes participate in optimization.
Wrap an existing compiled graph
If you already have a compiled LangGraph app and want to add NVIDIA optimization without modifying your graph definition, use `with_app_compile`:
Related topics
- `langchain-nvidia-ai-endpoints` package README
- `langchain-nvidia-langgraph` package
- Nemotron model family
- Overview of NVIDIA NIM for Large Language Models (LLMs)
- Overview of NeMo Retriever Embedding NIM
- Overview of NeMo Retriever Reranking NIM
- ChatNVIDIA Model
- NVIDIAEmbeddings Model for RAG Workflows
- NVIDIARetriever
- NVIDIA Dynamo