The langchain-nvidia-ai-endpoints package provides LangChain integrations for chat, embeddings, reranking, and retrieval powered by NVIDIA AI — including Nemotron, NVIDIA’s open model family built for agentic AI, and hundreds of community models on the NVIDIA API Catalog. Models run on NVIDIA NIM microservices: container images that expose a standard OpenAI-compatible API, optimized with TensorRT-LLM for peak throughput on NVIDIA hardware. They can be accessed via the hosted API Catalog or self-hosted on-premises.

Components

Component        Class              Description
Chat             ChatNVIDIA         Chat completions with any NVIDIA-hosted model or local NIM
Chat (Dynamo)    ChatNVIDIADynamo   ChatNVIDIA with KV cache routing hints for Dynamo deployments
Embeddings       NVIDIAEmbeddings   Dense vector embeddings for semantic search and RAG
Reranking        NVIDIARerank       Document reranking by query relevance
Retrieval        NVIDIARetriever    Retrieval from an NVIDIA RAG Blueprint server

Chat: ChatNVIDIA

ChatNVIDIA provides chat completions over NVIDIA-hosted models and local NIM deployments. It supports tool calling, structured output, image inputs, and streaming.

Install

pip install -qU langchain-nvidia-ai-endpoints

Access the NVIDIA API Catalog

  1. Create a free account on the NVIDIA API Catalog and log in.
  2. Click your profile icon, then API Keys > Generate API Key.
  3. Copy and save the key as NVIDIA_API_KEY.
import getpass
import os

if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith(
        "nvapi-"
    ), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
Nemotron is NVIDIA’s open model family designed for agentic AI. The models use a hybrid Mamba-Transformer mixture-of-experts architecture that delivers leading benchmark performance with high throughput and support for up to 1M token context windows. Nemotron model weights, training data, and implementation recipes are published openly under the NVIDIA Open Model License.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Nemotron 3 Nano — efficient reasoning and agentic tasks
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano-30b-a3b")
result = llm.invoke("Plan a three-step research workflow for competitive analysis.")
print(result.content)
See the ChatNVIDIA integration page for full documentation including tool calling, multimodal inputs, and Nemotron-specific examples.

Chat: ChatNVIDIADynamo

ChatNVIDIADynamo is a drop-in replacement for ChatNVIDIA for use with NVIDIA Dynamo deployments. It automatically injects KV cache routing hints into every request, allowing the Dynamo scheduler to optimize memory allocation, load routing, and request priority.
from langchain_nvidia_ai_endpoints import ChatNVIDIADynamo

llm = ChatNVIDIADynamo(
    base_url="http://localhost:8099/v1",
    model="nvidia/nemotron-3-nano-30b-a3b",
    osl=512,             # expected output sequence length (tokens)
    iat=250,             # expected inter-arrival time (ms)
    latency_sensitivity=1.0,
    priority=1,
)
result = llm.invoke("Summarize KV cache routing in one sentence.")
print(result.content)
See the ChatNVIDIA integration page for the full ChatNVIDIADynamo reference including per-invocation overrides and streaming.

Embeddings: NVIDIAEmbeddings

NVIDIAEmbeddings generates dense vector embeddings for use in semantic search and RAG pipelines.
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="NV-Embed-QA")
embedder.embed_query("What's the temperature today?")
See the NVIDIAEmbeddings integration page for full documentation.

Reranking: NVIDIARerank

NVIDIARerank reranks a list of documents by relevance to a query using a NeMo Retriever reranking NIM.
from langchain_core.documents import Document
from langchain_nvidia_ai_endpoints import NVIDIARerank

ranker = NVIDIARerank(model="nvidia/llama-3.2-nv-rerankqa-1b-v1")
passages = [
    "GPU memory bandwidth measures how fast data moves between VRAM and the GPU cores.",
    "NVLink is a high-speed interconnect between multiple GPUs.",
]
docs = ranker.compress_documents(
    query="What is GPU memory bandwidth?",
    documents=[Document(page_content=p) for p in passages],
)

Retrieval: NVIDIARetriever

NVIDIARetriever connects LangChain to a running NVIDIA RAG Blueprint server and retrieves relevant documents via the /v1/search endpoint. It supports reranking, query rewriting, and metadata filtering.
from langchain_nvidia_ai_endpoints import NVIDIARetriever

retriever = NVIDIARetriever(base_url="http://localhost:8081", k=4)
docs = retriever.invoke("What is NVIDIA NIM?")
See the NVIDIARetriever integration page for full documentation.

Self-host with NVIDIA NIM Microservices

When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. For more information, refer to NVIDIA NIM Microservices.
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings, NVIDIARerank

# connect to a chat NIM running at localhost:8000, specifying a model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="nvidia/nemotron-3-nano-30b-a3b")

# connect to an embedding NIM running at localhost:8080
embedder = NVIDIAEmbeddings(base_url="http://localhost:8080/v1")

# connect to a reranking NIM running at localhost:2016
ranker = NVIDIARerank(base_url="http://localhost:2016/v1")

Accelerate LangGraph with langchain-nvidia-langgraph

The langchain-nvidia-langgraph package provides NVIDIA-optimized execution strategies for LangGraph graphs. It offers two complementary optimizations applied at compile time:
  • Parallel execution: independent nodes are automatically identified and run concurrently, eliminating unnecessary sequential bottlenecks.
  • Speculative execution: both branches of a conditional edge run simultaneously; the wrong branch is discarded once the routing condition resolves.
Neither optimization requires changes to node logic or graph edges.

Install

pip install -qU langchain-nvidia-langgraph

Parallel execution

Replace StateGraph with NvidiaStateGraph. The rest of your graph definition stays the same.
from langchain_nvidia_langgraph import NvidiaStateGraph
from langgraph.graph import END
from typing import TypedDict

class AgentState(TypedDict):
    question: str
    context: str
    answer: str

graph = NvidiaStateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("rerank", rerank_node)
graph.add_node("generate", generate_node)
graph.add_edge("retrieve", "rerank")
graph.add_edge("rerank", "generate")
graph.add_edge("generate", END)
graph.set_entry_point("retrieve")

app = graph.compile()  # independent nodes run in parallel automatically

Speculative execution

Enable speculation at compile time. The executor runs both sides of conditional branches in parallel and keeps the result that matches the routing decision.
app = graph.compile(speculation=True)
Limitations: Speculative execution does not support LangGraph checkpointing, streaming, interrupts, or human-in-the-loop workflows. Use speculation=False (the default) when those features are needed.

Control parallelism with decorators

Three decorators give explicit control over which nodes participate in optimization:
from langchain_nvidia_langgraph import sequential, depends_on, speculation_unsafe

# Prevent a node from being parallelized (e.g., it writes to shared state)
@sequential
def write_to_db(state):
    ...

# Declare a dependency not expressed in graph edges
@depends_on("retrieve")
def log_retrieval(state):
    ...

# Exclude a node from speculative execution (e.g., it has side effects)
@speculation_unsafe
def send_notification(state):
    ...

Wrap an existing compiled graph

If you already have a compiled LangGraph app and want to add NVIDIA optimization without modifying your graph definition, use with_app_compile:
from langchain_nvidia_langgraph import with_app_compile
from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
# ... add nodes and edges as normal ...

app = with_app_compile(graph, speculation=False)
