Improve graph: entity types, traversal, ingestion pipeline, REST API, tests, and scoring #1139

Open
ChristianKniep wants to merge 3 commits into MemMachine:main from ChristianKniep:knowledge_graph

@ChristianKniep
Contributor

Purpose of the change

Adds a knowledge graph layer to MemMachine, enabling multi-hop relationship traversal, entity-typed nodes, semantic feature relationships, graph analytics, and node deduplication on top of the existing Neo4j vector store. This allows the system to answer queries that require following connections across memories (for example, linking Bob to Project Atlas because Bob is a TensorFlow expert and Project Atlas uses TensorFlow) rather than relying solely on vector similarity scoring.

Description

This PR introduces end-to-end knowledge graph capabilities across the storage, application, and API layers:

Graph infrastructure (neo4j_vector_graph_store.py, data_types.py, graph_traversal_store.py):

  • RELATED_TO edges are created between semantically similar features during ingestion, controlled by a configurable cosine-similarity threshold (related_to_threshold, default 0.70)
  • Entity-type labels are applied to nodes (ENTITY_TYPE_Person, ENTITY_TYPE_Concept, ENTITY_TYPE_Event, etc.) and exposed as a filter parameter on the search API
  • Multi-hop traversal, graph-filtered vector search, shortest-path queries, and subgraph (ego-graph) extraction
  • GDS-powered analytics: PageRank, Louvain community detection, degree centrality, betweenness centrality
  • Background node deduplication with configurable SAME_AS proposals and merge/dismiss resolution via the API
  • Near-duplicate RELATED_TO edge suppression at similarity >= 0.99 to avoid noise
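The two similarity bounds above (link at or above related_to_threshold, suppress at >= 0.99) can be illustrated with a minimal sketch. The helper names `cosine_similarity` and `should_link` are hypothetical and not taken from the PR, and treating the lower bound as inclusive is an assumption.

```python
import math

RELATED_TO_THRESHOLD = 0.70   # default named in the PR description
NEAR_DUPLICATE_CUTOFF = 0.99  # suppression bound named in the PR description


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def should_link(a, b, threshold=RELATED_TO_THRESHOLD, cutoff=NEAR_DUPLICATE_CUTOFF):
    """Hypothetical helper: create a RELATED_TO edge only when
    threshold <= similarity < cutoff (inclusive lower bound is an assumption)."""
    sim = cosine_similarity(a, b)
    return threshold <= sim < cutoff
```

Under these assumptions, identical embeddings are not linked (they fall in the near-duplicate band), while moderately similar ones are.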

Semantic ingestion pipeline (semantic_ingestion.py, semantic_relationship_storage.py, feature_relationship_types.py):

  • After each ingestion cycle, semantic features are cross-linked with typed edges: RELATED_TO, CONTRADICTS, IMPLIES, SUPERSEDES
  • A new SemanticRelationshipStorage protocol exposes relationship CRUD and contradiction detection
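As a rough illustration of the typed edges and contradiction detection described above, the sketch below models the four relationship types and filters CONTRADICTS pairs within a feature set. The class and function names (`RelationshipType`, `FeatureRelationship`, `find_contradictions`) and the field layout are assumptions for illustration, not the actual SemanticRelationshipStorage protocol.

```python
from dataclasses import dataclass
from enum import Enum


class RelationshipType(str, Enum):
    # The four edge types listed in the PR description.
    RELATED_TO = "RELATED_TO"
    CONTRADICTS = "CONTRADICTS"
    IMPLIES = "IMPLIES"
    SUPERSEDES = "SUPERSEDES"


@dataclass(frozen=True)
class FeatureRelationship:
    # Hypothetical record shape; field names are assumptions.
    source_id: str
    target_id: str
    type: RelationshipType
    confidence: float = 1.0


def find_contradictions(relationships, feature_ids):
    """Return (source, target) pairs joined by CONTRADICTS where both
    endpoints belong to the given feature set."""
    ids = set(feature_ids)
    return [
        (r.source_id, r.target_id)
        for r in relationships
        if r.type is RelationshipType.CONTRADICTS
        and r.source_id in ids
        and r.target_id in ids
    ]
```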

Episode store deduplication (episode_sqlalchemy_store.py):

  • Adds a content_hash column (SHA-256 of session_key + producer_id + content) with ON CONFLICT DO NOTHING upsert on both PostgreSQL and SQLite
  • Includes an online migration that backfills existing rows and adds the unique constraint idempotently on startup
  • Episode.is_new flag allows callers to distinguish newly inserted episodes from deduplicated returns
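The content-hash deduplication above can be sketched against an in-memory SQLite database. This is a simplified illustration, not the code in episode_sqlalchemy_store.py: the table layout, the `insert_episode` helper, and the NUL separator used when concatenating the hashed fields are all assumptions. The returned boolean plays the role that Episode.is_new is described as playing.

```python
import hashlib
import sqlite3


def content_hash(session_key, producer_id, content):
    # Assumption: fields are joined with a NUL separator before hashing;
    # the PR only says the hash covers session_key + producer_id + content.
    payload = "\x00".join((session_key, producer_id, content))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_episode(conn, session_key, producer_id, content):
    """Insert an episode; return True if a row was inserted,
    False if an identical episode already existed."""
    cur = conn.execute(
        "INSERT INTO episodes (session_key, producer_id, content, content_hash) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(content_hash) DO NOTHING",
        (session_key, producer_id, content,
         content_hash(session_key, producer_id, content)),
    )
    return cur.rowcount == 1


# Hypothetical minimal schema with a unique constraint on content_hash.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes ("
    "id INTEGER PRIMARY KEY, session_key TEXT, producer_id TEXT, "
    "content TEXT, content_hash TEXT UNIQUE)"
)
first = insert_episode(conn, "s1", "alice", "hello")
second = insert_episode(conn, "s1", "alice", "hello")
```

Here the second call hits the unique constraint and inserts nothing, so a caller can tell a fresh insert from a deduplicated one.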

REST API (graph_router.py, ~1,900 lines) — new /memories/graph route group:

  • POST /memories/graph/search/multi-hop — multi-hop traversal from an anchor node
  • POST /memories/graph/search/filtered — graph-filtered vector similarity search
  • POST /memories/graph/relationships — create typed feature relationships
  • POST /memories/graph/relationships/get — query relationships with direction and confidence filters
  • POST /memories/graph/relationships/delete — delete a specific relationship
  • POST /memories/graph/contradictions — find all CONTRADICTS pairs within a feature set
  • POST /memories/graph/dedup/proposals — list duplicate node proposals
  • POST /memories/graph/dedup/resolve — merge or dismiss duplicate pairs
  • POST /memories/graph/analytics/pagerank — compute PageRank (requires GDS)
  • POST /memories/graph/analytics/communities — Louvain community detection (requires GDS)
  • POST /memories/graph/analytics/stats — graph statistics (node/edge counts, degree, type distribution)
  • POST /memories/graph/analytics/shortest-path — shortest path between two nodes
  • POST /memories/graph/analytics/degree-centrality — degree centrality ranking
  • POST /memories/graph/analytics/betweenness — betweenness centrality (requires GDS)
  • POST /memories/graph/analytics/subgraph — ego-graph/subgraph extraction
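The "(requires GDS)" markers in the list above can be captured in a small lookup, for instance to let a client skip calls that cannot succeed without the plugin. This is a hypothetical client-side sketch; the helper names are not part of the PR, and only the three paths marked above are treated as GDS-dependent.

```python
GRAPH_PREFIX = "/memories/graph"

# Endpoints the PR description marks "(requires GDS)".
GDS_ENDPOINTS = frozenset({
    f"{GRAPH_PREFIX}/analytics/pagerank",
    f"{GRAPH_PREFIX}/analytics/communities",
    f"{GRAPH_PREFIX}/analytics/betweenness",
})


def requires_gds(path):
    """True if the given route is one of the GDS-backed analytics endpoints."""
    return path.rstrip("/") in GDS_ENDPOINTS


def is_available(path, gds_enabled):
    """Hypothetical check: a route is usable if GDS is enabled
    or the route does not depend on GDS."""
    return gds_enabled or not requires_gds(path)
```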

Configuration (database_conf.py, configuration/__init__.py):

  • New Neo4j knobs: gds_enabled, gds_default_damping_factor, gds_default_max_iterations, pagerank_auto_enabled, pagerank_trigger_threshold, dedup_trigger_threshold, dedup_embedding_threshold, dedup_property_threshold, dedup_auto_merge
  • New semantic memory knob: related_to_threshold

Migration utilities (neo4j_migration.py):

  • One-shot helpers for upgrading existing Neo4j databases: audit_duplicate_uids, resolve_duplicate_uids, apply_uniqueness_constraints, backfill_entity_type_labels

Documentation (docs/open_source/graph.mdx + four experiment pages):

  • Updated graph capability overview and four experimental result pages comparing baseline vector search against graph-enriched search

Bruno collection (tools/bruno/):

  • Full end-to-end Bruno API collection covering health, ingestion, standard search, graph search, graph analytics, relationship CRUD, and deduplication flows across 7 folders

Dependencies: No new runtime Python dependencies. The Neo4j GDS plugin is optional; only the GDS-backed analytics endpoints (PageRank, communities, betweenness) need it, and all other endpoints work with a vanilla Neo4j instance.

Fixes/Closes

Fixes #(issue number)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

| Test file | What it covers |
| --- | --- |
| test_neo4j_knowledge_graph.py | Entity types, RELATED_TO edge creation, traversal |
| test_neo4j_knowledge_graph_integration.py | Integration against a live Neo4j instance |
| test_neo4j_pagerank_pipeline.py | Background PageRank pipeline |
| test_neo4j_shortest_path.py | Shortest-path queries |
| test_neo4j_subgraph_extraction.py | Ego-graph / subgraph extraction |
| test_neo4j_degree_centrality.py | Degree centrality |
| test_neo4j_betweenness_centrality.py | Betweenness centrality (GDS) |
| test_neo4j_cross_collection_traversal.py | Cross-collection traversal |
| test_neo4j_gds_refinements.py | GDS edge cases and path-quality scoring |
| test_neo4j_graph_stats.py | Graph stats endpoint |
| test_graph_data_types.py | Data type unit tests |
| test_episode_dedup.py | Episode content-hash deduplication |
| test_neo4j_feature_relationships_integration.py | Semantic relationship storage |
| test_neo4j_graph_relationships_integration.py | Graph relationship integration |
| test_neo4j_utils.py | Neo4j utility helpers |
| test_semantic_memory_graph_enrichment.py | Semantic memory graph enrichment |
| test_semantic_prompt_template.py | Prompt template |
| test_declarative_memory_entity_types.py | Entity type filtering in declarative memory |
| test_declarative_memory_graph_search.py | Graph-assisted declarative search |
| test_graph_router.py | REST API graph router unit tests |
| test_graph_integration.py | REST API graph integration tests |

Test Results: All unit tests pass locally. Integration tests require a running Neo4j instance. GDS analytics tests additionally require the Neo4j GDS plugin.

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

  • GDS analytics endpoints (/analytics/pagerank, /analytics/communities, /analytics/betweenness) require the Neo4j Graph Data Science plugin and gds_enabled: true in the Neo4j configuration. All other graph endpoints work with a standard Neo4j instance.
  • The episode content-hash deduplication migration runs automatically on startup and is safe to apply to existing PostgreSQL and SQLite databases.
  • RELATED_TO edges with similarity >= 0.99 are suppressed to avoid near-duplicate noise between identical or near-identical semantic features.
  • Path-quality scoring is applied during multi-hop traversal: a result at hop distance d receives a score of score_decay^d (default score_decay = 0.7), and paths crossing low-similarity RELATED_TO edges are penalised via the path_quality field on MultiHopResult.
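The hop-distance decay described in the last point can be sketched with a plain breadth-first traversal. This only illustrates the score_decay^d term: the path_quality penalty for low-similarity RELATED_TO edges is not modelled, and the adjacency-dict input, the `hop_scores` name, and the exclusion of the anchor from the results are assumptions rather than the behaviour of MultiHopResult.

```python
from collections import deque


def hop_scores(adjacency, anchor, max_hops, score_decay=0.7):
    """Breadth-first traversal from anchor; a node first reached at
    hop distance d receives score_decay ** d. The anchor itself is skipped."""
    scores = {}
    seen = {anchor}
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth > 0:
            scores[node] = score_decay ** depth
        if depth < max_hops:
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return scores


# Toy graph based on the example in the PR description.
graph = {"Bob": ["TensorFlow"], "TensorFlow": ["Project Atlas"]}
scores = hop_scores(graph, "Bob", max_hops=2)
```

With the default decay, the one-hop neighbour scores about 0.7 and the two-hop neighbour about 0.49, so closer results rank higher before any path-quality adjustment.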

@ChristianKniep added the poc label (Proof-of-concept implementation for a solution, feature, idea, etc.) on Feb 25, 2026