Multimodal Personal Photo Knowledge Retrieval System
Built for INFO 7375 -- Prompt Engineering for Generative AI (Northeastern University, Spring 2026)
Raghu Ram Shanta Rajamani -- shantharajamani.r@northeastern.edu
What It Does · The Stack · Architecture · Setup · Usage · RL Extension · Evaluation Results · Project Structure
This repository contains code for three incrementally developed assignments, each on its own branch:
Assignment Branch Generative AI Project Assignment (Final Project) main← you are hereTake-Home Final: Reinforcement Learning for Agentic AI Systems feature/reinforcement-learning-extensionBuilding Agentic Systems Assignment feature/building_agentic_systems_assignment
You took a photo of a grocery receipt last Tuesday. Now you need to know exactly what you spent. You could scroll through 500 photos, or you could just ask.
PhotoMind turns your phone's photo library into a queryable knowledge base. It analyzes photos with GPT-4o Vision, indexes them into a hybrid vector search engine, and answers natural-language questions with confidence scores and source attribution. Two RL components -- contextual bandits for query routing and a DQN for confidence calibration -- eliminate silent failures entirely.
Repository: github.com/raghuneu/PhotoMind · Demo: youtu.be/XMeceeWHjCE
| Step | What Happens | Details |
|---|---|---|
| 1. Ingest | Scans photos/ and analyzes each image with GPT-4o Vision |
OCR, entity extraction, classification. Dual-writes to JSON + Qdrant vector DB. Idempotent. |
| 2. Index | Builds a hybrid-searchable knowledge base | Dense embedding ANN + keyword matching via Reciprocal Rank Fusion (RRF) |
| 3. Route | Classifies query intent automatically | Factual / Semantic / Behavioral — RL bandit or rule-based fallback |
| 4. Retrieve | Searches KB with the best strategy for the query type | Entity matching, keyword overlap, or frequency aggregation |
| 5. Synthesize | Produces a grounded answer with confidence grade and citations | Grade A–F, numeric score 0–1, source photo filenames |
Three query modes:
- Factual — "How much did I spend at ALDI?" → extracts $18.69 from receipt OCR
- Semantic — "Show me photos of pizza" → matches against visual descriptions
- Behavioral — "What type of food do I photograph most?" → aggregates patterns across all photos
Every answer includes a confidence grade (A–F) and cites the specific source photo.
- GPT-4o Vision -- Photo analysis, OCR, entity extraction, classification
- CrewAI -- Multi-agent orchestration (4 agents, 2 crews)
- sentence-transformers --
all-MiniLM-L6-v2for local embeddings (384-dim) - PyTorch -- DQN confidence calibrator
- scikit-learn -- KMeans clustering for bandit context features
- Qdrant -- Vector database with hybrid search (dense ANN + keyword via RRF)
- JSON fallback -- Flat file backend for offline/RL training
- Repository pattern -- Swappable backends via ABC
- FastAPI -- REST API with LRU cache, API key auth, SSE streaming, multi-user scoping
- Pydantic -- Settings management, input validation
- React + TypeScript + Vite -- SPA frontend
- MUI (Material UI) -- Component library
- Recharts -- Data visualization
INGESTION CREW (Process.sequential)
[Scan photos/] → [Analyze with GPT-4o Vision] → [Build JSON knowledge base]
(--direct flag available for faster batch processing via direct API calls)
Dual-write: ingests into both JSON file and Qdrant vector DB simultaneously
QUERY CREW (Process.hierarchical, manager-delegated)
[Controller] classifies query intent → delegates to specialists
├── Task 1: Knowledge Retriever — searches KB with PhotoKnowledgeBaseTool
└── Task 2: Insight Synthesizer — synthesizes answer with confidence grade + citation
STORAGE LAYER (Repository Pattern)
├── QdrantPhotoRepository (default) — vector DB with hybrid search (RRF)
└── JsonPhotoRepository (fallback) — flat JSON file for offline/RL training
API SERVER (FastAPI)
uvicorn api.server:app — REST API with LRU cache, API key auth, multi-user scoping
FEEDBACK LOOP (persistent, adaptive)
[Eval results] → [FeedbackStore] → adjusts confidence thresholds per strategy
| Agent | Role | Tools |
|---|---|---|
| Controller | Orchestrates query routing; classifies intent (factual/semantic/behavioral) | — (manager) |
| Photo Analyst | Analyzes images: OCR, entity extraction, classification | PhotoVisionTool, DirectoryReadTool |
| Knowledge Retriever | Searches knowledge base; retrieves evidence | PhotoKnowledgeBaseTool (custom), FileReadTool |
| Insight Synthesizer | Synthesizes grounded answers with confidence grading | FileReadTool, SerperDevTool (optional) |
| Tool | Type | Purpose |
|---|---|---|
PhotoVisionTool |
Custom | GPT-4o Vision wrapper with HEIC support (src/tools/photo_vision.py) |
PhotoKnowledgeBaseTool |
Custom | 3-strategy search with feedback integration (src/tools/photo_knowledge_base.py) |
DirectoryReadTool |
Built-in | Scans the photos directory |
FileReadTool |
Built-in | Reads the knowledge base file |
JSONSearchTool |
Built-in | Embedding-based JSON search (sentence-transformers) |
SerperDevTool |
Built-in (optional) | Web search enrichment |
- Python 3.10–3.13 (CrewAI requires
< 3.14) - OpenAI API key (GPT-4o access)
- Docker (for Qdrant vector DB — optional, falls back to JSON)
# If your system Python is 3.14+, use pyenv:
~/.pyenv/versions/3.10.14/bin/python3 -m venv .venv
source .venv/bin/activate
# Otherwise:
python3 -m venv .venv
source .venv/bin/activatepip install -r requirements.txtFirst run downloads the all-MiniLM-L6-v2 sentence-transformer model (~80MB) for local embeddings.
cp .env.example .env
# Edit .env — add your OPENAI_API_KEYRequired:
OPENAI_API_KEY— GPT-4o Vision + agent reasoning
Optional:
SERPER_API_KEY— enables web search enrichment in answersREPOSITORY_BACKEND—json(default) orqdrantfor vector DB
mkdir -p photos knowledge_base eval/resultsPlace 15–25 photos in photos/. iPhone photos (HEIC format) are fully supported. Recommended mix for best demo coverage:
- Grocery/restaurant receipts (3–5)
- Food photos (3–5)
- Screenshots from apps or websites (1–2)
- Documents or notes (1–2)
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrantQdrant runs as a local Docker container. The ingestion pipeline auto-creates the photos collection on first write. If Qdrant is unavailable, the system falls back to the JSON file backend automatically.
# Default: CrewAI multi-agent pipeline
python -m src.main ingest
# Fast mode: direct API calls, bypasses CrewAI agents
python -m src.main ingest --directAnalyzes all photos in photos/ using GPT-4o Vision and dual-writes to knowledge_base/photo_index.json and Qdrant. Idempotent — re-running skips already-indexed photos.
python -m src.main query "How much did I spend at ALDI?" # factual
python -m src.main query "Show me photos of pizza" # semantic
python -m src.main query "What type of food do I photograph most?" # behavioral
python -m src.main query "What was my electric bill?" # edge case — should decline
# Fast path — skip CrewAI, zero OpenAI cost, sub-second latency
python -m src.main query --direct "How much did I spend at ALDI?"uvicorn api.server:app --reload --port 8000The FastAPI server exposes the knowledge base and RL models as a REST API. Two query modes:
- Fast (
mode: "fast") — direct Python search with RL routing, no OpenAI calls, <1s, free - Full (
mode: "full") — CrewAI pipeline with GPT-4o reasoning, 15–45s, ~$0.01–0.05/query
| Endpoint | Method | Auth | Description |
|---|---|---|---|
/api/query |
POST | API key | Query the knowledge base |
/api/query/stream |
POST | API key | SSE streaming variant — emits routing/retrieval/token/done events for progressive UI rendering |
/api/knowledge-base |
GET | — | List all indexed photos |
/api/health |
GET | — | Health check with backend status |
/api/cache/clear |
POST | API key | Clear the LRU query cache |
/api/feedback |
POST | API key | Submit feedback for adaptive thresholds |
/api/eval/results |
GET | — | Latest evaluation results |
Headers:
X-API-Key— required for POST endpoints whenAPI_KEYis set in.envX-User-Id— optional; scopes queries to a per-user Qdrant collection and JSON KB
Example query:
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"query": "How much did I spend at ALDI?", "mode": "fast"}'python -m src.main evalRuns 20 test queries across 4 categories and reports:
- Retrieval Accuracy — was the correct source photo found?
- Routing Accuracy — was the query intent correctly detected?
- Silent Failure Rate — did the system ever confidently return a wrong answer?
- Decline Accuracy — were impossible queries correctly refused?
Results saved to eval/results/eval_results.json. Run history appended to eval/results/eval_history.json for trend analysis.
# Train both components across 5 seeds (4000 episodes each, ~120s total)
python -m src.main train
# Train with custom episode count
python -m src.main train 1000
# Run full RL evaluation: 5 configs x 5 seeds x 83 queries
python -m src.main rl-eval
# Run 7-config ablation study → eval/results/ablation_results.json
python -m src.main ablationRL training runs entirely on the cached knowledge base (zero API cost). Requires an existing knowledge_base/photo_index.json — run ingestion first. Trained models are saved to knowledge_base/rl_models/ and loaded automatically at query time.
PhotoMind integrates two RL approaches that replace rule-based components:
Replaces the keyword-based _classify_query() in PhotoKnowledgeBaseTool with a learned policy that selects the optimal search strategy (factual / semantic / behavioral) based on query features.
- ThompsonSamplingBandit — Beta posterior per context cluster, provably optimal exploration
- UCBBandit — UCB1 upper confidence bound per cluster
- EpsilonGreedyBandit — Baseline comparison
- Context clustering via KMeans on 396-dimensional hybrid query feature vectors (12 handcrafted + 384 MiniLM embedding dims)
- Training: 4000 episodes × 5 seeds on offline cached search results (zero API cost)
Replaces static confidence thresholding with a DQN that learns when to accept, hedge, or decline retrieval results — directly addressing the silent failure problem.
- Architecture: FC(8→64) → ReLU → FC(64→64) → ReLU → FC(64→5), adapted from the LunarLander DQN (extended with a requery action)
- State: 8-dim vector (top score, score gap, result count, strategy index, query features, entity match indicators)
- Actions:
accept_high,accept_moderate,hedge,requery,decline - Reward: penalty matrix that heavily penalizes silent failures (confident-but-wrong answers)
Key result: Full RL eliminates silent failures (0.0% vs 1.8% baseline; p < 0.0001). See Evaluation Results for the full ablation table.
Offline simulation training: Both components are trained using PhotoMindSimulator, which pre-computes all 3 search strategies on all 83 queries once (zero API calls). Training 4000 episodes × 5 seeds × 2 components takes ~120 seconds on CPU.
src/tools/photo_knowledge_base.py — the core differentiator.
Three search strategies selected automatically by query intent:
| Strategy | Trigger Keywords | How It Works |
|---|---|---|
| Factual | "how much", "date", "address", "items", "vendor" | Entity matching + OCR text search |
| Semantic | Default (no other match) | Keyword overlap on descriptions, normalized by meaningful words |
| Behavioral | "most", "often", "how many", "breakdown", "pattern" | Frequency aggregation across all photos |
Output always includes: confidence grade (A–F), numeric score (0–1), source photo filenames, and a plain-language summary.
Input schema (Pydantic):
query: str # Natural language question
query_type: str = "auto" # Force strategy or let tool classify
top_k: int = 3 # Number of results
confidence_threshold: float = 0.15 # Minimum score to include| Metric | Score |
|---|---|
| Retrieval Accuracy | 100% |
| Routing Accuracy | 85% |
| Silent Failure Rate | 0% |
| Decline Accuracy | 100% |
| Avg Latency | ~46s/query |
| Config | Retrieval | Routing | Silent Fail | Decline Acc |
|---|---|---|---|---|
| Baseline (rule-based) | 87.5% | 76.8% | 1.8% | 90.9% |
| Bandit Only (Thompson) | 86.8% | 66.4% | 0.0% | 81.8% |
| DQN Only | 87.5% | 76.8% | 0.4% | 100.0% |
| Full RL (Thompson+DQN) | 87.5% | 67.1% | 0.0% | 98.2% |
Statistical tests (Full RL vs Baseline): silent failure p < 0.0001 (***), decline accuracy p = 0.016 (*)
PhotoMind/
├── api/
│ └── server.py # FastAPI REST API (LRU cache, auth, multi-user)
├── src/
│ ├── main.py # CLI entry point (ingest / query / eval / train / rl-eval / ablation)
│ ├── config.py # Pydantic settings (reads .env)
│ ├── ingest_direct.py # Direct ingestion (1 API call/photo, dual-write)
│ ├── agents/
│ │ └── definitions.py # 4 agent factory functions
│ ├── tasks/
│ │ ├── ingestion.py # Scan, analyze, index tasks
│ │ └── query.py # Query task with intent routing
│ ├── crews/
│ │ ├── ingestion_crew.py # Sequential ingestion pipeline
│ │ └── query_crew.py # Hierarchical query pipeline
│ ├── tools/
│ │ ├── photo_vision.py # PhotoVisionTool (GPT-4o Vision + HEIC)
│ │ ├── photo_knowledge_base.py # PhotoKnowledgeBaseTool (custom) — RL-enhanced
│ │ ├── query_memory.py # Query memory and deduplication
│ │ └── feedback_store.py # FeedbackStore (adaptive threshold learning)
│ ├── storage/
│ │ ├── __init__.py # Public exports (PhotoRepository, get_repository)
│ │ ├── repository.py # ABC + JsonPhotoRepository + QdrantPhotoRepository + factory
│ │ └── qdrant_client.py # Qdrant connection helper (hybrid search, RRF)
│ └── rl/
│ ├── rl_config.py # Centralized RL hyperparameters and reward matrix
│ ├── feature_extractor.py # Query → 396-dim hybrid feature vector
│ ├── contextual_bandit.py # Thompson Sampling, UCB, epsilon-greedy bandits
│ ├── dqn_confidence.py # ConfidenceDQN and ConfidenceDQNAgent
│ ├── replay_buffer.py # Experience replay buffer (adapted from LunarLander)
│ ├── reward.py # Reward computation for bandit and DQN
│ ├── simulation_env.py # Offline training environment (zero API cost)
│ └── training_pipeline.py # Orchestrates training across seeds
├── eval/
│ ├── test_cases.py # 20 original hand-labeled test queries
│ ├── expanded_test_cases.py # 63 new cases (incl. 14 ambiguous) — 83 total
│ ├── novel_test_cases.py # 15 intent-shift queries for robustness testing
│ ├── run_evaluation.py # Base system evaluation harness
│ ├── run_rl_evaluation.py # RL 5-config comparison harness
│ ├── ablation.py # 7-config ablation with paired t-tests
│ ├── statistical_analysis.py # CI, paired t-test, Cohen's d utilities
│ └── results/ # JSON results + eval history
├── viz/
│ ├── plot_learning_curves.py # Bandit regret, DQN rewards, posteriors, epsilon decay
│ ├── plot_ablation.py # Grouped bar chart (7 configs x 4 metrics)
│ ├── plot_regret.py # Cumulative regret comparison (3 bandit types)
│ ├── plot_before_after.py # Before/after RL comparison plots
│ ├── generate_diagrams.py # Architecture and flow diagrams
│ ├── figures/ # Generated PNG and PDF figures
│ └── diagrams/ # Mermaid diagram sources (.mmd)
├── scripts/
│ ├── train_full.py # Full training + optional ablation
│ ├── train_bandit.py # Standalone bandit training
│ ├── train_dqn.py # Standalone DQN training
│ ├── precompute_cache.py # Pre-compute search strategy cache
│ ├── scaling_benchmark.py # Scaling and performance benchmarks
│ ├── migrate_to_qdrant.py # JSON → Qdrant migration utility
│ └── demo_comparison.py # Rule-based vs RL before/after demo
├── knowledge_base/
│ ├── photo_index.json # 53 indexed photos
│ └── rl_models/ # Trained RL models
│ ├── bandit_thompson.pkl # Trained Thompson Sampling bandit
│ └── dqn_confidence.pth # Trained DQN confidence calibrator
├── tests/
│ ├── test_core.py # Core RL functionality tests (59 tests)
│ ├── test_search_strategies.py # Search strategy correctness tests (18 tests)
│ └── test_repository.py # Repository abstraction tests (13 tests)
├── web/ # React + TypeScript + Vite frontend (MUI, Recharts)
│ ├── src/
│ │ ├── App.tsx # Main app component
│ │ ├── components/ # UI components
│ │ └── theme.ts # MUI theme configuration
│ ├── index.html
│ └── package.json
├── docs/
│ ├── figures/ # Copied figures for GitHub Pages
│ ├── math_formulations.md # Mathematical formulations for RL components
│ ├── mermaid_diagrams/ # Mermaid diagram sources
│ └── report/ # LaTeX technical report (main.tex → main.pdf)
├── Dockerfile # Multi-stage build (Node frontend + Python backend)
├── .dockerignore
├── .env.example
├── .gitignore
├── LICENSE
├── requirements.txt
└── TECHNICAL_REPORT.md # Full technical documentation (base system + RL extension)


