ScreenMind/README.md at main · ayushh0110/ScreenMind

**Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory**
**100% local. 100% private. Zero cloud dependencies.**

**Features** · **Comparison** · **Gemma 4 Deep Dive** · **Quick Start** · **Architecture** · **Agent Platform** · **MCP** · **API**

| Agents | | --- | |  |
**💬 Chat in Action** — _Ask anything about your screen history_

> **Microsoft showed the world wants screen-aware AI with Recall.** But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities. > > > It's not just a screen recorder. It's an **AI memory** you can talk to, search through, and build automations on top of.
- * *
✨ Features
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-features)
🧠 Core Intelligence
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-core-intelligence)
- **📸 Smart Capture** — Content-change detection, not a fixed timer. Captures when your screen _actually_ changes.
- **🔬 Gemma 4 Vision Analysis** — Every screenshot analyzed: app detection, activity categorization, mood, scene description, spatial layout regions.
- **🔍 Hybrid Search** — Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by _meaning_, not just keywords.
- **💬 Chat with Memory** — Conversational RAG with follow-up support. Ask "what did Alex say on Discord?" → get the actual message.
- **🧠 Model Hub** — In-app model download with live progress tracking. Chat and Summary are locked with witty brain animations until the model is ready — then auto-unlock. No terminal needed.
- **🎙️ Voice Memos** — Hold `Ctrl+Shift+V` → Gemma 4's native audio encoder transcribes. Screenshot captured alongside.
- **🎤 Meeting Transcription** — Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.
- **📊 Analytics Dashboard** — Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
- **⏪ Day Rewind** — Timelapse playback of your entire day with play/pause/scrub/speed controls.
⚡ Performance
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-performance)
- **Three Analysis Modes** — Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
- **Per-App pHash Cache** — 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls.
- **Chat-First GPU Priority** — Chat cancels in-flight analysis instantly. GPU freed in <1s.
- **Auto-Pause Heavy Apps** — Games, video editors, 3D software detected → capture pauses automatically.
🔒 Privacy & Security
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-privacy--security)
- **100% Local** — All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.
- **Sensitive Data Filter** — Auto-redacts credit cards, SSNs, API keys, passwords before storage.
- **Encryption at Rest** — AES encryption for screenshots (Fernet + OS keyring).
- **Dashboard PIN Lock** — Session-based auth with configurable auto-lock timeout.
- **Incognito Mode** — One-click pause. Nothing recorded.
**🔌 Integrations & Extensibility**
| Integration | Description | | --- | --- | | 🤖 **Agent Platform** | Build automations in Markdown (English) or Python. Drop a file, get an agent. | | 🔌 **MCP Server** | Expose screen history to Claude Desktop, Cursor, VS Code | | 📓 **Obsidian** | Auto-sync daily summaries to your vault | | 📋 **Notion** | Push summaries to a Notion database | | 🪝 **Webhooks** | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) | | 🔔 **Smart Notifications** | Distraction alerts, break reminders | | ⭐ **Auto-Bookmark** | Keyword triggers (`git push`, `deploy`) auto-flag important moments |
⌨️ System-Wide Hotkeys
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#%EF%B8%8F-system-wide-hotkeys) | Hotkey | Action | | --- | --- | | `Ctrl+Shift+B` | 📸 Instant bookmarked capture | | `Ctrl+Shift+P` | ⏸ Toggle pause/resume | | `Ctrl+Shift+V` | 🎤 Hold to record voice memo |
> All hotkeys customizable from Settings.
- * *
📊 How ScreenMind Compares
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-how-screenmind-compares) | Feature | **ScreenMind** | **Screenpipe** | **Microsoft Recall** | | --- | --- | --- | --- | | **License** | ✅ MIT (fully open-source) | Source-available (commercial license required for business use) | Proprietary | | **Cost** | ✅ Free forever | Free (personal) / Paid (commercial) | Requires $1000+ Copilot+ PC | | **Privacy** | ✅ Zero network calls. Zero telemetry. Ever. | Local-first, optional cloud | Telemetry opt-in. Data stayed local after backlash. | | **Min. hardware** | ✅ Any GPU ≥4GB VRAM (or CPU-only) | 8GB RAM, modern CPU | 40 TOPS NPU + 16GB RAM + BitLocker + Windows Hello | | **AI architecture** | ✅ Single model — Gemma 4 does vision + audio + reasoning | Multiple models — OCR + Whisper + external LLM | Proprietary NPU model | | **Audio/meetings** | ✅ Native — Gemma 4 audio encoder (no Whisper needed) | Whisper-based transcription | ❌ Not supported | | **Smart capture** | ✅ pHash deduplication + idle detection + auto-pause for games | Event-driven (app switches, clicks) | Periodic snapshots | | **Search** | ✅ Semantic (MiniLM embeddings) + FTS5 keyword — hybrid fusion | Semantic + keyword + a11y tree | Semantic only (NPU) | | **Chat with memory** | ✅ Full conversational RAG with follow-ups and vision fallback | ❌ | ❌ | | **Agent system** | ✅ No-code Markdown agents + Python SDK + MCP server | Pipes (TypeScript) + MCP | ❌ | | **In-app Model Hub** | ✅ Download, switch, manage models from UI — no terminal | ❌ | ❌ | | **Encryption** | ✅ AES (Fernet) + OS keyring | Optional | TPM + BitLocker | | **PII auto-redaction** | ✅ Transparent regex — CC (Luhn-validated), SSN, API keys, passwords | AI-based PII model | Content filtering | | **Integrations** | ✅ Obsidian · Notion · Webhooks · MCP | MCP, SDK (Tauri/Electron/Swift) | Windows ecosystem only | | **Platform** | ✅ Windows · macOS · Linux (X11 + Wayland) | Windows · macOS · Linux | Windows 11 only (Copilot+ PCs) |
> **TL;DR:** ScreenMind is the only option that's fully MIT open-source, runs on any hardware (including a $150 GPU), handles vision + audio + reasoning with a single local model, and lets you actually _chat_ with your screen memory.
- * *
🧠 How Gemma 4 Is Used
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-how-gemma-4-is-used) Gemma 4 E2B is not a bolt-on — it's architecturally load-bearing. ScreenMind uses **all three modalities**:
1. Vision — Screenshot Analysis
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#1-vision--screenshot-analysis) Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:
- App name, activity category, summary, detailed context
- Mood classification, confidence score
- Rich scene description (every visible element inventoried)
- Layout regions (sidebar, chat area, toolbar boundaries)
**Three modes**_(benchmarked on GTX 1650 4GB — scales dramatically with better GPUs):_
- **Accurate** — single call with thinking (~76s). Best layout detection.
- **Balanced** — thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
- **Fast** — no-thinking prefill trick (~12s). Layout via OCR clustering instead.
**⚡ GPU Scaling — How fast on your hardware?** The numbers above are from a **GTX 1650 (4GB VRAM)** — a worst-case scenario where the model spills to CPU RAM. With more VRAM, the entire model fits on GPU and inference speeds up dramatically:
| GPU | VRAM | Bandwidth | Regime | ~Fast Mode | Why | | --- | --- | --- | --- | --- | --- | | **GTX 1650**_(baseline)_ | 4 GB | ~190 GB/s | spilling | ~12s | CPU-bottlenecked, partial offload | | **RTX 3060** | 12 GB | ~360 GB/s | full fit | ~3-4s | Spill eliminated — the big jump | | **RTX 4060 Ti** | 16 GB | ~290 GB/s | full fit | ~2-3s | Fits easily, more compute for vision | | **RTX 3090** | 24 GB | ~935 GB/s | full fit | ~1-2s | High bandwidth | | **RTX 4090** | 24 GB | ~1000 GB/s | full fit | ~1s | Top consumer card |
> **Key insight:** The biggest jump is from "spilling" (model doesn't fit in VRAM) to "full fit" (it does). Any GPU with ≥6GB VRAM should run E2B entirely on GPU and see 3-5x speedup over the baseline.
2. Audio — Voice Memos & Meeting Transcription
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#2-audio--voice-memos--meeting-transcription) Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:
- Voice memo transcription (hold hotkey → speak → release)
- Meeting transcription (15s chunks, map-reduce summarization for long meetings)
No Whisper dependency. One model handles everything.
3. Reasoning — Summaries, Chat, Agents
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#3-reasoning--summaries-chat-agents)
- **Daily summaries** with deep reasoning (`think=True`)
- **Chat answers** grounded in actual screen data (text-first RAG with vision fallback)
- **Agent execution** — Gemma processes markdown agent prompts with injected screen data
Why E2B Specifically?
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#why-e2b-specifically) | Constraint | Why It Rules Out Alternatives | | --- | --- | | Must run **continuously in background** | Rules out 12B+ models (too heavy) | | Must understand **screenshots natively** | Rules out text-only models | | Must stay **100% local** for privacy | Rules out cloud APIs | | Must handle **audio natively** | Rules out models without audio encoder | | Must be **fast enough** for 30s cycle | E2B: 12-76s on GTX 1650, ~1-4s on RTX 3060+ |
Gemma 4 E2B is the only model that checks all five boxes.
- * *
🚀 Quick Start
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-quick-start) > **Requirements:** Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model
#### 1️⃣ Clone & Install
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#1%EF%B8%8F%E2%83%A3-clone--install)
git clone https://github.com/ayushh0110/ScreenMind.git cd ScreenMind
python -m venv venv venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
#### 2️⃣ Run
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#2%EF%B8%8F%E2%83%A3-run)
python main.py
#### 3️⃣ Open → **http://127.0.0.1:7777**
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#3%EF%B8%8F%E2%83%A3-open--http1270017777) On first run, ScreenMind will:
- Auto-detect your GPU and download `llama-server` if not found (CUDA/CPU auto-selected)
- Open the **Model Hub** — download Gemma 4 E2B GGUF (~5GB) with progress tracking right in the UI
- Chat and Summary stay locked (🧠💤 _"I need my brain to think!"_) until the model is ready, then auto-unlock
- Start `llama-server` in background
- Show the welcome screen to set up an optional PIN
- Create `~/.screenmind/` for data storage
**⚙️ Optional: Configure via .env**
cp .env.example .env
Edit capture interval, blocked apps, hotkeys, etc.
Or configure everything from the **Settings** tab in the dashboard.
- * *
🏗️ Architecture
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#%EF%B8%8F-architecture) > For a full deep-dive into threading, caching, search internals, and the privacy pipeline, see **ARCHITECTURE.md**.
``` ┌─────────────────────────────────────────────────────────────────────┐ │ ScreenMind │ │ │ │ ┌────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │ │ │ Capture │───▶│ Async Queue │───▶│ Analysis Worker │ │ │ │ Worker │ │ (max: 100) │ │ │ │ │ │ │ └──────────────┘ │ ┌───────────────────┐ │ │ │ │ • Screen │ │ │ Per-App pHash │ │ │ │ │ • Window │ │ │ Cache (3-tier) │ │ │ │ │ • Dedup │ │ └───────────────────┘ │ │ │ │ • A11y │ │ │ │ │ │ │ • Privacy │ │ ▼ │ │ │ └────────────┘ │ ┌───────────────────┐ │ │ │ │ │ EasyOCR │ │ │ │ ┌────────────┐ │ │ (text extract) │ │ │ │ │ Audio │ │ └───────────────────┘ │ │ │ │ Worker │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ • Meeting │ │ ┌───────────────────┐ │ │ │ │ detect │ │ │ Gemma 4 E2B │ │ │ │ │ • Record │ │ │ (via llama.cpp) │ │ │ │ │ • Transcr. │ │ │ Vision + Audio │ │ │ │ └────────────┘ │ └───────────────────┘ │ │ │ │ │ │ │ │ ┌────────────┐ │ ▼ │ │ │ │ Agent │ │ ┌───────────────────┐ │ │ │ │ Scheduler │ │ │ Layout Analyzer │ │ │ │ │ │ │ │ (spatial OCR) │ │ │ │ │ • .md AI │ │ └───────────────────┘ │ │ │ │ • .py code │ │ │ │ │ │ └────────────┘ │ ▼ │ │ │ │ ┌───────────────────┐ │ │ │ │ │ MiniLM-L6-v2 │ │ │ │ │ │ (embeddings) │ │ │ │ │ └───────────────────┘ │ │ │ └─────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌───────────────────┐ │ │ │ SQLite (WAL) │ │ │ │ + FTS5 index │ │ │ └─────────┬─────────┘ │ │ │ │ │ ┌───────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌───────────────────────────────────────────────────────────────┐ │ │ │ FastAPI REST Server │ │ │ │ /timeline · /search · /chat · /stats · /agents · /mcp │ │ │ │ │ │ │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ │ │ Web Dashboard (Vanilla JS SPA) │ │ │ │ │ │ Timeline · Chat · Search · Analytics · Agents · Settings │ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ └───────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────┘ ```
Multi-Model AI Pipeline
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#multi-model-ai-pipeline)
``` Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5 ↑ OCR text fed as context (Gemma sees image + reads text) ```
Four AI models working in concert, with Gemma 4 as the brain:
1. **EasyOCR** — extracts raw screen text 2. **Gemma 4 E2B** — understands what you're doing (vision + reasoning) 3. **MiniLM-L6-v2** — generates semantic vectors for natural language search 4. **FTS5** — indexes text for instant keyword search
- * *
🤖 Agent Platform
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-agent-platform) ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.
Two Modes
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#two-modes) | Mode | File Type | For | Example | | --- | --- | --- | --- | | 🤖 AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data | | 🐍 Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |
Markdown Agent Example
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#markdown-agent-example)
--- name: Daily Focus Report schedule: every 6h data: timeline, apps, mood output: local, obsidian ---
Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.
Drop this file in `~/.screenmind/agents/` — it runs automatically.
Python Plugin SDK
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#python-plugin-sdk)
from screenmind_sdk import ScreenMindSDK
sdk = ScreenMindSDK("my-tracker")
Get today's activities filtered by app
activities = sdk.get_activities(app="Chrome", limit=20)
Persistent state across runs
last_count = sdk.load_state("url_count", 0) urls = sdk.get_urls_visited() sdk.save_state("url_count", len(urls))
Ask Gemma (GPU-safe — waits for idle)
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}") print(insight)
Data Selectors (Frontmatter)
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#data-selectors-frontmatter) Markdown agents declare what data they need:
| Selector | Injects | | --- | --- | | `timeline` | Recent activities with timestamps, apps, summaries | | `apps` | App usage counts + category breakdown | | `urls` | URLs visited (extracted from browser address bars) | | `meetings` | Meeting summaries and durations | | `mood` | Mood/sentiment from screen analysis |
Data injection auto-scales to your model's context window.
4 Agents Ship Built-In
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#4-agents-ship-built-in)
- **daily-journal.md** — First-person journal entry from your day
- **focus-report.md** — Focus score, deep work hours, distractions
- **meeting-actions.md** — Extract action items from meeting transcripts
- **code-changelog.md** — Summarize coding activity (commits, files, repos)
- * *
🔌 MCP Server (Claude / Cursor / VS Code)
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-mcp-server-claude--cursor--vs-code) ScreenMind exposes your screen history to any MCP-compatible AI tool:
python mcp_server.py # stdio transport
**Claude Desktop config** (`~/.claude/claude_desktop_config.json`):
{ "mcpServers": { "screenmind": { "command": "python", "args": ["C:/path/to/screenmind/mcp_server.py"] } } }
Tools Available
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#tools-available) | Tool | Description | | --- | --- | | `search_screen` | Semantic + keyword search across all history | | `get_recent_activity` | Last N activities with full details | | `get_activity_by_time` | Activities for a specific date/time range | | `get_daily_summary` | AI-generated daily summary | | `capture_now` | Trigger instant screenshot | | `get_stats` | Usage statistics | | `search_audio` | Search meeting transcripts | | `get_screenshot` | Retrieve screenshot path by activity ID |
- * *
📡 API Reference
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-api-reference) Full Swagger docs at `http://127.0.0.1:7777/docs`
Key Endpoints
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#key-endpoints) | Method | Endpoint | Description | | --- | --- | --- | | `GET` | `/api/status` | System health, worker stats | | `GET` | `/api/timeline?date=2026-05-21` | Activities for a date | | `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search | | `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) | | `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) | | `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames | | `POST` | `/api/summary/generate` | Generate AI daily summary | | `GET` | `/api/agents` | List all agents | | `POST` | `/api/agents/{name}/run` | Trigger agent execution | | `POST` | `/api/capture/pause` | Pause capture | | `POST` | `/api/incognito/toggle` | Toggle incognito mode |
- * *
⚙️ Configuration
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#%EF%B8%8F-configuration) All settings configurable via `.env`, environment variables, or the **Settings** dashboard (persists to `settings.json`).
| Variable | Default | Description | | --- | --- | --- | | `CAPTURE_INTERVAL` | `40` | Seconds between captures | | `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) | | `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` | | `BLOCKED_APPS` | _(empty)_ | Comma-separated apps to never capture | | `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected | | `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) | | `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest | | `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys | | `SCREENMIND_LOG_LEVEL` | `INFO` | Log verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR` | | `SCREENMIND_LOG_FILE` | _(none)_ | Path to a log file (rotating, 10MB × 3 backups) |
> See `.env.example` for the full list.
- * *
🔧 Tech Stack
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-tech-stack) | Layer | Technology | Why | | --- | --- | --- | | **Vision + Audio AI** | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM | | **Inference Server** | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API | | **OCR** | EasyOCR | Extracts screen text fed to Gemma as context | | **Embeddings** | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search | | **Backend** | FastAPI + Uvicorn | Async-first, auto-generated API docs | | **Database** | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search | | **Capture** | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction | | **Wayland Capture** | grim (wlroots) / XDG Portal | Automatic fallback; no X11 dependency on Wayland | | **Frontend** | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI | | **Platform** | Windows / macOS / Linux (X11 + Wayland) | Abstraction layer with OS-specific adapters |
- * *
🐧 Wayland Support
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-wayland-support) ScreenMind auto-detects Wayland sessions and uses compositor-native capture:
| Compositor | Capture | Window Detection | Notes | | --- | --- | --- | --- | | **Sway** | ✅ grim | ✅ swaymsg IPC | Full support | | **Hyprland** | ✅ grim | ✅ hyprctl IPC | Full support | | **Niri** | ✅ grim | ✅ niri msg IPC | Full support | | **river / Wayfire / labwc** | ✅ grim | ⚠️ Title only (no IPC) | Capture works, app name may be unavailable | | **GNOME (Mutter)** | ⚠️ XDG Portal | ❌ No IPC available | Portal prompts on every capture — not viable for background recording | | **KDE (KWin)** | ⚠️ XDG Portal | ❌ No IPC available | Same as GNOME |
**Install grim** (recommended for wlroots compositors):
Arch
sudo pacman -S grim
Ubuntu / Debian (if available)
sudo apt install grim
Fedora
sudo dnf install grim
**GNOME / KDE Wayland**: Best-effort only. Screenshots use the XDG Desktop Portal, which prompts for permission on each capture — not viable for continuous background recording. For full functionality, use an X11 session or a wlroots-based compositor with grim.
**Optional** (for portal fallback): `python3-gi` / `python-gobject` system package.
- * *
📁 Project Structure
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-project-structure)
``` screenmind/ ├── main.py # Entry point — starts all services ├── config.py # Pydantic settings (env + runtime overrides) ├── setup_llama.py # Auto-detect + install llama-server ├── requirements.txt # Full Python dependencies ├── requirements-test.txt # Lightweight CI deps (no PyTorch) ├── mcp_server.py # MCP server for Claude/Cursor/VS Code ├── screenmind_sdk.py # SDK for Python plugin agents │ ├── capture/ # Screenshot capture layer │ ├── screen.py # Capture facade (mss / Wayland backend) │ ├── wayland.py # Wayland backend (grim / XDG Portal) │ ├── window.py # Active window detection │ ├── dedup.py # Perceptual hash deduplication │ ├── hotkey.py # Global hotkeys (bookmark, pause, voice) │ └── voice_recorder.py # Mic recording for voice memos │ ├── engine/ # AI & intelligence layer │ ├── analyzer.py # Gemma 4 vision analysis (dual mode) │ ├── llm_client.py # llama-server client (chat, vision, audio) │ ├── model_manager.py # Server lifecycle, model download/switch │ ├── embedder.py # MiniLM semantic embeddings │ ├── ocr.py # EasyOCR text extraction │ ├── layout_analyzer.py # Spatial OCR organization │ ├── dev_context.py # Git repo/branch/diff detection │ ├── a11y_extractor.py # Accessibility API text extraction │ └── agent_runner.py # Agent scheduling & execution │ ├── workers/ # Background processing │ ├── capture_worker.py # Smart capture loop + privacy filtering │ ├── analysis_worker.py # OCR → Gemma → Layout → Embed → Store │ └── audio_worker.py # Meeting detection & transcription │ ├── storage/ # Data persistence │ ├── database.py # SQLite + FTS5 + migrations │ └── models.py # Pydantic data models │ ├── privacy/ # Privacy & security │ ├── encryption.py # Fernet AES encryption at rest │ └── data_filter.py # Sensitive data redaction │ ├── platform_support/ # Cross-platform abstraction │ ├── windows.py # Win32 + UI Automation │ ├── macos.py # AppKit + AXUIElement │ └── linux.py # xdotool + AT-SPI │ ├── integrations/ # External connections │ ├── obsidian.py # Vault markdown export │ ├── notion.py # Notion API export │ ├── webhooks.py # HTTP webhooks (HMAC, retry) │ └── smart_notify.py # Distraction/break notifications │ ├── api/ # REST API + dashboard │ ├── server.py # FastAPI app + auth middleware │ ├── dependencies.py # Shared state for routes │ ├── routes/ # 16 route modules │ └── static/ # Web dashboard (HTML + CSS + JS) │ ├── default_agents/ # 4 built-in agents │ ├── daily-journal.md │ ├── focus-report.md │ ├── meeting-actions.md │ └── code-changelog.md │ ├── tests/ # pytest test suite (25 modules) │ ├── conftest.py # Shared fixtures │ └── test_*.py # Unit + integration tests │ └── docs/ └── BUILD_YOUR_OWN_AGENT.md ```
- * *
🛡️ Error Handling & Resilience
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#%EF%B8%8F-error-handling--resilience) | Scenario | Behavior | | --- | --- | | **llama-server not found** | Auto-downloads correct binary from GitHub releases (CUDA/CPU auto-detected). Checks disk space first. | | **Model not downloaded** | Model Hub shows lock screen with download cards. Progress tracked in UI. Chat/Summary locked until ready. | | **GPU out of memory** | Detects OOM, retries with delay, re-queues on persistent failure. | | **Duplicate frames** | pHash dedup skips identical screenshots (threshold: 8 hamming distance). | | **Stale queue items** | Captures >3 min old auto-skipped. Backfilled during idle. | | **App in blocklist** | Silently skips — no screenshot saved. | | **Meeting app closed** | Process-alive check + silence detection + 5-min hard timeout. | | **Chat during analysis** | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. | | **Crash recovery** | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |
- * *
🎨 Dashboard
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-dashboard) The web dashboard at `http://127.0.0.1:7777` features:
- **Timeline** — Browse activities by date with thumbnails, AI summaries, category badges
- **Chat** — Conversational AI with screen memory. Ask anything about your history. Locked with 🧠💤 brain animation until model is ready.
- **Search** — Semantic + keyword hybrid search with OCR highlighting on screenshots
- **Analytics** — Category charts, top apps, hourly heatmap, meeting stats
- **Rewind** — Timelapse player with play/pause/scrub/speed controls
- **Memos** — Voice memo list with audio player
- **Agents** — Create, edit, run, and monitor agents
- **Settings** — Model Hub (download/switch models with progress), Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage
Dark glassmorphism UI. No build step. Instant load.
- * *
🧪 Development
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-development) Run the test suite:
Fast (lightweight deps — same as CI, ~2 min install)
pip install -r requirements-test.txt pytest --cov=. --cov-report=term-missing -q
Full (includes ML models — sentence-transformers, easyocr)
pip install -r requirements.txt pip install pytest pytest-asyncio pytest-cov pytest --cov=. --cov-report=term-missing -q
CI runs automatically on push/PR via GitHub Actions using the lightweight deps.
- * *
🤝 Contributing
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-contributing) Contributions welcome! Here are some high-impact areas:
- 🍎 **macOS/Linux testing** — platform adapters exist, need real hardware testing
- 🐳 **Docker container** — one-command setup
- 🧩 **Community agent registry** — share agents between users
- 🌐 **Browser extension** — richer URL/tab context
- 📤 **Export formats** — Markdown, CSV, JSON
- * *
⭐ Show Your Support
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-show-your-support) If you find ScreenMind useful, please consider:
- **⭐ Star this repo** — it helps others discover the project
- **🍴 Fork it** — build your own agents and features
- **🐛 Report issues** — help us improve
- **📣 Share it** — tell others about privacy-first AI

- * *
📝 License
[](https://github.com/ayushh0110/ScreenMind/blob/main/README.md#-license) MIT License — see LICENSE for details.
- * *
**Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies**
_Vision + Audio + Reasoning — all three modalities, one model, your machine._
Made with ❤️ by ayushh0110