ScreenMind/README.md at main · ayushh0110/ScreenMind

Source: https://github.com/ayushh0110/ScreenMind/blob/main/README.md

![Image 1: ScreenMind](https://camo.githubusercontent.com/853212f5e82c3a9cb6462c61e4761aa9e91f9b57c8d39e6375ae61ea2c8bd8ff/68747470733a2f2f696d672e736869656c64732e696f2f62616467652ff09fa7a05f53637265656e4d696e642d596f75725f41495f4d656d6f72792d3842354346363f7374796c653d666f722d7468652d6261646765266c6162656c436f6c6f723d306130653161)

**Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory**

**100% local. 100% private. Zero cloud dependencies.**

[CI](https://github.com/ayushh0110/ScreenMind/actions/workflows/ci.yml) · [codecov](https://codecov.io/gh/ayushh0110/ScreenMind) · [Python 3.10+](https://python.org/) · [Gemma 4 E2B](https://ai.google.dev/gemma) · [llama.cpp](https://github.com/ggerganov/llama.cpp) · [License MIT](https://github.com/ayushh0110/ScreenMind/blob/main/LICENSE) · [MCP Ready](https://github.com/ayushh0110/ScreenMind/blob/main/MCP_SETUP.md)

**Features** · **Comparison** · **Gemma 4 Deep Dive** · **Quick Start** · **Architecture** · **Agent Platform** · **MCP** · **API**

![Image 9: Timeline — AI-analyzed screen activity feed](https://github.com/ayushh0110/ScreenMind/blob/main/docs/screenshots/timeline.png)

| Agents |
| --- |
| ![Image 10: Agents](https://github.com/ayushh0110/ScreenMind/blob/main/docs/screenshots/agents.png) |

**💬 Chat in Action** — _Ask anything about your screen history_

![Image 11: Chat Demo — conversational AI with screen memory](https://github.com/ayushh0110/ScreenMind/blob/main/docs/screenshots/chat-demo.gif)

> **Microsoft showed the world wants screen-aware AI with Recall.** But Recall initially stored data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.
>
> It's not just a screen recorder. It's an **AI memory** you can talk to, search through, and build automations on top of.

---

## ✨ Features

### 🧠 Core Intelligence

- **📸 Smart Capture** — Content-change detection, not a fixed timer. Captures when your screen _actually_ changes.

- **🔬 Gemma 4 Vision Analysis** — Every screenshot analyzed: app detection, activity categorization, mood, scene description, spatial layout regions.

- **🔍 Hybrid Search** — Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by _meaning_, not just keywords.

- **💬 Chat with Memory** — Conversational RAG with follow-up support. Ask "what did Alex say on Discord?" → get the actual message.

- **🧠 Model Hub** — In-app model download with live progress tracking. Chat and Summary are locked with witty brain animations until the model is ready — then auto-unlock. No terminal needed.

- **🎙️ Voice Memos** — Hold `Ctrl+Shift+V` → Gemma 4's native audio encoder transcribes. Screenshot captured alongside.

- **🎤 Meeting Transcription** — Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.

- **📊 Analytics Dashboard** — Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.

- **⏪ Day Rewind** — Timelapse playback of your entire day with play/pause/scrub/speed controls.
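The Hybrid Search bullet above merges a semantic ranking with an FTS5 keyword ranking. This README does not say how the two result lists are fused, so the sketch below uses reciprocal rank fusion (RRF) purely as an illustrative assumption. The function `rrf_fuse`, the constant `k=60`, and the activity IDs are hypothetical and not taken from the ScreenMind codebase.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher total score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a stable result.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical activity IDs returned by each search path.
semantic_hits = [42, 7, 19]   # nearest neighbours by embedding similarity
keyword_hits = [7, 99, 42]    # keyword matches ordered by relevance

fused = rrf_fuse([semantic_hits, keyword_hits])
print(fused)  # [7, 42, 99, 19]
```

IDs that appear in both lists (7 and 42) rise above IDs found by only one path, which is the behaviour a hybrid search wants regardless of the exact fusion formula.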

### ⚡ Performance

- **Three Analysis Modes** — Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.

- **Per-App pHash Cache** — 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls.

- **Chat-First GPU Priority** — Chat cancels in-flight analysis instantly. GPU freed in <1s.

- **Auto-Pause Heavy Apps** — Games, video editors, 3D software detected → capture pauses automatically.
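To make "app-aware staleness" concrete, here is a minimal sketch of a cache whose entries expire sooner for communication apps than for IDEs. It covers only the staleness idea, not the three tiers, and every name and number in it (`AppAwareCache`, the TTL values, the category labels) is an assumption for illustration.

```python
import time

# Hypothetical staleness windows in seconds; the real values are not documented here.
TTL_BY_CATEGORY = {"communication": 30.0, "ide": 300.0}
DEFAULT_TTL = 120.0


class AppAwareCache:
    """Cache analysis results per (app, perceptual hash) with per-category TTLs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (app, phash) -> (stored_at, result)

    def put(self, app, phash, result):
        self._entries[(app, phash)] = (self._clock(), result)

    def get(self, app, phash, category):
        """Return the cached result, or None if it is missing or stale."""
        entry = self._entries.get((app, phash))
        if entry is None:
            return None
        stored_at, result = entry
        ttl = TTL_BY_CATEGORY.get(category, DEFAULT_TTL)
        if self._clock() - stored_at > ttl:
            del self._entries[(app, phash)]  # expired: force a fresh inference
            return None
        return result


# Drive the cache with a fake clock so the example is deterministic.
now = [0.0]
cache = AppAwareCache(clock=lambda: now[0])
cache.put("Slack", 0xABCD, {"summary": "chat"})
cache.put("VS Code", 0x1234, {"summary": "editing"})

now[0] = 60.0  # one minute later
slack_hit = cache.get("Slack", 0xABCD, "communication")  # stale after 30 s
code_hit = cache.get("VS Code", 0x1234, "ide")           # still fresh
print(slack_hit, code_hit)
```

Every cache hit is one inference call saved, which is where the "significantly fewer inference calls" claim would come from.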

### 🔒 Privacy & Security

- **100% Local** — All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.

- **Sensitive Data Filter** — Auto-redacts credit cards, SSNs, API keys, passwords before storage.

- **Encryption at Rest** — AES encryption for screenshots (Fernet + OS keyring).

- **Dashboard PIN Lock** — Session-based auth with configurable auto-lock timeout.

- **Incognito Mode** — One-click pause. Nothing recorded.
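The Sensitive Data Filter is described elsewhere in this README as transparent regex with Luhn-validated card numbers. The sketch below shows that general technique for cards and SSNs only. The patterns, placeholder strings, and function names are assumptions, not ScreenMind's actual rules.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# US SSN written as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace Luhn-valid card numbers and SSN-shaped strings with placeholders."""
    def replace_card(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED-CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(replace_card, text)
    return SSN_RE.sub("[REDACTED-SSN]", text)


sample = "card 4111 1111 1111 1111, order 1234 5678 9012 3456, ssn 123-45-6789"
print(redact(sample))
# card [REDACTED-CARD], order 1234 5678 9012 3456, ssn [REDACTED-SSN]
```

The Luhn check is what keeps the 16-digit order number intact: it has the shape of a card number but fails the checksum, so it is not redacted.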

**🔌 Integrations & Extensibility**

| Integration | Description |
| --- | --- |
| 🤖 **Agent Platform** | Build automations in Markdown (English) or Python. Drop a file, get an agent. |
| 🔌 **MCP Server** | Expose screen history to Claude Desktop, Cursor, VS Code |
| 📓 **Obsidian** | Auto-sync daily summaries to your vault |
| 📋 **Notion** | Push summaries to a Notion database |
| 🪝 **Webhooks** | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |
| 🔔 **Smart Notifications** | Distraction alerts, break reminders |
| ⭐ **Auto-Bookmark** | Keyword triggers (`git push`, `deploy`) auto-flag important moments |
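For readers who want to verify webhook deliveries, this is a minimal sketch of HMAC signing and verification using HMAC-SHA256 from the standard library. The hash algorithm, the JSON serialization, the event shape, and the way the signature is transported are all assumptions. Check the webhook settings for the actual scheme.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 signature)."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature over the raw body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


secret = b"example-shared-secret"                         # hypothetical secret
event = {"event": "bookmark.created", "activity_id": 42}  # hypothetical event shape
body, signature = sign_payload(secret, event)

print(verify_signature(secret, body, signature))          # True
print(verify_signature(b"wrong-secret", body, signature)) # False
```

A receiver should always verify against the raw request bytes rather than a re-serialized copy, since any change in key order or whitespace changes the digest.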

### ⌨️ System-Wide Hotkeys

| Hotkey | Action |
| --- | --- |
| `Ctrl+Shift+B` | 📸 Instant bookmarked capture |
| `Ctrl+Shift+P` | ⏸ Toggle pause/resume |
| `Ctrl+Shift+V` | 🎤 Hold to record voice memo |

> All hotkeys customizable from Settings.

---

## 📊 How ScreenMind Compares

| Feature | **ScreenMind** | **Screenpipe** | **Microsoft Recall** |
| --- | --- | --- | --- |
| **License** | ✅ MIT (fully open-source) | Source-available (commercial license required for business use) | Proprietary |
| **Cost** | ✅ Free forever | Free (personal) / Paid (commercial) | Requires $1000+ Copilot+ PC |
| **Privacy** | ✅ Zero network calls. Zero telemetry. Ever. | Local-first, optional cloud | Telemetry opt-in. Data stayed local after backlash. |
| **Min. hardware** | ✅ Any GPU ≥4GB VRAM (or CPU-only) | 8GB RAM, modern CPU | 40 TOPS NPU + 16GB RAM + BitLocker + Windows Hello |
| **AI architecture** | ✅ Single model — Gemma 4 does vision + audio + reasoning | Multiple models — OCR + Whisper + external LLM | Proprietary NPU model |
| **Audio/meetings** | ✅ Native — Gemma 4 audio encoder (no Whisper needed) | Whisper-based transcription | ❌ Not supported |
| **Smart capture** | ✅ pHash deduplication + idle detection + auto-pause for games | Event-driven (app switches, clicks) | Periodic snapshots |
| **Search** | ✅ Semantic (MiniLM embeddings) + FTS5 keyword — hybrid fusion | Semantic + keyword + a11y tree | Semantic only (NPU) |
| **Chat with memory** | ✅ Full conversational RAG with follow-ups and vision fallback | ❌ | ❌ |
| **Agent system** | ✅ No-code Markdown agents + Python SDK + MCP server | Pipes (TypeScript) + MCP | ❌ |
| **In-app Model Hub** | ✅ Download, switch, manage models from UI — no terminal | ❌ | ❌ |
| **Encryption** | ✅ AES (Fernet) + OS keyring | Optional | TPM + BitLocker |
| **PII auto-redaction** | ✅ Transparent regex — CC (Luhn-validated), SSN, API keys, passwords | AI-based PII model | Content filtering |
| **Integrations** | ✅ Obsidian · Notion · Webhooks · MCP | MCP, SDK (Tauri/Electron/Swift) | Windows ecosystem only |
| **Platform** | ✅ Windows · macOS · Linux (X11 + Wayland) | Windows · macOS · Linux | Windows 11 only (Copilot+ PCs) |

> **TL;DR:** ScreenMind is the only option that's fully MIT open-source, runs on any hardware (including a $150 GPU), handles vision + audio + reasoning with a single local model, and lets you actually _chat_ with your screen memory.

---

## 🧠 How Gemma 4 Is Used

Gemma 4 E2B is not a bolt-on — it's architecturally load-bearing. ScreenMind uses **all three modalities**:

### 1. Vision — Screenshot Analysis

Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:

- App name, activity category, summary, detailed context

- Mood classification, confidence score

- Rich scene description (every visible element inventoried)

- Layout regions (sidebar, chat area, toolbar boundaries)

**Three modes** _(benchmarked on GTX 1650 4GB — scales dramatically with better GPUs):_

- **Accurate** — single call with thinking (~76s). Best layout detection.

- **Balanced** — thinking enabled, analysis-only (~40s). Richer descriptions than Fast.

- **Fast** — no-thinking prefill trick (~12s). Layout via OCR clustering instead.

**⚡ GPU Scaling — How fast on your hardware?**

The numbers above are from a **GTX 1650 (4GB VRAM)** — a worst-case scenario where the model spills to CPU RAM. With more VRAM, the entire model fits on GPU and inference speeds up dramatically:

| GPU | VRAM | Bandwidth | Regime | ~Fast Mode | Why |
| --- | --- | --- | --- | --- | --- |
| **GTX 1650** _(baseline)_ | 4 GB | ~190 GB/s | spilling | ~12s | CPU-bottlenecked, partial offload |
| **RTX 3060** | 12 GB | ~360 GB/s | full fit | ~3-4s | Spill eliminated — the big jump |
| **RTX 4060 Ti** | 16 GB | ~290 GB/s | full fit | ~2-3s | Fits easily, more compute for vision |
| **RTX 3090** | 24 GB | ~935 GB/s | full fit | ~1-2s | High bandwidth |
| **RTX 4090** | 24 GB | ~1000 GB/s | full fit | ~1s | Top consumer card |

> **Key insight:** The biggest jump is from "spilling" (model doesn't fit in VRAM) to "full fit" (it does). Any GPU with ≥6GB VRAM should run E2B entirely on GPU and see 3-5x speedup over the baseline.

### 2. Audio — Voice Memos & Meeting Transcription

Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:

- Voice memo transcription (hold hotkey → speak → release)

- Meeting transcription (15s chunks, map-reduce summarization for long meetings)

No Whisper dependency. One model handles everything.
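The meeting pipeline above (15s chunks, then map-reduce summarization) can be sketched in a few lines. This is an illustration under stated assumptions: `split_into_chunks`, `map_reduce_summary`, the `group_size` of 4, and the stand-in `fake_summarize` callable are hypothetical, and a real run would call the model where the stand-in is used.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=15):
    """Split a sequence of audio samples into consecutive fixed-length chunks."""
    chunk_size = sample_rate * chunk_seconds
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def map_reduce_summary(transcripts, summarize, group_size=4):
    """Summarize groups of transcripts (map), then summarize the summaries (reduce).

    group_size must be at least 2 so each pass shrinks the list.
    """
    if not transcripts:
        return ""
    partial = [
        summarize(" ".join(transcripts[i:i + group_size]))
        for i in range(0, len(transcripts), group_size)
    ]
    if len(partial) == 1:
        return partial[0]
    return map_reduce_summary(partial, summarize, group_size)


samples = [0] * 400  # 40 s of silence at a toy 10 Hz sample rate
chunks = split_into_chunks(samples, sample_rate=10)
print([len(chunk) for chunk in chunks])  # [150, 150, 100]


def fake_summarize(text):
    """Stand-in for a model call: keep only the first three words."""
    return " ".join(text.split()[:3])


transcripts = [f"chunk {n} text" for n in range(10)]
print(map_reduce_summary(transcripts, fake_summarize))  # chunk 0 text
```

Summarizing in groups keeps each model call within a bounded input size, which is why long meetings need the reduce step at all.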

### 3. Reasoning — Summaries, Chat, Agents

- **Daily summaries** with deep reasoning (`think=True`)

- **Chat answers** grounded in actual screen data (text-first RAG with vision fallback)

- **Agent execution** — Gemma processes markdown agent prompts with injected screen data

### Why E2B Specifically?

| Constraint | Why It Rules Out Alternatives |
| --- | --- |
| Must run **continuously in background** | Rules out 12B+ models (too heavy) |
| Must understand **screenshots natively** | Rules out text-only models |
| Must stay **100% local** for privacy | Rules out cloud APIs |
| Must handle **audio natively** | Rules out models without audio encoder |
| Must be **fast enough** for 30s cycle | E2B: 12-76s on GTX 1650, ~1-4s on RTX 3060+ |

Gemma 4 E2B is the only model that checks all five boxes.

---

## 🚀 Quick Start

> **Requirements:** Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model

#### 1️⃣ Clone & Install

```bash
git clone https://github.com/ayushh0110/ScreenMind.git
cd ScreenMind

python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

pip install -r requirements.txt
```

#### 2️⃣ Run

```bash
python main.py
```

#### 3️⃣ Open → **http://127.0.0.1:7777**

On first run, ScreenMind will:

- Auto-detect your GPU and download `llama-server` if not found (CUDA/CPU auto-selected)

- Open the **Model Hub** — download Gemma 4 E2B GGUF (~5GB) with progress tracking right in the UI

- Chat and Summary stay locked (🧠💤 _"I need my brain to think!"_) until the model is ready, then auto-unlock

- Start `llama-server` in background

- Show the welcome screen to set up an optional PIN

- Create `~/.screenmind/` for data storage

**⚙️ Optional: Configure via .env**

```bash
cp .env.example .env
# Edit capture interval, blocked apps, hotkeys, etc.
```

Or configure everything from the **Settings** tab in the dashboard.

---

## 🏗️ Architecture

> For a full deep-dive into threading, caching, search internals, and the privacy pipeline, see **ARCHITECTURE.md**.

```
┌─────────────────────────────────────────────────────────────────────┐
│                             ScreenMind                              │
│                                                                     │
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────────────┐  │
│  │ Capture    │───▶│ Async Queue  │───▶│     Analysis Worker     │  │
│  │ Worker     │    │ (max: 100)   │    │                         │  │
│  │            │    └──────────────┘    │  ┌───────────────────┐  │  │
│  │ • Screen   │                        │  │ Per-App pHash     │  │  │
│  │ • Window   │                        │  │ Cache (3-tier)    │  │  │
│  │ • Dedup    │                        │  └───────────────────┘  │  │
│  │ • A11y     │                        │            │            │  │
│  │ • Privacy  │                        │            ▼            │  │
│  └────────────┘                        │  ┌───────────────────┐  │  │
│                                        │  │ EasyOCR           │  │  │
│  ┌────────────┐                        │  │ (text extract)    │  │  │
│  │ Audio      │                        │  └───────────────────┘  │  │
│  │ Worker     │                        │            │            │  │
│  │            │                        │            ▼            │  │
│  │ • Meeting  │                        │  ┌───────────────────┐  │  │
│  │   detect   │                        │  │ Gemma 4 E2B       │  │  │
│  │ • Record   │                        │  │ (via llama.cpp)   │  │  │
│  │ • Transcr. │                        │  │ Vision + Audio    │  │  │
│  └────────────┘                        │  └───────────────────┘  │  │
│                                        │            │            │  │
│  ┌────────────┐                        │            ▼            │  │
│  │ Agent      │                        │  ┌───────────────────┐  │  │
│  │ Scheduler  │                        │  │ Layout Analyzer   │  │  │
│  │            │                        │  │ (spatial OCR)     │  │  │
│  │ • .md AI   │                        │  └───────────────────┘  │  │
│  │ • .py code │                        │            │            │  │
│  └────────────┘                        │            ▼            │  │
│                                        │  ┌───────────────────┐  │  │
│                                        │  │ MiniLM-L6-v2      │  │  │
│                                        │  │ (embeddings)      │  │  │
│                                        │  └───────────────────┘  │  │
│                                        └─────────────────────────┘  │
│                                                     │               │
│                                                     ▼               │
│                                           ┌───────────────────┐     │
│                                           │ SQLite (WAL)      │     │
│                                           │ + FTS5 index      │     │
│                                           └─────────┬─────────┘     │
│                                                     │               │
│  ┌──────────────────────────────────────────────────┘               │
│  │                                                                  │
│  ▼                                                                  │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                      FastAPI REST Server                      │  │
│  │     /timeline · /search · /chat · /stats · /agents · /mcp     │  │
│  │                                                               │  │
│  │ ┌───────────────────────────────────────────────────────────┐ │  │
│  │ │              Web Dashboard (Vanilla JS SPA)               │ │  │
│  │ │ Timeline · Chat · Search · Analytics · Agents · Settings  │ │  │
│  │ └───────────────────────────────────────────────────────────┘ │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
```

### Multi-Model AI Pipeline

```
Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                            ↑
                 OCR text fed as context
                 (Gemma sees image + reads text)
```

Four components working in concert (three models plus SQLite's FTS5 full-text index), with Gemma 4 as the brain:

1. **EasyOCR** — extracts raw screen text
2. **Gemma 4 E2B** — understands what you're doing (vision + reasoning)
3. **MiniLM-L6-v2** — generates semantic vectors for natural language search
4. **FTS5** — indexes text for instant keyword search

---

## 🤖 Agent Platform

ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.

### Two Modes

| Mode | File Type | For | Example |
| --- | --- | --- | --- |
| 🤖 AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |
| 🐍 Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |

### Markdown Agent Example

```markdown
---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:

- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.
```

Drop this file in `~/.screenmind/agents/` — it runs automatically.
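To show how a Markdown agent file divides into settings and prompt, here is a minimal sketch of a frontmatter parser. It handles only flat `key: value` pairs and is not ScreenMind's own loader. `parse_agent_file` is a hypothetical helper, and treating `data` and `output` as comma-separated lists is an assumption based on the example above.

```python
def parse_agent_file(text: str):
    """Split a Markdown agent into (metadata dict, prompt body).

    Handles only flat `key: value` frontmatter between two `---` lines;
    comma-separated values for `data` and `output` become lists.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' frontmatter line")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is missing its closing '---' line") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    for list_key in ("data", "output"):
        if list_key in meta:
            meta[list_key] = [item.strip() for item in meta[list_key].split(",")]
    prompt = "\n".join(lines[end + 1:]).strip()
    return meta, prompt


example = """---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
"""

meta, prompt = parse_agent_file(example)
print(meta["schedule"], meta["data"])  # every 6h ['timeline', 'apps', 'mood']
```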

### Python Plugin SDK

```python
from screenmind_sdk import ScreenMindSDK

sdk = ScreenMindSDK("my-tracker")

# Get today's activities filtered by app
activities = sdk.get_activities(app="Chrome", limit=20)

# Persistent state across runs
last_count = sdk.load_state("url_count", 0)
urls = sdk.get_urls_visited()
sdk.save_state("url_count", len(urls))

# Ask Gemma (GPU-safe — waits for idle)
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
print(insight)
```

### Data Selectors (Frontmatter)

Markdown agents declare what data they need:

| Selector | Injects |
| --- | --- |
| `timeline` | Recent activities with timestamps, apps, summaries |
| `apps` | App usage counts + category breakdown |
| `urls` | URLs visited (extracted from browser address bars) |
| `meetings` | Meeting summaries and durations |
| `mood` | Mood/sentiment from screen analysis |

Data injection auto-scales to your model's context window.
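One way such scaling could work is to keep only the newest entries that fit a token budget. The sketch below is an assumption about the general approach, not the actual implementation: `fit_to_budget` is hypothetical, and the 4-characters-per-token estimate is a crude heuristic rather than a tokenizer.

```python
def fit_to_budget(items, max_tokens, chars_per_token=4):
    """Keep the newest items whose combined size fits a rough token budget.

    `items` is ordered oldest to newest. Token counts are estimated as
    len(text) / chars_per_token.
    """
    budget_chars = max_tokens * chars_per_token
    kept = []
    used = 0
    for text in reversed(items):  # walk from newest to oldest
        if used + len(text) > budget_chars:
            break
        kept.append(text)
        used += len(text)
    kept.reverse()  # restore chronological order
    return kept


timeline = ["09:00 VS Code: editing", "09:40 Chrome: docs", "10:20 Slack: standup"]
print(fit_to_budget(timeline, max_tokens=10))
# ['09:40 Chrome: docs', '10:20 Slack: standup']
```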

### 4 Agents Ship Built-In

- **daily-journal.md** — First-person journal entry from your day

- **focus-report.md** — Focus score, deep work hours, distractions

- **meeting-actions.md** — Extract action items from meeting transcripts

- **code-changelog.md** — Summarize coding activity (commits, files, repos)

---

## 🔌 MCP Server (Claude / Cursor / VS Code)

ScreenMind exposes your screen history to any MCP-compatible AI tool:

```bash
python mcp_server.py  # stdio transport
```

**Claude Desktop config** (`%APPDATA%\Claude\claude_desktop_config.json` on Windows, `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):

```json
{
  "mcpServers": {
    "screenmind": {
      "command": "python",
      "args": ["C:/path/to/screenmind/mcp_server.py"]
    }
  }
}
```

### Tools Available

| Tool | Description |
| --- | --- |
| `search_screen` | Semantic + keyword search across all history |
| `get_recent_activity` | Last N activities with full details |
| `get_activity_by_time` | Activities for a specific date/time range |
| `get_daily_summary` | AI-generated daily summary |
| `capture_now` | Trigger instant screenshot |
| `get_stats` | Usage statistics |
| `search_audio` | Search meeting transcripts |
| `get_screenshot` | Retrieve screenshot path by activity ID |

---

## 📡 API Reference

Full Swagger docs at `http://127.0.0.1:7777/docs`

### Key Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/api/status` | System health, worker stats |
| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |
| `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search |
| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |
| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |
| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |
| `POST` | `/api/summary/generate` | Generate AI daily summary |
| `GET` | `/api/agents` | List all agents |
| `POST` | `/api/agents/{name}/run` | Trigger agent execution |
| `POST` | `/api/capture/pause` | Pause capture |
| `POST` | `/api/incognito/toggle` | Toggle incognito mode |
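When calling these endpoints from a script, query values such as `debugging auth` need percent-encoding. The sketch below builds URLs with the standard library and extracts `data:` lines from a server-sent event stream, which is how any SSE response is framed. The helper names are hypothetical, and the shape of the JSON responses and chat request body is not covered here, so consult the Swagger docs for those.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:7777"


def api_url(path: str, **params) -> str:
    """Build a dashboard API URL, percent-encoding query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def parse_sse_data(raw_lines):
    """Yield the payload of each `data:` line from a server-sent event stream."""
    for line in raw_lines:
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


search_url = api_url("/api/search", q="debugging auth")
timeline_url = api_url("/api/timeline", date="2026-05-21")
status_url = api_url("/api/status")

print(search_url)    # http://127.0.0.1:7777/api/search?q=debugging+auth
print(timeline_url)  # http://127.0.0.1:7777/api/timeline?date=2026-05-21
print(list(parse_sse_data(["data: hello", "", "data: world"])))  # ['hello', 'world']
```

With ScreenMind running, `urllib.request.urlopen(search_url)` would fetch the results. If a PIN is configured, requests may also need a session, which this sketch does not handle.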

---

## ⚙️ Configuration

All settings configurable via `.env`, environment variables, or the **Settings** dashboard (persists to `settings.json`).

| Variable | Default | Description |
| --- | --- | --- |
| `CAPTURE_INTERVAL` | `40` | Seconds between captures |
| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |
| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |
| `BLOCKED_APPS` | _(empty)_ | Comma-separated apps to never capture |
| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |
| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |
| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |
| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |
| `SCREENMIND_LOG_LEVEL` | `INFO` | Log verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `SCREENMIND_LOG_FILE` | _(none)_ | Path to a log file (rotating, 10MB × 3 backups) |

> See `.env.example` for the full list.
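As a rough picture of how the variables in the table above map to typed settings, here is a stdlib-only sketch. The project itself uses Pydantic settings in `config.py`, so this is a simplified stand-in. The `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool, env) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    capture_interval: int
    analysis_mode: str
    retention_days: int
    encryption_enabled: bool
    blocked_apps: tuple


def load_settings(env=os.environ) -> Settings:
    """Read a few documented variables, falling back to the documented defaults."""
    blocked = env.get("BLOCKED_APPS", "")
    return Settings(
        capture_interval=int(env.get("CAPTURE_INTERVAL", "40")),
        analysis_mode=env.get("ANALYSIS_MODE", "merged"),
        retention_days=int(env.get("RETENTION_DAYS", "7")),
        encryption_enabled=_env_bool("ENCRYPTION_ENABLED", False, env),
        blocked_apps=tuple(app.strip() for app in blocked.split(",") if app.strip()),
    )


defaults = load_settings(env={})
custom = load_settings(env={"CAPTURE_INTERVAL": "20", "BLOCKED_APPS": "1Password, Signal"})
print(defaults.capture_interval, custom.capture_interval, custom.blocked_apps)
```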

---

## 🔧 Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| **Vision + Audio AI** | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |
| **Inference Server** | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API |
| **OCR** | EasyOCR | Extracts screen text fed to Gemma as context |
| **Embeddings** | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |
| **Backend** | FastAPI + Uvicorn | Async-first, auto-generated API docs |
| **Database** | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |
| **Capture** | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |
| **Wayland Capture** | grim (wlroots) / XDG Portal | Automatic fallback; no X11 dependency on Wayland |
| **Frontend** | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |
| **Platform** | Windows / macOS / Linux (X11 + Wayland) | Abstraction layer with OS-specific adapters |

---

## 🐧 Wayland Support

ScreenMind auto-detects Wayland sessions and uses compositor-native capture:

| Compositor | Capture | Window Detection | Notes |
| --- | --- | --- | --- |
| **Sway** | ✅ grim | ✅ swaymsg IPC | Full support |
| **Hyprland** | ✅ grim | ✅ hyprctl IPC | Full support |
| **Niri** | ✅ grim | ✅ niri msg IPC | Full support |
| **river / Wayfire / labwc** | ✅ grim | ⚠️ Title only (no IPC) | Capture works, app name may be unavailable |
| **GNOME (Mutter)** | ⚠️ XDG Portal | ❌ No IPC available | Portal prompts on every capture — not viable for background recording |
| **KDE (KWin)** | ⚠️ XDG Portal | ❌ No IPC available | Same as GNOME |

**Install grim** (recommended for wlroots compositors):

```bash
# Arch
sudo pacman -S grim

# Ubuntu / Debian (if available)
sudo apt install grim

# Fedora
sudo dnf install grim
```

**GNOME / KDE Wayland**: Best-effort only. Screenshots use the XDG Desktop Portal, which prompts for permission on each capture — not viable for continuous background recording. For full functionality, use an X11 session or a wlroots-based compositor with grim.

**Optional** (for portal fallback): `python3-gi` / `python-gobject` system package.

---

## 📁 Project Structure

```
screenmind/
├── main.py                 # Entry point — starts all services
├── config.py               # Pydantic settings (env + runtime overrides)
├── setup_llama.py          # Auto-detect + install llama-server
├── requirements.txt        # Full Python dependencies
├── requirements-test.txt   # Lightweight CI deps (no PyTorch)
├── mcp_server.py           # MCP server for Claude/Cursor/VS Code
├── screenmind_sdk.py       # SDK for Python plugin agents
│
├── capture/                # Screenshot capture layer
│   ├── screen.py           # Capture facade (mss / Wayland backend)
│   ├── wayland.py          # Wayland backend (grim / XDG Portal)
│   ├── window.py           # Active window detection
│   ├── dedup.py            # Perceptual hash deduplication
│   ├── hotkey.py           # Global hotkeys (bookmark, pause, voice)
│   └── voice_recorder.py   # Mic recording for voice memos
│
├── engine/                 # AI & intelligence layer
│   ├── analyzer.py         # Gemma 4 vision analysis (dual mode)
│   ├── llm_client.py       # llama-server client (chat, vision, audio)
│   ├── model_manager.py    # Server lifecycle, model download/switch
│   ├── embedder.py         # MiniLM semantic embeddings
│   ├── ocr.py              # EasyOCR text extraction
│   ├── layout_analyzer.py  # Spatial OCR organization
│   ├── dev_context.py      # Git repo/branch/diff detection
│   ├── a11y_extractor.py   # Accessibility API text extraction
│   └── agent_runner.py     # Agent scheduling & execution
│
├── workers/                # Background processing
│   ├── capture_worker.py   # Smart capture loop + privacy filtering
│   ├── analysis_worker.py  # OCR → Gemma → Layout → Embed → Store
│   └── audio_worker.py     # Meeting detection & transcription
│
├── storage/                # Data persistence
│   ├── database.py         # SQLite + FTS5 + migrations
│   └── models.py           # Pydantic data models
│
├── privacy/                # Privacy & security
│   ├── encryption.py       # Fernet AES encryption at rest
│   └── data_filter.py      # Sensitive data redaction
│
├── platform_support/       # Cross-platform abstraction
│   ├── windows.py          # Win32 + UI Automation
│   ├── macos.py            # AppKit + AXUIElement
│   └── linux.py            # xdotool + AT-SPI
│
├── integrations/           # External connections
│   ├── obsidian.py         # Vault markdown export
│   ├── notion.py           # Notion API export
│   ├── webhooks.py         # HTTP webhooks (HMAC, retry)
│   └── smart_notify.py     # Distraction/break notifications
│
├── api/                    # REST API + dashboard
│   ├── server.py           # FastAPI app + auth middleware
│   ├── dependencies.py     # Shared state for routes
│   ├── routes/             # 16 route modules
│   └── static/             # Web dashboard (HTML + CSS + JS)
│
├── default_agents/         # 4 built-in agents
│   ├── daily-journal.md
│   ├── focus-report.md
│   ├── meeting-actions.md
│   └── code-changelog.md
│
├── tests/                  # pytest test suite (25 modules)
│   ├── conftest.py         # Shared fixtures
│   └── test_*.py           # Unit + integration tests
│
└── docs/
    └── BUILD_YOUR_OWN_AGENT.md
```

---

## 🛡️ Error Handling & Resilience

| Scenario | Behavior |
| --- | --- |
| **llama-server not found** | Auto-downloads correct binary from GitHub releases (CUDA/CPU auto-detected). Checks disk space first. |
| **Model not downloaded** | Model Hub shows lock screen with download cards. Progress tracked in UI. Chat/Summary locked until ready. |
| **GPU out of memory** | Detects OOM, retries with delay, re-queues on persistent failure. |
| **Duplicate frames** | pHash dedup skips identical screenshots (threshold: 8 hamming distance). |
| **Stale queue items** | Captures >3 min old auto-skipped. Backfilled during idle. |
| **App in blocklist** | Silently skips — no screenshot saved. |
| **Meeting app closed** | Process-alive check + silence detection + 5-min hard timeout. |
| **Chat during analysis** | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |
| **Crash recovery** | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |
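Two rows of the table above, duplicate frames and stale queue items, reduce to small checks that are easy to sketch. The constants come from this README (queue bound of 100, 3-minute staleness, Hamming threshold of 8). The function names, the tuple layout of queue items, and the choice of `<=` for the threshold comparison are assumptions.

```python
import queue
import time

MAX_QUEUE = 100        # queue bound shown in the architecture diagram
STALE_AFTER_S = 180    # captures older than 3 minutes are skipped
HAMMING_THRESHOLD = 8  # dedup threshold from the table above


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count differing bits between two integer perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_duplicate(hash_a: int, hash_b: int) -> bool:
    # Assumption: frames at or below the threshold count as duplicates.
    return hamming_distance(hash_a, hash_b) <= HAMMING_THRESHOLD


def next_fresh_capture(pending: queue.Queue, now=None):
    """Pop captures until one is fresh enough to analyze; return None if none is."""
    now = time.time() if now is None else now
    while True:
        try:
            captured_at, frame = pending.get_nowait()
        except queue.Empty:
            return None
        if now - captured_at <= STALE_AFTER_S:
            return frame
        # Stale item: dropped here. The table says such items are backfilled when idle.


pending = queue.Queue(maxsize=MAX_QUEUE)
pending.put((0.0, "old-frame"))       # captured 1000 s before "now"
pending.put((900.0, "recent-frame"))  # captured 100 s before "now"
fresh = next_fresh_capture(pending, now=1000.0)

print(fresh)                           # recent-frame
print(is_duplicate(0xFF, 0x00))        # True  (8 bits differ)
print(is_duplicate(0x1FF, 0x00))       # False (9 bits differ)
```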

---

## 🎨 Dashboard

The web dashboard at `http://127.0.0.1:7777` features:

- **Timeline** — Browse activities by date with thumbnails, AI summaries, category badges

- **Chat** — Conversational AI with screen memory. Ask anything about your history. Locked with 🧠💤 brain animation until model is ready.

- **Search** — Semantic + keyword hybrid search with OCR highlighting on screenshots

- **Analytics** — Category charts, top apps, hourly heatmap, meeting stats

- **Rewind** — Timelapse player with play/pause/scrub/speed controls

- **Memos** — Voice memo list with audio player

- **Agents** — Create, edit, run, and monitor agents

- **Settings** — Model Hub (download/switch models with progress), Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage

Dark glassmorphism UI. No build step. Instant load.

---

## 🧪 Development

Run the test suite:

```bash
# Fast (lightweight deps — same as CI, ~2 min install)
pip install -r requirements-test.txt
pytest --cov=. --cov-report=term-missing -q

# Full (includes ML models — sentence-transformers, easyocr)
pip install -r requirements.txt
pip install pytest pytest-asyncio pytest-cov
pytest --cov=. --cov-report=term-missing -q
```

CI runs automatically on push/PR via GitHub Actions using the lightweight deps.

---

## 🤝 Contributing

Contributions welcome! Here are some high-impact areas:

- 🍎 **macOS/Linux testing** — platform adapters exist, need real hardware testing

- 🐳 **Docker container** — one-command setup

- 🧩 **Community agent registry** — share agents between users

- 🌐 **Browser extension** — richer URL/tab context

- 📤 **Export formats** — Markdown, CSV, JSON

---

## ⭐ Show Your Support

If you find ScreenMind useful, please consider:

- **⭐ Star this repo** — it helps others discover the project

- **🍴 Fork it** — build your own agents and features

- **🐛 Report issues** — help us improve

- **📣 Share it** — tell others about privacy-first AI

[Stars](https://github.com/ayushh0110/ScreenMind/stargazers) · [Forks](https://github.com/ayushh0110/ScreenMind/network/members)

---

## 📝 License

MIT License — see LICENSE for details.

---

**Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies**

_Vision + Audio + Reasoning — all three modalities, one model, your machine._

Made with ❤️ by ayushh0110