1. System Overview: End-to-End Flow
The Evra backend operates as a closed-loop clinical intelligence system. It ingests multi-modal data, synthesizes it into actionable context, creates structured plans, and learns from user interaction.
Figure 1: End-to-End Data Flow Architecture
1. Ingestion (Inputs)
The system aggregates three distinct data streams:
- Direct Interaction: User chat messages and goal definitions (chat_service.py).
- Device Streams: High-frequency vitals (Apple Health/Google Fit) and glucose readings via the Dexcom API.
- Documents: PDF Lab Reports processed via OCR and LLM extraction (lab_report_service.py).
2. Insight Engine (Synthesis)
Data is normalized in parallel to create a "Patient Snapshot":
- Vector Store: Unstructured text is embedded for semantic search.
- Health Analysis: Vitals are aggregated into daily snapshots; violations trigger alerts.
- Personalization: A fine-tuned model generates a clinical profile (Metabolic type, Lifestyle pillars) based on aggregated history.
3. Coaching (Agentic Decision)
The LangGraph Controller (chat_service.py) orchestrates the response:
- Intent Classification: Determines if the query is health-related.
- Context Retrieval: Fetches memories, lab summaries, and vitals simultaneously.
- Generation: Produces a clinically grounded response using the synthesized context.
4. Agentic Action (Execution)
Insights are converted into concrete database records via the Goal Planner: the goals_service translates user intent into structured ActionItems anchored to specific calendar dates.
5. Feedback Loop
- Memory: The memory_service (Mem0) extracts facts from conversations.
- Scoring: Completed actions trigger a Health Score recalculation, refining the Personalization Profile for the next cycle.
2. Insights Layer: Longitudinal Signals
The Insights Layer transforms raw, high-frequency data streams into coherent health narratives by separating Snapshots (daily state) from Trends (longitudinal evolution) using deterministic logic before AI interpretation.
Figure 2: Signal Processing and Insight Generation
1. Signal Processing Strategy
- Wearables (Hourly to Daily): The health_alert_service ingests raw hourly arrays (Steps, HR, Glucose). It immediately executes _calculate_aggregated_summary to compute a deterministic Snapshot (Min/Max/Avg) for that specific local_date. This compresses noisy streams into a canonical daily record (see the sketch after this list).
- Labs (Static to Structured): PDF reports are parsed via OCR/LLM into structured JSON (LabReportPropertyExtracted). These are stored sequentially, allowing the system to compare specific markers (e.g., HbA1c) across different upload dates.
- History (Trend Construction): When a query requires context, chat_service invokes _format_historical_health_data_for_llm. This fetches the last N days of Snapshots and formats them into a text-based timeline (e.g., "Jan 1: HR 70 | Jan 2: HR 72"), creating a Trend view for the LLM.
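As a rough illustration of the snapshot step, a minimal stand-in for _calculate_aggregated_summary might look like this; the field names and rounding are assumptions, not the production logic.

```python
from statistics import mean

def calculate_aggregated_summary(hourly_data: dict[str, list[float]]) -> dict:
    """Compress raw hourly arrays into a deterministic daily snapshot (min/max/avg)."""
    summary = {}
    for metric, values in hourly_data.items():
        points = [v for v in values if v is not None]
        if not points:
            continue
        summary[metric] = {
            "min": min(points),
            "max": max(points),
            "avg": round(mean(points), 1),
            "count": len(points),
        }
    return summary

# Example: one user-day of heart-rate and step readings
snapshot = calculate_aggregated_summary({
    "heart_rate": [62, 71, 118, 95],
    "steps": [0, 1200, 3400, 2800],
})
# {'heart_rate': {'min': 62, 'max': 118, 'avg': 86.5, 'count': 4}, ...}
```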
2. Preventing Drift & Contradictions
- Deterministic Guardrails: The system runs _check_threshold_violations (Python logic) against hardcoded medical ranges (e.g., HR > 120) before the AI is involved. An alert is only generated if the code validates the violation (see the sketch after this list).
- Unified Context Injection: The ChatService gathers all data sources (Intake, Labs, Vitals, History) into a single ChatState object. The LLM generates the response from this single source of truth, preventing contradictory outputs caused by fragmented context.
- Mathematical Scoring: The HealthScoreService calculates health status using a weighted formula (Labs + Streak + Vitals), ensuring the "Health Score" is a mathematically stable metric rather than an LLM opinion.
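A minimal sketch of the code-before-AI guardrail; the threshold values and dictionary shape are illustrative, not the actual ranges in health_alert_service.

```python
# Hardcoded medical ranges (illustrative values, not the production thresholds)
THRESHOLDS = {
    "heart_rate": {"max": 120, "min": 40},
    "glucose":    {"max": 250, "min": 54},
}

def check_threshold_violations(summary: dict) -> list[dict]:
    """Pure-Python check that runs before any LLM call; only code-validated
    violations are allowed to produce an alert."""
    violations = []
    for metric, limits in THRESHOLDS.items():
        stats = summary.get(metric)
        if not stats:
            continue
        if stats["max"] > limits["max"]:
            violations.append({"metric": metric, "kind": "high", "value": stats["max"]})
        if stats["min"] < limits["min"]:
            violations.append({"metric": metric, "kind": "low", "value": stats["min"]})
    return violations
```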
3. State & Memory: Representation & Evolution
User state in Evra follows a Federated State Model: it is dynamically assembled from structured metrics, semantic memories, and cached profiles.
Figure 3: Federated State Model
1. User State Representation
- Hard State (Metrics): Stored in MongoDB (health_data). We use a Daily Bucket Pattern: a single document represents one user-day (local_date). It contains raw hourly_data arrays and a computed aggregated_summary, the canonical truth for that day (a minimal document sketch follows this list).
- Soft State (Context): Managed by Mem0. This acts as long-term memory, storing semantic facts extracted from chats (e.g., "User restricts sodium").
- Plan State (Intent): Represented by Goals and ActionItems. These are mutable records with specific calendar dates, validated against the user's current week.
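A sketch of a single Daily Bucket document; local_date, hourly_data, and aggregated_summary come from the design above, while the remaining field names and values are illustrative.

```python
daily_bucket = {
    "user_id": "u_123",
    "local_date": "2025-11-18",          # one document per user-day
    "hourly_data": {                      # raw points are appended as they arrive
        "heart_rate": [62, 71, 118, 95],
        "steps": [0, 1200, 3400, 2800],
    },
    "aggregated_summary": {               # recomputed on every write; canonical truth
        "heart_rate": {"min": 62, "max": 118, "avg": 86.5},
        "steps": {"total": 7400},
    },
}
```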
2. Persisted vs. Ephemeral
- Persisted: Clinical History (Daily buckets, Lab JSONs), Personalization Profile (cached LLM summary), and Conversation Facts (Mem0).
- Ephemeral: ChatState. The LangGraph state object (chat_service.py) exists only during a conversation turn. It holds the current query, retrieved vector chunks, and intermediate reasoning before being discarded.
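A minimal sketch of what the ephemeral ChatState could look like as a LangGraph-style TypedDict; any fields beyond the query, retrieved chunks, and intermediate results described above are assumptions.

```python
from typing import TypedDict

class ChatState(TypedDict, total=False):
    """Ephemeral per-turn state: built at the start of a conversation turn,
    passed between graph nodes, and discarded after the response."""
    user_id: str
    query: str
    intent: str                 # e.g. "HEALTH" | "NOT_HEALTH" | "SYMPTOM"
    vector_chunks: list[str]    # retrieved document snippets
    lab_summary: str
    vitals_summary: str
    memories: list[str]
    response: str
```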
3. Conflict Resolution & Updates
- Aggregation Overwrite: When new hourly health data arrives, the system appends the raw point but completely recalculates and overwrites the aggregated_summary field. This ensures the "Daily Snapshot" always reflects the total sum of data, resolving conflicts between partial updates (see the sketch after this list).
- Profile Regeneration: The system monitors data freshness. If significant new data (e.g., a new Lab Report) is detected, the cached profile is invalidated and regenerated, ensuring the AI's "mental model" never drifts from the raw data.
- Date Anchoring: To prevent historical plan drift, the GoalsService enforces strict date validation (_validate_and_fix_llm_response), anchoring all new action items to specific YYYY-MM-DD dates within the current week and explicitly overriding any relative dates generated by the LLM.
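A minimal sketch of the append-then-recompute write using pymongo; the database name, helper name, and summary shape are assumptions rather than the production write path.

```python
from statistics import mean
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
health_data = client["evra"]["health_data"]

def record_hourly_point(user_id: str, local_date: str, metric: str, value: float) -> None:
    """Append the raw point, then recompute and overwrite aggregated_summary so the
    daily snapshot always reflects the full day's data."""
    health_data.update_one(
        {"user_id": user_id, "local_date": local_date},
        {"$push": {f"hourly_data.{metric}": value}},
        upsert=True,
    )
    doc = health_data.find_one({"user_id": user_id, "local_date": local_date})
    summary = {
        m: {"min": min(vals), "max": max(vals), "avg": round(mean(vals), 1)}
        for m, vals in doc["hourly_data"].items() if vals
    }
    health_data.update_one({"_id": doc["_id"]}, {"$set": {"aggregated_summary": summary}})
```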
4. Coaching Layer: Logic & Goal Setting
The Coaching Layer translates clinical insights into sustainable behavioral changes. It strictly adheres to a non-clinical persona by decoupling medical analysis from action planning.
Figure 4: Behavior-First Coaching Architecture
1. Structure (Behavior-First)
- Persona Enforcement: The ChatService injects a strict system safety block into every prompt, mandating a supportive tone and explicitly forbidding diagnostic language.
- Context-Driven Framing: Recommendations are framed using the Personalization Profile (specifically "Lifestyle Pillars"). Advice is tailored to the user's metabolic type rather than generic medical guidelines.
- Preference Matching: The system respects user agency via PillarTimePreferences. If a user prefers "Movement" in the mornings, the coaching logic schedules physical activity specifically for that window.
2. Goal Setting & Adjustment
- Structured Generation: The GoalsService uses a specific LLM schema (ACTION_ITEM_SCHEMA) to convert high-level intents (e.g., "Lose weight") into concrete database records.
- Calendar Anchoring: The system retrieves exact dates for the current week (_get_current_week_dates) and forces the LLM to assign actions to valid YYYY-MM-DD slots, ensuring the plan is actionable immediately (see the sketch after this list).
- Adjustment: Adjustment happens via the Feedback Loop. If a user chats about difficulty with a goal, Mem0 records this constraint, updating the Personalization Profile so future plans automatically adjust intensity.
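A minimal sketch of the calendar lookup standing in for _get_current_week_dates; the signature and return shape are assumptions.

```python
from datetime import date, timedelta

def get_current_week_dates(today: date | None = None) -> dict[str, str]:
    """Map weekday names to concrete YYYY-MM-DD dates for the current week,
    so the LLM can only fill slots that actually exist on the calendar."""
    today = today or date.today()
    monday = today - timedelta(days=today.weekday())
    names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    return {name: (monday + timedelta(days=i)).isoformat() for i, name in enumerate(names)}

# get_current_week_dates(date(2025, 11, 18))
# {'Monday': '2025-11-17', 'Tuesday': '2025-11-18', ..., 'Sunday': '2025-11-23'}
```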
3. Avoiding Repetition & Over-Specificity
- Negative Constraints: When generating a new plan, the generate_goal_plan method fetches all existing action items and passes them to the LLM with a strict instruction: "Generate COMPLETELY DIFFERENT action items that do not duplicate ANY of these."
- Style Adaptation: The system utilizes the communicationStyle preference (Concise vs. Detailed) stored in the user profile to adjust the verbosity of the guidance, preventing over-specific lectures for users who prefer brevity.
5. Prompt Engineering: Modular Composition & Safety
Prompt engineering in Evra is treated as software architecture, not string concatenation. Prompts are dynamically assembled modules that enforce strict clinical boundaries and personalize the voice before the LLM receives the input.
Figure 5: Dynamic Prompt Assembly
1. High-Level Design: Dynamic Assembly
Prompts are constructed in layers at runtime (chat_service.py), ensuring every request contains the necessary context without exceeding token limits or losing focus.
- Base Layer (Persona): Defines the "Evra" identity—supportive, non-clinical, and grounded.
- Context Layer (Variable): Conditionally injects data. If the user has "High Risk" vitals, a specific High Risk Escalation block is injected. If they have a "Concise" communication style, a Brevity Instruction block is added.
- Safety Layer (Immutable): A hard-coded System Safety Block is prepended to every medical prompt. It explicitly forbids diagnostic language ("Do not provide diagnoses") and mandates referrals to professionals.
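A minimal sketch of this layered assembly; the block wording and builder helper are illustrative assumptions, not the actual prompts.py content.

```python
SAFETY_BLOCK = (
    "You are Evra, a supportive, non-clinical health coach. "
    "Do not provide diagnoses. Refer the user to a medical professional "
    "for diagnosis or treatment decisions."
)

def build_system_prompt(profile: dict, vitals_flags: dict) -> str:
    """Assemble the prompt in layers: immutable safety block, base persona,
    then conditional context blocks."""
    layers = [SAFETY_BLOCK, "Persona: warm, encouraging, grounded in the user's own data."]
    if vitals_flags.get("high_risk"):
        layers.append("HIGH RISK: advise the user to seek immediate medical attention.")
    if profile.get("communicationStyle") == "concise":
        layers.append("Brevity: respond in at most 2 sentences.")
    return "\n\n".join(layers)
```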
2. Scope Enforcement (The Guardrail)
We do not rely on the main LLM to "figure out" if it should answer.
- Pre-Flight Classification: Before the main RAG pipeline, a specialized, lightweight LLM call (_health_relevance_check_node) classifies the intent as HEALTH (Symptoms, Wellness) or NOT_HEALTH (Trivia, Politics); a minimal classifier sketch follows this list.
- Hard Diversion: If classified as NOT_HEALTH, the system bypasses the medical engine entirely and serves a polite refusal prompt, preventing the AI from hallucinating medical advice on non-medical topics.
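A minimal sketch of such a pre-flight classifier using the OpenAI SDK; the model choice, prompt wording, and function name are assumptions rather than the actual _health_relevance_check_node implementation.

```python
from openai import OpenAI

client = OpenAI()

def health_relevance_check(query: str) -> str:
    """Lightweight pre-flight classifier: returns 'HEALTH' or 'NOT_HEALTH'
    before any RAG or medical context is assembled."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # small, fast model; illustrative choice
        messages=[
            {"role": "system", "content": (
                "Classify the user message. Reply with exactly one word: "
                "HEALTH (symptoms, wellness, nutrition, sleep, fitness) "
                "or NOT_HEALTH (anything else)."
            )},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    label = completion.choices[0].message.content.strip().upper()
    return label if label in {"HEALTH", "NOT_HEALTH"} else "NOT_HEALTH"
```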
3. Tone & Safety Enforcement
- Tone Matching: The system reads the user's communicationStyle preference (stored in MongoDB). The prompt builder selects specific instruction sets (e.g., "Max 2 sentences" vs. "Comprehensive analysis") to match the user's cognitive load preference.
- Negative Constraints: To prevent generic advice, we use negative prompting (e.g., "Do not use closings like 'Best regards'", "Do not repeat existing action items").
- Structured Output: For logic-heavy tasks (Goal Setting, Alert Filtering), we force the LLM to output strictly defined JSON schemas (Pydantic models) rather than free text, eliminating "chatty" or ambiguous outputs.
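For instance, goal-setting output might be constrained with a Pydantic model along these lines; the fields shown are illustrative stand-ins for the real ACTION_ITEM_SCHEMA.

```python
from pydantic import BaseModel, Field, ValidationError

class ActionItem(BaseModel):
    """Illustrative stand-in for the ACTION_ITEM_SCHEMA: the LLM must return
    fields that validate against this model, or the output is rejected."""
    title: str
    pillar: str = Field(description="Lifestyle pillar, e.g. 'Movement' or 'Nutrition'")
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")  # concrete calendar date only
    duration_minutes: int = Field(ge=5, le=180)

def parse_action_item(raw_json: str) -> ActionItem | None:
    try:
        return ActionItem.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject chatty or ambiguous output instead of storing it
```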
6. Agentic Layer: Decision-Making & Execution
The Agentic Layer bridges the gap between knowing something (Insight) and doing something (Action). It uses a "Human-in-the-Loop" architecture where the AI proposes structured actions, but execution requires validation or explicit triggers.
Figure 6: Agentic Logic (Suggest -> Prepare -> Execute)
1. Translation Logic (Suggest -> Prepare -> Execute)
- Suggest (Chat): The ChatService generates actionable advice in conversation (e.g., "You should walk 10k steps"). This is unstructured and purely informational.
- Prepare (Goal Planning): When the user signals intent (e.g., "Create a plan"), the GoalsService takes over. It uses the ACTION_ITEM_SCHEMA to prepare a concrete schedule. Crucially, it validates dates against the calendar (_validate_and_fix_llm_response) to ensure "Monday" maps to a real future date like 2025-11-18.
- Execute (Commitment): Execution happens when the system writes these validated records to the action_items collection. This transitions the action from a "suggestion" to a "commitment" tracked by the system.
2. Confirmation Loops
- Implicit Confirmation: For lower-risk actions (like saving a conversation memory), the system acts autonomously via Mem0 in the background.
- Explicit Confirmation: For high-value actions (like creating a Weekly Goal), nothing is committed until the user explicitly asks: the user requests a plan, and only then does the system generate and store it.
3. Reversibility & Conflict Handling
- Immutable Logs, Mutable Plans: While health data logs (snapshots) are immutable to preserve history, Goal Plans are mutable. If a user's context changes (e.g., "I hurt my knee"), the system can generate a new plan that supersedes the old one.
- Soft Deletion: "Deleting" a goal (delete_goal) performs a soft delete or cascades to related items (action items), allowing for data recovery or audit trails.
4. Failure Handling
- Graceful Degradation: If the Vector Store or an external API (like MealoLogic) fails during a chat, the ChatService catches the exception and continues with whatever context is available (e.g., just User Profile and Vitals), ensuring the user still gets a helpful, albeit less rich, response.
- Retry Policies: Critical external calls (LLM generation, Dexcom Sync) are wrapped in retry logic with exponential backoff to handle transient network blips without crashing the user session.
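A minimal sketch of such a retry wrapper, assuming illustrative delay values and a deliberately broad exception catch; the production policy narrows this to specific transient error types.

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def call_with_backoff(fn, *args, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a transient external call (LLM generation, Dexcom sync) with
    exponential backoff plus jitter instead of failing the user session."""
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except Exception as exc:  # narrow to Timeout/RateLimitError in real code
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay)
            await asyncio.sleep(delay)
```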
7. Safety & Guardrails: Constraining Authority
Evra enforces safety through architectural constraints rather than relying solely on model training. We treat the LLM as a text processing engine, not a doctor, by wrapping it in deterministic logic layers.
Figure 7: Safety Pipeline and Risk Escalation
1. Scope & Output Constraints
- Pre-Flight Guardrail: Before processing any medical logic, a specialized lightweight LLM (_health_relevance_check_node) classifies the query. If the intent is NOT_HEALTH (e.g., politics, coding), the system hard-diverts to a refusal template, preventing the AI from engaging in out-of-scope hallucination.
- Structured Schemas: For critical outputs (Action Plans, Health Alerts), we force the LLM to output strict JSON schemas (Pydantic models). This physically prevents the model from generating conversational filler or ambiguous medical advice in these high-stakes fields.
2. Avoiding Implied Medical Authority
- System Injection: Every health-related prompt is prefixed with an immutable Safety Block (prompts.py). It explicitly forbids diagnostic language ("Do not provide diagnoses") and mandates a supportive, non-clinical tone.
- Context Grounding: The RAG architecture forces the model to generate answers based only on the retrieved context (User Vitals, Lab JSONs). If the context is missing, the model is instructed to ask clarifying questions rather than inferring medical facts from its training data.
3. Uncertainty & Escalation
- Deterministic Triggers: High-risk scenarios (e.g., Tachycardia) are identified by Python logic (hardcoded thresholds in health_alert_service.py), not the LLM. If a threshold is breached, the code injects a specific "High Risk Escalation" instruction into the prompt, forcing the LLM to recommend immediate medical attention.
- Symptom Mode: If the Classifier detects "SYMPTOM" intent, the prompt dynamically changes to "Inquiry Mode," forcing the AI to ask structured follow-up questions (Duration, Severity) instead of offering a solution.
8. Evaluation & Learnings: Real-World Performance
Our architecture has evolved significantly based on real-world friction points. The biggest lesson is that latency kills engagement and determinism beats cleverness.
Figure 8: Wins, Failures, and Architectural Shifts
1. What Worked (The Wins)
- LLM Extraction vs. Regex: We found that traditional PDF parsing (Regex/OCR) failed on complex, multi-column lab reports. Moving to an LLM-based Extraction Pipeline (lab_report_service.py using gpt-4o-mini) allowed us to capture 95%+ of lab test properties (Values, Units, Ranges) regardless of layout, proving that small, specialized models are superior for structured data tasks.
- Hybrid Sync Strategy: Syncing health data efficiently was challenging. We adopted a Hybrid Strategy: high-frequency metrics (Steps, HR) use Incremental Sync (appending raw arrays), while cumulative metrics (Sleep, Active Energy) use a Windowed Query (snapshotting daily totals). This prevents data gaps while minimizing database write load.
- Fine-Tuned Personalization: Using a fine-tuned model (ft:o4-mini...) for the PersonalizationProfileService drastically improved the quality of the "User Summary." It consistently identifies metabolic types and lifestyle pillars faster and more accurately than a generic prompt, anchoring the entire system's advice.
2. What Broke (The Failures)
- Context Overload Latency: Early versions tried to inject everything (full vector docs, raw lab JSONs, and meal catalogs) into the chat prompt sequentially. This caused 15s+ delays. We learned we cannot block the UI; we had to implement Parallel Context Retrieval (asyncio.gather) to fetch Vector Docs, Lab Summaries, Vitals, and Meal Recommendations simultaneously, keeping chat latency low despite the massive context window (see the sketch after this list).
- Date Hallucination: The LLM consistently failed to understand relative time (e.g., "Next Monday"), generating plans with placeholders like YYYY-MM-DD. This broke the database schema. We learned that Time is a Deterministic Concept that must be handled by code, not AI.
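The parallel retrieval fix mentioned above might look roughly like this; the fetcher names and return shapes are placeholders for the real Vector Store, lab, vitals, and MealoLogic calls.

```python
import asyncio

# Placeholder retrieval coroutines; the real system calls the Vector Store,
# lab summaries, vitals DB, and MealoLogic here.
async def fetch_vector_docs(query: str) -> list[str]: return []
async def fetch_lab_summary(user_id: str) -> str: return ""
async def fetch_vitals_snapshot(user_id: str) -> dict: return {}
async def fetch_meal_recommendations(user_id: str) -> list[dict]: return []

async def gather_context(user_id: str, query: str) -> dict:
    """Fire all context fetches concurrently; a failed source degrades to None
    instead of blocking or failing the whole chat turn."""
    results = await asyncio.gather(
        fetch_vector_docs(query),
        fetch_lab_summary(user_id),
        fetch_vitals_snapshot(user_id),
        fetch_meal_recommendations(user_id),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    keys = ["vector_docs", "lab_summary", "vitals", "meals"]
    return {k: (None if isinstance(r, Exception) else r) for k, r in zip(keys, results)}
```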
3. Architectural Shifts
- Logic Injection over Prompting: We moved from trying to prompt the LLM to "be accurate with dates" to writing Python validators (_validate_and_fix_llm_response). These intercept the LLM output, look up the actual calendar dates for the user's current week, and rewrite the JSON payload before it ever hits the database.
- Grounded RAG with GPT-5.1: To combat hallucination with our large context window (Medical Docs + Labs + Goals + MealoLogic), we switched to a highly structured RAG prompt (ChatPrompts). It enforces a strict hierarchy: User Memories override General Advice, and Lab Data overrides Generic Guidelines. This ensures the advanced GPT-5.1 model remains grounded in the user's specific reality.
9. Failure Modes: Detection & Mitigation
The Evra system is designed with a Defense-in-Depth strategy. Instead of assuming perfect AI behavior or 100% API uptime, the architecture anticipates failure at the dependency, logic, and data layers, implementing specific "Safety Nets" for each.
Figure 9: Failure Detection and Mitigation Logic
1. External Dependency Failure (APIs)
- The Failure: Critical external services (OpenAI, Dexcom, MealoLogic) time out, return 500 errors, or hit rate limits.
- Detection: try/except blocks wrapping every external call, logging specific error types (e.g., RateLimitError, Timeout).
- Mitigation:
  - Retry Policy: Critical calls (LLM generation, Dexcom Sync) use exponential backoff strategies (e.g., max_retries=5 in health_alert_service.py).
  - Graceful Degradation: If the Vector Store or MealoLogic fails during a chat, the ChatService catches the exception and proceeds with Partial Context. The user receives a response based on available data (e.g., Vitals only) rather than an error screen.
2. Logic & Hallucination Failure (The "Drift")
- The Failure: The LLM generates invalid JSON, invents non-existent dates (e.g., 2025-02-30), or uses placeholder strings (YYYY-MM-DD) instead of concrete data.
- Detection: Pydantic schema validation fails on output parsing; custom regex checks identify placeholder patterns in the GoalsService.
- Mitigation:
  - Auto-Fixer: The _validate_and_fix_llm_response function intercepts invalid dates and algorithmically maps them to the correct YYYY-MM-DD based on the current calendar week logic, silently correcting the AI's mistake before database commit (see the sketch after this list).
  - Refusal Fallback: If the scope guardrail is unsure, it defaults to a safe "Informational" mode rather than risking a medical hallucination.
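A minimal sketch of that auto-fixing step; the placeholder regex, the hypothetical day hint field, and the today-fallback are assumptions standing in for _validate_and_fix_llm_response.

```python
import re
from datetime import date, timedelta

PLACEHOLDER = re.compile(r"YYYY-MM-DD", re.IGNORECASE)
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def validate_and_fix_dates(items: list[dict], today: date | None = None) -> list[dict]:
    """Intercept LLM output before the database commit: replace placeholder or
    invalid dates with concrete dates from the current calendar week."""
    today = today or date.today()
    monday = today - timedelta(days=today.weekday())
    week = {name: (monday + timedelta(days=i)).isoformat() for i, name in enumerate(WEEKDAYS)}

    for item in items:
        raw = str(item.get("date", ""))
        if PLACEHOLDER.search(raw) or not _is_valid_iso_date(raw):
            # Fall back to the weekday hint if present, otherwise anchor to today.
            item["date"] = week.get(item.get("day", ""), today.isoformat())
    return items

def _is_valid_iso_date(value: str) -> bool:
    try:
        date.fromisoformat(value)  # rejects 2025-02-30 and malformed strings
        return True
    except ValueError:
        return False
```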
3. Data Ingestion Failure (Docs & Streams)
- The Failure: A user uploads a corrupted PDF, an image-only PDF (unreadable by text extractors), or Dexcom tokens expire.
- Detection: pdf_bytes conversion returns empty strings; OAuth refresh flows return 401.
- Mitigation:
  - Fallback Metadata: If LLM extraction of a Lab Report fails, the system triggers _generate_fallback_metadata to create a basic record using filename heuristics, ensuring the file is saved even if unparsed.
  - Forced Re-Auth: Dexcom 401 errors trigger a specific disconnect flag, prompting the frontend to request re-authentication rather than silently failing background syncs.
10. What's Non-Trivial: Integration Challenges
Replicating Evra is difficult not because of any single component, but because of the "Three-Speed Integration" problem. The system must harmonize High-Frequency Streams (Device Vitals), Static Documents (PDFs), and Interactive Latency (Chat) into a single, medically safe narrative.
Figure 10: The Three-Speed Integration Problem
1. Integration Complexity (The Temporal Mismatch)
- The Problem: LLMs have no concept of "Time" or "State" across different data velocities. A Dexcom reading arrives every 5 minutes; a Lab Report arrives once a month.
- The Non-Trivial Solution: The Federated State Model. The system effectively "stops time" for the LLM by creating a Daily Snapshot. It aggregates high-speed streams into canonical summary statistics (Min/Max/Avg) and merges them with static document extracts. Building the logic to handle these collisions—overwriting summaries while appending raw logs—without data loss is a significant engineering hurdle.
2. Iteration Cycles (Code-Backed Prompting)
- The Problem: Standard prompt engineering hits a ceiling. You cannot prompt an LLM to reliably output "Next Monday's Date." It will eventually hallucinate.
- The Non-Trivial Solution: We moved from "Prompt Engineering" to "Logic Injection." The iteration cycle isn't just rewriting prompts; it involves writing Python validators (_validate_and_fix_llm_response) that intercept LLM outputs, check them against the system calendar, and rewrite the payload before it hits the database. This hybrid "Neural-Symbolic" loop takes weeks to tune properly.
3. Key Tradeoffs (Determinism vs. Flexibility)
- Safety over Flow: We explicitly traded conversational flexibility for safety. By moving alert logic out of the AI and into Python thresholds (_check_threshold_violations), the chatbot cannot "talk" a user out of a high heart rate alert. It makes the system feel rigid at times but prevents catastrophic medical advice.
- Latency over Completeness: We utilize Parallel Context Retrieval. Instead of waiting for a perfect deep-dive analysis, we fire simultaneous requests to Vector Store, Vitals DB, and Meal Services. If one hangs, we degrade gracefully rather than block, prioritizing the user's engagement loop over perfect data completeness.