LLMs alone aren't production intelligence. You need RAG to ground answers in your data, Context Engineering to make outputs policy- and role-aware, Agentic frameworks (e.g., LangGraph) to plan and act, LLMOps (e.g., LangSmith) to evaluate and observe, and MCP to standardize tool/data access across apps. Together, this is how you get reliable, explainable, and scalable GenAI in the enterprise.
The promise of LLMs—ChatGPT, Claude, Gemini—is seductive: just give them a prompt and get back intelligent text. But if you're building production AI for healthcare, finance, or any regulated enterprise, you'll quickly hit four walls: answers that aren't grounded in your data, outputs that ignore your policies and roles, one-shot prompts that can't plan or act, and integrations that must be rebuilt for every tool and system.
This is where the modern LLM stack comes in. It's not one technology—it's an orchestration of complementary patterns:
RAG (Retrieval-Augmented Generation) is the de facto way to cut hallucinations by injecting live, proprietary knowledge into prompts. It's not training; it's runtime context injection. Think of it as giving the LLM a cheat sheet before every answer.
Agentic systems (LangGraph, CrewAI, AutoGPT) turn one-shot prompts into plans with tools, memory, and multi-step control flow. An agent can decide: "First, I'll search the knowledge base. If that fails, I'll query the SQL database. Then I'll summarize and ask the user for confirmation." This is the difference between a chatbot and an AI co-worker.
MCP (Model Context Protocol) is emerging as the USB-C for AI apps, standardizing how models access tools, data, and prompts. Instead of writing custom connectors for Slack, Salesforce, and your EHR, you expose MCP servers. Any compliant agent can plug in and use them securely.
LangChain/LangGraph/LangSmith/LangServe give you the bricks for building, orchestrating, testing, and serving—all production-grade. LangChain is the Swiss Army knife for LLM chains; LangGraph adds stateful orchestration; LangSmith provides observability and evaluation; LangServe wraps it all as REST APIs.
A tier-1 health system we worked with was using GPT-4 to answer clinical guideline questions for care coordinators. Initial accuracy: 67%. After implementing RAG over their guideline corpus, context engineering for clinical roles, and LangSmith evaluation loops, accuracy improved substantially.
The difference? Not a better model—better architecture.
Below is a conceptual map of how these pieces fit together. Don't worry if it feels dense—we'll unpack each layer in the sections that follow.
What it is: RAG is a pattern that retrieves relevant knowledge (from databases, documents, EHRs, wikis, or any corpus) and injects it into the model's context window so outputs cite real, verifiable data instead of hallucinating.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# 1) Ingest & chunk
loader = PyPDFLoader("clinical_guidelines_2024.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)

# 2) Embed & index
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3) Retrieve with MMR for relevant but diverse chunks
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance for diversity
    search_kwargs={"k": 4, "fetch_k": 20}
)

# 4) Prompt with grounded context
prompt = ChatPromptTemplate.from_messages([
  ("system", """You are a clinical decision support assistant.
  Answer using ONLY the provided context. If the answer is not in the context, say "I don't have enough information."
  Always cite the source document and page number."""),
  ("human", "Question: {question}\n\nContext:\n{context}\n\nSources: {sources}")
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def answer_with_rag(question: str):
    # Retrieve relevant chunks
    docs = retriever.get_relevant_documents(question)

    # Format context and sources
    context = "\n\n".join([f"[Doc {i+1}] {d.page_content}" for i, d in enumerate(docs)])
    sources = "\n".join([f"- {d.metadata.get('source', 'Unknown')}, Page {d.metadata.get('page', 'N/A')}"
                          for d in docs])

    # Generate answer
    messages = prompt.format_messages(
        question=question,
        context=context,
        sources=sources
    )
    return llm.invoke(messages)

# Example usage
result = answer_with_rag("What is the recommended initial dose of vancomycin for septic shock?")
print(result.content)

Fine-tuning is expensive ($100K per model iteration), slow (days to weeks), and brittle (stale the moment your data changes). RAG, by contrast, is cheap to update, fast to deploy, and auditable: every answer can be traced back to the retrieved sources.
When to fine-tune instead: When you need to change the style or structure of outputs (e.g., "always respond in medical SOAP note format"), not the facts.
Challenge: A 500-bed hospital needed an AI assistant to help nurses and residents answer questions about 2,000+ pages of clinical protocols, updated quarterly.
Solution Architecture: a RAG pipeline over the protocol corpus (along the lines of the example above), re-indexed each quarter as protocols are updated.
Results: 89% accuracy on 500 test questions vs. 61% for ChatGPT alone. Average query time: 1.2s. Nurses reported 40% reduction in time spent searching protocols.
Goal: Apply role, policy, structure, and memory so outputs are consistent, compliant, and useful across your organization.
RAG solves "What does the model know?" Context engineering solves "How does the model reason?" It's the difference between an intern with access to the right files and a senior analyst who knows how to interpret those files in your company's context.
Define who the model is and who the user is. This shapes tone, depth, and risk tolerance.
"You are a clinical QA auditor with 10 years of HEDIS experience. The user is a care coordinator asking about diabetes management protocols. Prioritize patient safety over brevity. Always cite CMS guidelines when applicable."
Hard rules the model must follow. These are non-negotiable.
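Some constraints are best enforced in code before the model is ever called, not just stated in the prompt. A minimal sketch of a pre-flight PHI screen (the patterns and logger name are illustrative, not exhaustive):

import logging
import re

logger = logging.getLogger("ai.audit")

# Illustrative patterns only; production PHI detection needs a vetted library or service
PHI_PATTERNS = [
    re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.IGNORECASE),  # medical record numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                  # SSN-formatted strings
]

def guard_query(query: str) -> str | None:
    """Return the query if it is safe to forward; otherwise refuse and leave an audit trail."""
    if any(p.search(query) for p in PHI_PATTERNS):
        logger.warning("PHI detected; request refused (hash=%s)", hash(query))
        return None
    return query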
Track conversation history and user preferences (a minimal sketch follows the list below).
Short-term: Last 5-10 turns in the current session (stored in-memory or Redis)
Long-term: User profile, past decisions, organizational context (stored in PostgreSQL with RAG retrieval)
Example: "User prefers detailed technical explanations" or "This department always requests pediatric dosing tables"
Route queries to the right data source(s) based on intent (a runnable sketch follows the rules below).
if query_type == "policy": → RAG over guideline corpus
elif query_type == "patient_data": → SQL tool with row-level security
elif query_type == "pricing": → API call to ERP system
else: → General knowledge (base LLM)
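One way to implement these rules is a lightweight intent classifier that returns a routing label. A minimal sketch, assuming gpt-4o-mini as a cheap classifier; the route labels are illustrative:

from typing import Literal
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class RouteDecision(BaseModel):
    query_type: Literal["policy", "patient_data", "pricing", "general"]

# A small, cheap model is usually sufficient for intent classification
router = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RouteDecision)

def route(query: str) -> str:
    decision = router.invoke(f"Classify this query for data routing: {query}")
    if decision.query_type == "policy":
        return "rag_guidelines"   # RAG over guideline corpus
    if decision.query_type == "patient_data":
        return "sql_with_rls"     # SQL tool with row-level security
    if decision.query_type == "pricing":
        return "erp_api"          # API call to ERP system
    return "base_llm"             # General knowledge (base LLM)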
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Build a layered context template
system_prompt = """
[ROLE]
You are a Clinical Decision Support AI for Memorial Health System.
You assist care coordinators, nurses, and physicians with evidence-based guidance.
Your tone is professional, empathetic, and precise.

[CONSTRAINTS]
1. NEVER diagnose. Always recommend "consult a physician" for diagnostic questions.
2. Cite sources for every clinical recommendation (format: [Source: CDC MMWR 2024-12])
3. If the query involves PHI (patient names, MRNs), refuse and log to audit.
4. For medication dosing, include age/weight considerations and check for contraindications.
5. Uncertainty threshold: If confidence < 80%, say "I recommend consulting a specialist" and explain why.

[MEMORY CONTEXT]
User: {user_profile}
Recent topics: {recent_topics}
Department: {department}

[DATA ROUTING RULES]
- Clinical protocols → RAG retrieval from protocols_vectorstore
- Patient-specific data → SQL query with RLS (row-level security)
- Drug interactions → Call external API (Lexicomp/UpToDate)
- General medical knowledge → Base LLM knowledge (GPT-4)

[OUTPUT FORMAT]
1. Direct answer (2-3 sentences)
2. Clinical rationale (evidence-based, with sources)
3. Action items (if applicable)
4. Escalation trigger (if uncertainty > 20% or high-risk scenario)
"""

# Usage in a chain
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{query}")
])

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

# Invoke with context variables
response = llm.invoke(prompt.format_messages(
    user_profile="Nurse, 8 years cardiology experience, prefers detailed explanations",
    recent_topics="sepsis management, antibiotic stewardship",
    department="Critical Care ICU",
    query="What's the current guideline for vasopressor choice in septic shock?"
))

(This is where Finarb's consulting-led design shows up: we codify your organization's process, compliance requirements, and decision logic into the prompt and tooling layer—so outputs match how your teams actually work, not generic AI responses.)
Without context engineering, your LLM is a brilliant generalist—but it doesn't know your legal obligations, risk tolerance, or organizational norms. That's how you end up with an AI recommending a treatment your hospital doesn't offer, citing a guideline version that's been deprecated, or exposing PHI in logs. Context engineering closes the gap between "technically correct" and "safe for production."
LLMs don't just chat; they can plan—call tools, write code, check results, and iterate until a task is complete. This is agentic AI: systems that autonomously decompose goals, take actions, and adapt based on feedback.
LangGraph is the de facto framework for building stateful, production-grade agents. It's not a high-level "magic" library—it's a low-level orchestration tool that gives you explicit control over state, branching, loops, human-in-the-loop (HITL) checkpoints, and persistence.
Prompt-based agents (e.g., the ReAct pattern) keep their plan implicit in generated text, which makes them hard to debug or resume. LangGraph agents make state, branching, and control flow explicit, so every step can be inspected, checkpointed, and tested.
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
from typing import TypedDict, Annotated
import operator

# Define agent state
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # Append-only message history
    goal: str
    context: str
    next_step: str

# Node functions
def plan(state: AgentState):
    """Decide next step based on goal and context"""
    goal = state["goal"]
    if "guideline" in goal.lower() or "protocol" in goal.lower():
        return {"next_step": "search_rag"}
    elif "patient" in goal.lower() or "data" in goal.lower():
        return {"next_step": "query_sql"}
    else:
        return {"next_step": "respond_directly"}

def search_rag(state: AgentState):
    """Retrieve from vector store"""
    # Pseudo-code: actual implementation calls retriever
    retrieved = "Retrieved: Sepsis protocol v2.4, updated Jan 2025..."
    return {
        "context": state.get("context", "") + f"\n[RAG] {retrieved}",
        "next_step": "respond"
    }

def query_sql(state: AgentState):
    """Execute SQL with row-level security"""
    # Pseudo-code: actual implementation validates + executes query
    sql_result = "Query result: 3 patients match criteria..."
    return {
        "context": state.get("context", "") + f"\n[SQL] {sql_result}",
        "next_step": "respond"
    }

def respond(state: AgentState):
    """LLM synthesizes final answer"""
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    prompt = f"""Goal: {state['goal']}

Context gathered:
{state['context']}

Provide a clear, evidence-based answer with sources."""

    answer = llm.invoke(prompt).content
    return {
        "messages": [AIMessage(content=answer)],
        "next_step": END
    }

# Build graph
graph = StateGraph(AgentState)

# Add nodes
graph.add_node("plan", plan)
graph.add_node("search_rag", search_rag)
graph.add_node("query_sql", query_sql)
graph.add_node("respond", respond)

# Add edges (control flow)
graph.set_entry_point("plan")
graph.add_conditional_edges(
    "plan",
    lambda state: state["next_step"],
    {
        "search_rag": "search_rag",
        "query_sql": "query_sql",
        "respond_directly": "respond"
    }
)
graph.add_edge("search_rag", "respond")
graph.add_edge("query_sql", "respond")
graph.add_edge("respond", END)

# Compile and run
agent = graph.compile()
result = agent.invoke({
    "messages": [HumanMessage(content="What are the latest sepsis protocols?")],
    "goal": "What are the latest sepsis protocols?",
    "context": "",
    "next_step": ""
})
print(result["messages"][-1].content)

LangGraph is a low-level orchestration framework for long-running, stateful agents, with explicit state, conditional branching, loops, human-in-the-loop checkpoints, and persistence.
Scenario: A health system needs an AI assistant to help care coordinators prepare for patient discharge.
Implementation: A LangGraph agent with 7 nodes covering the workflow end to end: fetch_patient_data → check_vitals → check_guidelines → generate_instructions → flag_risks → [HITL gate] → finalize
Result: Average discharge prep time dropped from 45 minutes to 12 minutes. Error rate (missing contraindications) dropped from 8% to 0.4%.
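The HITL gate in that workflow maps directly onto LangGraph's checkpointing and interrupt features. A minimal, self-contained sketch; the node and state names mirror the case study and are illustrative:

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class DischargeState(TypedDict):
    patient_id: str
    draft: str
    approved: bool

def generate_instructions(state: DischargeState):
    # Placeholder for the LLM call that drafts discharge instructions
    return {"draft": f"Discharge instructions for patient {state['patient_id']}..."}

def finalize(state: DischargeState):
    # Only reached after a human approves at the interrupt
    return {"approved": True}

graph = StateGraph(DischargeState)
graph.add_node("generate_instructions", generate_instructions)
graph.add_node("finalize", finalize)
graph.set_entry_point("generate_instructions")
graph.add_edge("generate_instructions", "finalize")
graph.add_edge("finalize", END)

# interrupt_before pauses execution at the gate; MemorySaver persists state per thread
agent = graph.compile(checkpointer=MemorySaver(), interrupt_before=["finalize"])

config = {"configurable": {"thread_id": "discharge-1234"}}
agent.invoke({"patient_id": "1234", "draft": "", "approved": False}, config)  # pauses before finalize

# ...a care coordinator reviews the draft in the UI and approves...

agent.invoke(None, config)  # resumes from the checkpoint and runs finalize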
What is MCP? The Model Context Protocol is an open standard (think "OpenAPI for AI") that lets AI apps and agents plug into data resources, callable tools, and reusable prompts through a common interface—regardless of vendor.
Think of it as USB-C for AI. Before USB-C, every device had its own charging cable. Before MCP, every AI app had to write custom connectors for Slack, Salesforce, your EHR, etc. MCP fixes this.
Resources: URIs that expose data, e.g., ehr://patients/{id} and crm://accounts/{id}.
The MCP server handles auth, permissions, and formatting. The agent just requests the resource.
Tools: Callable functions with typed schemas, e.g., create_ticket(title, priority) and run_sql(query).
Similar to OpenAI function calling, but standardized across providers.
Prompts: Reusable prompt templates such as clinical_audit_template and phi_redaction_prompt.
Share battle-tested prompts across assistants without copy-paste.
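To make the three primitives concrete, here is a minimal server sketch using the official Python SDK's FastMCP helper (assuming the mcp package is installed; the EHR lookups are placeholders, not a real integration):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ehr-mcp-server")

# Resource: data addressed by a URI; the server enforces auth and permissions
@mcp.resource("ehr://patients/{patient_id}")
def patient_record(patient_id: str) -> str:
    # Placeholder; a real server would query the EHR under the caller's permissions
    return f'{{"patient_id": "{patient_id}", "vitals": "..."}}'

# Tool: a callable function whose schema is derived from the type hints
@mcp.tool()
def get_patient_summary(patient_id: str, include_labs: bool = False) -> dict:
    # Placeholder summary
    return {"patient_id": patient_id, "medications": [], "labs": [] if include_labs else None}

# Prompt: a reusable template any connected assistant can request
@mcp.prompt()
def phi_redaction_prompt(text: str) -> str:
    return f"Redact all PHI (names, MRNs, DOBs, addresses) from: {text}"

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport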
Write one EHR connector (MCP server). Every AI assistant in your organization can use it—no more per-app integrations.
The MCP spec includes first-class support for user consent, audit logging, and least-privilege access. The AI can't just "read everything."
Works with LangChain agents, OpenAI Assistants API, Anthropic Claude, or any compliant client. Not locked into one ecosystem.
An illustrative capability manifest for an EHR-facing MCP server:

{
  "server": "ehr-mcp-server",
  "version": "1.0",
  "features": {
    "resources": [
      {
        "uri": "ehr://patients/{patient_id}",
        "description": "Retrieve patient demographics and vitals",
        "permissions": ["read:patient_data"]
      },
      {
        "uri": "ehr://labs/{patient_id}",
        "description": "Retrieve recent lab results",
        "permissions": ["read:lab_data"]
      }
    ],
    "tools": [
      {
        "name": "get_patient_summary",
        "description": "Fetch a structured summary of patient demographics, vitals, medications, and recent visits",
        "parameters": {
          "patient_id": {"type": "string", "required": true},
          "include_labs": {"type": "boolean", "default": false}
        },
        "returns": {"type": "object", "schema": "PatientSummary"}
      },
      {
        "name": "run_sql_query",
        "description": "Execute a read-only SQL query against the clinical data warehouse (RLS applied)",
        "parameters": {
          "query": {"type": "string", "required": true}
        },
        "returns": {"type": "array"}
      }
    ],
    "prompts": [
      {
        "name": "clinical_audit_template",
        "description": "Prompt template for auditing clinical notes against HEDIS measures",
        "template": "You are a clinical QA auditor. Review the following note for completeness against HEDIS measure {measure_id}: {note_text}"
      },
      {
        "name": "phi_redaction_prompt",
        "description": "Prompt to redact PHI from text",
        "template": "Redact all PHI (names, MRNs, DOBs, addresses) from: {text}"
      }
    ]
  }
}
}

Takeaway: MCP lets your data and tools become first-class citizens that any compliant agent can discover, request permission for, and use safely. It's not just "API wrappers"—it's a protocol that enforces consent, schema validation, and audit trails out of the box.
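On the client side, a compliant agent discovers a server's capabilities and calls them through a session. A minimal sketch with the mcp Python SDK, assuming the server above is saved as ehr_mcp_server.py and speaks stdio:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the MCP server as a subprocess speaking stdio
    server = StdioServerParameters(command="python", args=["ehr_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # capability discovery
            result = await session.call_tool(
                "get_patient_summary",
                arguments={"patient_id": "12345", "include_labs": True},
            )
            print(tools, result)

asyncio.run(main())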
Phase 1 (Pilot): Identify 2-3 high-value data sources (e.g., EHR, CRM, Data Warehouse). Build MCP servers for them. Test with one AI assistant.
Phase 2 (Scale): Standardize on MCP for all new AI integrations. Deprecate custom connectors. Train AI teams on MCP client libraries.
Phase 3 (Ecosystem): Publish internal MCP server catalog. Enable any team to spin up an AI assistant that auto-discovers available tools/data via MCP registry.
Timeline: 3-6 months for most enterprises. Payoff: 10x faster time-to-market for new AI use cases.
You've built a RAG pipeline, engineered your context, and wired up agents. Now how do you know if it's good? How do you catch regressions? How do you deploy it securely?
This is where LLMOps (LLM Operations) comes in: the MLOps equivalent for generative AI. Two tools dominate the LangChain ecosystem:
What it does: Trace every LLM call, tool invocation, and agent step. Build eval datasets from production traces. Run evals (LLM-as-judge + human review). Monitor quality, cost, and latency in real-time.
What it does: Wrap any LangChain chain or LangGraph agent as a FastAPI endpoint with auto-generated OpenAPI docs, client SDKs, and authentication.
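A minimal serving sketch, assuming the langserve and fastapi packages are installed; the path, app, and chain names are illustrative:

from fastapi import FastAPI
from langserve import add_routes
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI(title="Clinical Assistant API")

# Any LangChain runnable (a chain or a compiled LangGraph agent) can be served
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful clinical assistant."),
    ("human", "{question}")
])
chain = prompt | ChatOpenAI(model="gpt-4o", temperature=0)

# Registers /clinical-assistant/invoke, /batch, and /stream plus OpenAPI docs
add_routes(app, chain, path="/clinical-assistant")

# Run with: uvicorn main:app --port 8000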
Deployment really is the one-liner add_routes(app, my_chain).

And here's LangSmith tracing in action:

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set LangSmith environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_api_key"
os.environ["LANGCHAIN_PROJECT"] = "clinical-assistant-prod"  # Project name in UI

# Your chain (nothing special needed—tracing is automatic)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful clinical assistant."),
    ("human", "{question}")
])
llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = prompt | llm

# Run it—LangSmith auto-captures everything
result = chain.invoke({"question": "What are the signs of sepsis?"})
print(result.content)
# In LangSmith UI: see full trace with latency breakdown, token counts, cost estimate

Don't wait until production to evaluate. Use LangSmith's eval suite in CI/CD:
langsmith eval run --dataset golden-set

This is how Anthropic, OpenAI, and top AI teams ship reliably. Evaluation is not optional.
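For teams that prefer driving evals from the Python SDK rather than the CLI, here is a minimal sketch, assuming the langsmith package's evaluate helper and an existing LangSmith dataset named golden-set (argument names can vary slightly across SDK versions; the evaluator is a deliberately naive exact-match check):

from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful clinical assistant."),
    ("human", "{question}")
])
chain = prompt | ChatOpenAI(model="gpt-4o", temperature=0)

def predict(inputs: dict) -> dict:
    # The target under test: maps a dataset example's inputs to the system's outputs
    return {"answer": chain.invoke({"question": inputs["question"]}).content}

def exact_match(run, example) -> dict:
    # Naive evaluator; swap in LLM-as-judge or a clinical rubric in practice
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": float(predicted.strip() == expected.strip())}

results = evaluate(
    predict,
    data="golden-set",                    # existing LangSmith dataset
    evaluators=[exact_match],
    experiment_prefix="clinical-assistant-ci",
)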
Below is a conceptual architecture diagram showing how RAG, Context Engineering, Agentic AI (LangGraph), MCP, and LLMOps (LangSmith/LangServe) interconnect in a production LLM stack.
Graph-aware retrieval, hybrid sparse+dense, reranking, provenance tracking.
Long-running agents with robust state, guardrails, and HITL checkpoints (LangGraph direction).
MCP unifying how assistants connect to your systems across vendors.
Eval-driven iteration (LangSmith) becomes table stakes for compliance and ROI.
As a consult-to-operate partner, we don't just wire components—we design the decision system around your business: