No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

A practical guide to the generative AI lifecycle


Building production-ready generative AI applications requires more than just picking a model and writing prompts. It demands a structured approach that considers business value, technical feasibility, and long-term sustainability.

I used the AWS Well-Architected Generative AI Lens as my framework—a six-phase lifecycle that forces you to think before you build.

[Figure: Generative AI Lifecycle]

In this post, I’ll walk through each phase and show how I applied this framework to build four AI features for FlowSpace—an assignment management app for students and educators.

The Four Features

Before diving into the lifecycle, here’s what we’re building:

  1. AI Tutor (Socratic Engine) - Interactive problem-solving with MCQ verification and concept mastery tracking
  2. Brain Dump & Assignment Extraction - Extract actionable tasks from natural language or uploaded documents (syllabi, assignment sheets)
  3. Task Priority Engine - Intelligent task ranking with personalized recommendations
  4. Professor Fix Package & Tutor Visualization Engine - Educational content packages with misconception corrections, and interactive visualizations for tutoring

Each feature has different requirements, leading to different decisions at each phase.


Phase 1: Scoping - Start with the Business Problem

The scoping phase prioritizes understanding the business problem before any technical decisions. I ask: Is generative AI the right solution? What does success look like? What are the real costs?

How I Scoped Each Feature

AI Tutor (Socratic Engine)

The core question: What happens when the model gives a wrong answer?

This isn’t a simple Q&A chatbot. The Socratic Engine guides students through problems step-by-step, generates MCQs to verify understanding after each step, and tracks concept mastery using EMA scoring. It also provides meta-actions: Visualize (explain with diagrams), Breakdown (decompose problems), and Ask (free-form questions).
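The post does not show how the EMA score is computed, so the following is only a minimal sketch of an exponential moving average update, assuming mastery is a value between 0 and 1 and each MCQ result counts as 1 (correct) or 0 (incorrect). The function name `update_mastery` and the smoothing factor `alpha = 0.3` are hypothetical choices for illustration, not FlowSpace's actual values.

```python
def update_mastery(previous: float, correct: bool, alpha: float = 0.3) -> float:
    """Blend the latest MCQ outcome into a running mastery score (EMA).

    alpha is a hypothetical smoothing factor: higher values weight
    recent answers more heavily.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    observation = 1.0 if correct else 0.0
    return alpha * observation + (1.0 - alpha) * previous


# Example: a student starts at 0.5, then answers right, right, wrong.
score = 0.5
for outcome in [True, True, False]:
    score = update_mastery(score, outcome)
```

An EMA keeps only one number per concept, which suits the on-device state described later, and a single wrong answer lowers the score without erasing earlier progress.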

A wrong response doesn’t just waste time—it teaches incorrect concepts that the mastery system then reinforces. The cost of hallucination (student learning wrong material, eroded trust, academic harm) far exceeds the cost of using a premium model.

Success metric: < 2% factual error rate in step explanations and MCQ generation

Cost tolerance: High—accuracy trumps cost

Brain Dump & Assignment Extraction

The core question: What are the failure modes?

Two critical requirements emerged:

  1. No invented tasks - The model must never hallucinate tasks that weren’t mentioned
  2. No missed tasks - When all generated subtasks are completed, the original assignment must be done

Students might brain dump multiple times daily, so cost matters. But the real constraint was finding a model with both low hallucination and high completeness: a cheap model that misses tasks fails the user just as much as one that invents phantom tasks.

Success metric: Zero invented tasks, > 95% task coverage

Cost tolerance: Low—but not at the expense of reliability
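To make the two success metrics concrete, here is a rough sketch of how one labeled test case could be scored. It assumes a hand-labeled list of expected tasks and exact matching after whitespace and case normalization. Real evaluation would likely need fuzzier matching or human review. `score_extraction`, `meets_targets`, and the sample tasks are hypothetical.

```python
from typing import NamedTuple


class ExtractionScore(NamedTuple):
    coverage: float  # fraction of expected tasks that were extracted
    invented: int    # extracted tasks with no match in the expected set


def normalize(task: str) -> str:
    """Lowercase and collapse whitespace so trivial differences still match."""
    return " ".join(task.lower().split())


def score_extraction(expected: list[str], extracted: list[str]) -> ExtractionScore:
    gold = {normalize(t) for t in expected}
    got = {normalize(t) for t in extracted}
    coverage = len(gold & got) / len(gold) if gold else 1.0
    return ExtractionScore(coverage=coverage, invented=len(got - gold))


def meets_targets(score: ExtractionScore, min_coverage: float = 0.95) -> bool:
    """Apply the two targets: zero invented tasks and > 95% coverage."""
    return score.invented == 0 and score.coverage > min_coverage


result = score_extraction(
    expected=["Read chapter 3", "Submit lab report"],
    extracted=["read  chapter 3", "Email professor"],
)
# result.coverage is 0.5 and result.invented is 1, so this case fails both targets.
```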

Task Priority Engine

The core question: Do we even need an LLM for this?

Initially, I planned to use an LLM for nuanced prioritization. But during evaluation, I realized: prioritization is fundamentally about applying consistent rules to structured data. The Eisenhower matrix (urgent/important) combined with custom variables (deadline proximity, estimated effort, current progress) could handle this deterministically—faster, cheaper, and more predictable than an LLM.
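A deterministic ranking along these lines could look like the sketch below. The `Task` fields, the two-day urgency window, and the tie-breaking order (earlier deadline first, then more remaining work first) are assumptions for illustration. The post names the inputs but not the exact rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Task:
    title: str
    due: datetime
    important: bool
    estimated_hours: float
    progress: float  # 0.0 (not started) to 1.0 (done)


def quadrant(task: Task, now: datetime, urgent_within: timedelta = timedelta(days=2)) -> int:
    """Eisenhower quadrant: 1 = urgent and important ... 4 = neither."""
    urgent = task.due - now <= urgent_within
    if urgent and task.important:
        return 1
    if task.important:
        return 2
    if urgent:
        return 3
    return 4


def remaining_hours(task: Task) -> float:
    return task.estimated_hours * (1.0 - task.progress)


def rank(tasks: list[Task], now: datetime) -> list[Task]:
    """Sort by quadrant, then deadline, then largest remaining effort."""
    return sorted(
        tasks,
        key=lambda t: (quadrant(t, now), t.due, -remaining_hours(t), t.title),
    )


now = datetime(2026, 1, 10, 9, 0)
tasks = [
    Task("Club form", datetime(2026, 1, 11), important=False, estimated_hours=0.5, progress=0.0),
    Task("Reading", datetime(2026, 1, 20), important=True, estimated_hours=3.0, progress=0.2),
    Task("Essay draft", datetime(2026, 1, 11), important=True, estimated_hours=4.0, progress=0.5),
]
ordered = [t.title for t in rank(tasks, now)]
```

Because the same inputs always produce the same order, the ranking can be unit-tested and explained to users, which is harder to guarantee with an LLM.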

Success metric: User follows recommendation > 70% of the time

Cost tolerance: Zero—no LLM means no per-request cost

Professor Fix Package & Tutor Visualization Engine

The core question: What content do professors actually need?

Two distinct but related needs emerged:

  1. Fix Packages - Ready-made educational content to address common student misconceptions: examples that correct wrong mental models, clear explanations, and step-by-step corrections
  2. Tutor Visualizations - Interactive diagrams and visual aids that clarify abstract concepts during tutoring sessions

Both require strong code generation (for interactive visualizations) and pedagogical accuracy. The content must be correct—a visualization that reinforces a misconception is worse than no visualization at all.

Success metric: Generated content is pedagogically accurate, and visualizations render correctly > 95% of the time

Cost tolerance: Medium—quality over cost, but not unlimited

Scoping Summary

| Feature | Primary Risk | Cost Tolerance | Complexity |
| --- | --- | --- | --- |
| AI Tutor | Hallucination harms learning | High | Single model |
| Brain Dump | Invented or missed tasks | Low-Mid | Single model |
| Priority Engine | Inconsistent recommendations | Zero | No LLM needed |
| Fix Package/Viz | Pedagogical accuracy matters | Medium | Single model |

Phase 2: Model Selection - Match Capabilities to Requirements

There is no ‘silver-bullet’ model—it’s about matching capabilities to your specific constraints.

My Evaluation Process

I considered a wide range of options: Amazon Bedrock models (Claude, Titan), OpenAI’s GPT-4o and GPT-5.2, Google’s Gemini family, and even on-device options like Apple’s Foundation Models.

Bedrock was tempting—easy cost management (I’m already on AWS), seamless integration with Lambda, unified billing. But many Bedrock models disappointed on price-performance ratio for my specific use cases.

Selection criteria:

No amount of benchmark reading replaces actually testing with your data.

How I Selected Models

AI Tutor (Socratic Engine)

Requirements drove me toward higher-tier models:

I evaluated models on math and science tutoring benchmarks, specifically looking at reasoning trace quality, not just final answer accuracy. The model needed to “show its work” correctly.

The choice came down to Claude Opus 4.5 vs Gemini 3.0 Pro. While Opus 4.5’s pricing had dropped, Gemini 3.0 Pro was still more cost-effective. In practice, 3.0 Pro’s reasoning quality exceeded my expectations for tutoring use cases.

Selected: Gemini 3.0 Pro

Brain Dump & Assignment Extraction

This is fundamentally an extraction task, but with a tricky balance: the model must extract all relevant tasks (completeness) without inventing tasks that weren’t there (no hallucination).

Evaluation criteria:

I tested multiple models including on-device options like Apple’s Foundation Models. Some cheap models had good coverage but invented tasks. Others never hallucinated but missed obvious items. Finding the right balance took about two months of iteration. (I’ll share the detailed lessons from evaluating Apple’s Foundation Models in a future post.)

Selected: Gemini 2.5 Flash (no thinking mode)

Professor Fix Package & Tutor Visualization Engine

Model longevity was a key concern. Gemini 2.0 Flash was scheduled for deprecation in February 2026, and when 3.0 Pro was released, I knew 3.0 Flash was imminent. Rather than building on a soon-to-be-deprecated model, I waited.

When Gemini 3.0 Flash launched, I ran early evaluations on:

The results were exceptional—better than models I’d used previously at similar cost points, especially for generating educational content with embedded code for visualizations.

Selected: Gemini 3.0 Flash

Model Selection Summary

| Feature | Model Selected | Key Selection Criterion |
| --- | --- | --- |
| AI Tutor | Gemini 3.0 Pro | Reasoning quality at reasonable cost |
| Brain Dump | Gemini 2.5 Flash (no thinking) | Low hallucination + high coverage |
| Fix Package/Viz | Gemini 3.0 Flash | Educational content + visualization quality |

Phase 3: Model Customization - Make It Yours

I haven’t done traditional model customization (fine-tuning, RLHF) for these features yet. Instead, I’m building a modular prompt management system inspired by Claude’s Agent Skills.

The Architecture: Feature Lambdas + Bedrock Orchestrator

Each AI feature has its own Lambda endpoint with tailored prompts and functionality. For the Tutor (the most complex feature), a cheap Bedrock model acts as the orchestrator—selecting the most appropriate skill file from S3 based on the request.

Client Request
    → Feature-specific Lambda
    → (Tutor only) Bedrock orchestrator selects skill from S3
    → Assemble prompt with context + instructions
    → Call AI API (Gemini)
    → Validate response against schema
    → Return structured result

Why this approach?

  1. Feature isolation - Each Lambda has a focused responsibility, so it is easier to debug and scale independently
  2. Smart routing (Tutor) - Cheap orchestrator model picks the right skill without hardcoded logic
  3. Hot-swappable skills - Update prompts in S3 without redeploying
  4. Cost visibility - Per-feature usage tracking
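The routing and prompt-assembly steps can be sketched roughly as follows. Everything here is a stand-in: `skills` is a plain dictionary where the real system would read skill files from S3, `choose_skill` is a stub where the real system would call the cheap orchestrator model, and the prompt layout is an assumption. The sketch also assumes the default skill always exists in the store.

```python
from typing import Callable, Mapping

Turn = dict[str, str]  # hypothetical shape: {"role": ..., "content": ...}


def assemble_prompt(skill_text: str, context: list[Turn], request: str) -> str:
    """Combine skill instructions, conversation context, and the new request."""
    history = "\n".join(f"{turn['role']}: {turn['content']}" for turn in context)
    return (
        f"{skill_text}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Student request:\n{request}"
    )


def build_tutor_prompt(
    request: str,
    context: list[Turn],
    skills: Mapping[str, str],
    choose_skill: Callable[[str, list[str]], str],
    default_skill: str = "socratic",
) -> str:
    """Ask the orchestrator for a skill name, falling back if it is unknown."""
    chosen = choose_skill(request, sorted(skills))
    if chosen not in skills:
        chosen = default_skill  # never trust a skill name the store does not have
    return assemble_prompt(skills[chosen], context, request)


# In-memory stand-ins for the S3 skill files and the orchestrator model.
skills = {
    "socratic": "Guide with questions; never give the answer directly.",
    "visualize": "Explain the idea with a labeled diagram.",
}


def keyword_router(request: str, names: list[str]) -> str:
    return "visualize" if "diagram" in request.lower() else "socratic"


prompt = build_tutor_prompt("Can I get a diagram?", [], skills, keyword_router)
```

Keeping the skill text outside the function is what makes the skills hot-swappable: changing a prompt means replacing a stored file, not redeploying code.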

(For why I chose Lambda over alternatives like Amazon Bedrock AgentCore, see my infrastructure decision analysis.)

How I Customized Each Feature

AI Tutor (Socratic Engine)

The Tutor is the most complex feature, so it gets dedicated S3 skill definitions. The orchestrator routes tutoring requests to skills containing:

  1. Socratic questioning patterns - Never give direct answers; guide through questions
  2. MCQ generation rules - Questions must verify understanding, not just recall
  3. Concept extraction schema - Normalized concept names for consistent mastery tracking

The client sends the full conversation context with each request, keeping the Lambda stateless. Session state (current step, hint level, mastery scores) is managed on-device.
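A stateless handler in this style might look like the sketch below. It assumes an API Gateway proxy-style event whose `body` is a JSON string. The field names (`messages`, `current_step`, `hint_level`, `mastery`) are hypothetical labels for the session state listed above, and the model call is left as a placeholder.

```python
import json
from typing import Any

# Hypothetical field names for the state the client sends with every request.
REQUIRED_FIELDS = ("messages", "current_step", "hint_level", "mastery")


def _response(status: int, payload: dict[str, Any]) -> dict[str, Any]:
    return {"statusCode": status, "body": json.dumps(payload)}


def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Rebuild everything needed from the request; keep nothing between calls."""
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body is not valid JSON"})
    if not isinstance(body, dict):
        return _response(400, {"error": "body must be a JSON object"})

    missing = [field for field in REQUIRED_FIELDS if field not in body]
    if missing:
        return _response(400, {"error": f"missing fields: {', '.join(missing)}"})

    # Placeholder: the real handler would assemble a prompt and call the model here.
    return _response(
        200,
        {"received_turns": len(body["messages"]), "current_step": body["current_step"]},
    )
```

Since no state lives in the function, any instance can serve any request, at the cost of larger request payloads as conversations grow.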

Guardrails (AWS best practice): I’m implementing guardrails to enforce academic integrity:

Brain Dump & Assignment Extraction

Strict schema enforcement was the primary customization:

Extract ONLY tasks explicitly mentioned or clearly implied.
Never invent tasks the user didn't reference.
If text is ambiguous, extract nothing rather than guess.

The prompt includes few-shot examples showing correct extraction (including examples of what NOT to extract). Output schema validation rejects responses that don’t conform.
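As an illustration of the validation step, here is a minimal sketch using only the standard library. The schema itself (a top-level `tasks` list whose items have exactly `title` and `due_date`) is an assumption, since the post does not show the real one. Note that an empty list is valid, which matches the "extract nothing rather than guess" rule.

```python
import json
from typing import Any


class SchemaError(ValueError):
    """Raised when the model output does not match the expected shape."""


def parse_tasks(raw: str) -> list[dict[str, Any]]:
    """Parse model output and reject anything outside the hypothetical schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or set(data) != {"tasks"}:
        raise SchemaError("expected an object with exactly one key: 'tasks'")
    tasks = data["tasks"]
    if not isinstance(tasks, list):
        raise SchemaError("'tasks' must be a list")

    for index, task in enumerate(tasks):
        if not isinstance(task, dict) or set(task) != {"title", "due_date"}:
            raise SchemaError(f"task {index}: expected keys 'title' and 'due_date'")
        if not isinstance(task["title"], str) or not task["title"].strip():
            raise SchemaError(f"task {index}: 'title' must be a non-empty string")
        if task["due_date"] is not None and not isinstance(task["due_date"], str):
            raise SchemaError(f"task {index}: 'due_date' must be a string or null")
    return tasks


valid = parse_tasks('{"tasks": [{"title": "Submit lab report", "due_date": null}]}')
```

Schema checks catch malformed output, but they cannot tell whether a well-formed task was actually mentioned by the user, so they complement the prompt rules rather than replace them.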

Professor Fix Package & Tutor Visualization Engine

Prompt templates optimized for educational content generation:

The key innovation: each “fix package” combines narrative explanation with interactive visualization code, generated together to ensure conceptual alignment.

Customization Summary

| Feature | Primary Technique | Core Focus |
| --- | --- | --- |
| AI Tutor | S3 skill injection | Socratic patterns + MCQ generation |
| Brain Dump | Schema enforcement | Strict extraction-only rules |
| Fix Package/Viz | Template prompts | Paired explanation + visualization |

(In Progress) Phase 4: Development & Integration - Build the System

The model is often the smallest component of a production generative AI system. Integration brings everything together.

How I Built Each Feature

AI Tutor (Socratic Engine)

Clear separation between cloud and device:

Brain Dump & Assignment Extraction

Two input modes, same extraction pipeline:

Input (natural language OR uploaded document)
    → Lambda (multimodal LLM handles both text and document parsing)
    → Schema validation (reject if non-conformant)
    → DynamoDB update
    → Return structured tasks

No separate OCR step—the multimodal model handles document parsing directly. No RAG, no complex orchestration.

Professor Fix Package & Tutor Visualization Engine

Two distinct pipelines for different content types:

Fix Packages (static educational content):

Professor selects misconception topic
    → Lambda generates explanation + examples
    → Professor reviews and approves
    → Package uploaded to S3
    → Students access via pre-signed URLs

Tutor Visualizations (interactive diagrams):

Student requests visualization during tutoring
    → Lambda generates HTML/CSS/JS sandbox
    → VisualizationService renders via custom protocol (flowspace-viz://)
    → Interactive content runs securely in isolated frame

Fix Packages live in S3 with pre-signed URL access. Visualizations render on-demand via a custom Electron protocol to bypass CSP restrictions.

Integration Summary

| Feature | Architecture Pattern | Key Component |
| --- | --- | --- |
| AI Tutor | Facade + services | On-device persistence + cache |
| Brain Dump | Minimal pipeline | Schema validation |
| Fix Package | Content generation | S3 + pre-signed URLs |
| Visualization | On-demand rendering | Custom protocol sandbox |

(Planned) Phase 5: Deployment - Ship with Confidence

Deployment for generative AI has unique challenges: external API dependencies, unpredictable scaling, and cost control.

Lambda Scaling

Lambda handles scaling automatically, but AI workloads have quirks:

AI API Unavailability

External AI APIs go down. Gemini has had outages. My mitigation strategy:

Cost Control

Deployment Summary

| Concern | Strategy |
| --- | --- |
| Lambda scaling | Auto-scale, 60s timeout, monitor cold starts |
| API unavailability | Graceful degradation, retry, no auto-failover |
| Cost control | Rate limits, CloudWatch alarms, budget caps |
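The retry and graceful-degradation strategy for API unavailability could be sketched as below. This deployment phase is still planned, so treat the shape as an assumption: `UpstreamUnavailable`, the attempt count, and the backoff delays are hypothetical, and the fallback returns a degraded result rather than switching to another provider (no auto-failover).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class UpstreamUnavailable(Exception):
    """Hypothetical error raised when the external AI API cannot be reached."""


def call_with_retry(
    call: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry with exponential backoff, then degrade gracefully."""
    for attempt in range(attempts):
        try:
            return call()
        except UpstreamUnavailable:
            if attempt < attempts - 1:  # no point sleeping after the last try
                sleep(base_delay * 2 ** attempt)
    return fallback()


def always_down() -> str:
    raise UpstreamUnavailable("simulated outage")


# Example with a no-op sleep so the demo does not actually wait.
message = call_with_retry(
    always_down,
    fallback=lambda: "The tutor is temporarily unavailable. Please try again soon.",
    sleep=lambda seconds: None,
)
```

The total backoff has to fit inside the Lambda timeout, so the attempt count and delays should be chosen with the 60s limit in mind.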

(Planned) Phase 6: Continuous Improvement - The Work Never Ends

Deployment is the starting point for learning from real users.

The Reality: Limited Visibility

I can’t see individual conversations between students and the AI tutor—nor should I, for privacy reasons. This limits traditional evaluation approaches like reviewing model outputs or analyzing conversation patterns.

Stage 1: User Feedback First

Initially, improvement will rely on explicit user feedback:

This is enough for an early-stage product. Real users will tell you what’s broken.

Stage 2: Telemetry at Scale

As the user base grows, aggregated telemetry becomes viable:

The key is measuring outcomes (did students improve?) rather than inspecting conversations.
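One way to measure outcomes without reading conversations is to aggregate mastery changes per concept and suppress small groups. The sketch below is illustrative only: the event shape (concept, mastery before, mastery after), the function name, and the minimum cohort size of 5 are all assumptions, not part of the planned system.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable


def mastery_gain_by_concept(
    events: Iterable[tuple[str, float, float]],
    min_students: int = 5,
) -> dict[str, float]:
    """Average mastery change per concept, hiding concepts with few students.

    Each event is (concept, mastery_before, mastery_after) and carries no
    conversation text or user identifier.
    """
    gains: dict[str, list[float]] = defaultdict(list)
    for concept, before, after in events:
        gains[concept].append(after - before)
    return {
        concept: mean(values)
        for concept, values in gains.items()
        if len(values) >= min_students  # suppress small cohorts
    }


events = [("fractions", 0.5, 0.6)] * 5 + [("limits", 0.4, 0.2)] * 2
report = mastery_gain_by_concept(events)
# "limits" is omitted because only two students contributed.
```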


Key Takeaways

Match Solution to Problem

Each feature had different requirements, leading to different choices:

Scoping Prevents Waste

The most important decisions happen before any code is written. Understanding your cost tolerance, failure modes, and success metrics determines everything downstream.

Customization is Mostly Prompts

None of these features required fine-tuning. Prompt engineering, output schemas, and context injection handled all customization needs. Start simple.

Build for Iteration

Every feature includes feedback mechanisms. You can’t improve what you don’t measure, and your first version won’t be your best.


Conclusion

The generative AI lifecycle provides a structured approach to building AI features that deliver real value. By working through scoping, model selection, customization, development, deployment, and continuous improvement for each feature, I made intentional decisions rather than defaulting to “use the biggest model for everything.”

The framework isn’t about finding one perfect approach—it’s about matching your approach to your problem. Different features justify different trade-offs, and that’s exactly how it should be.