December 30, 2025 14 min to read

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

A practical guide to the generative AI lifecycle

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

Building production-ready generative AI applications requires more than just picking a model and writing prompts. It demands a structured approach that considers business value, technical feasibility, and long-term sustainability.

I used the AWS Well-Architected Generative AI Lens as my framework—a six-phase lifecycle that forces you to think before you build.

In this post, I’ll walk through each phase and show how I applied this framework to build four AI features for FlowSpace—an assignment management app for students and educators.

The Four Features

Before diving into the lifecycle, here’s what we’re building:

AI Tutor (Socratic Engine) - Interactive problem-solving with MCQ verification and concept mastery tracking
Brain Dump & Assignment Extraction - Extract actionable tasks from natural language or uploaded documents (syllabi, assignment sheets)
Task Priority Engine - Intelligent task ranking with personalized recommendations
Professor Fix Package & Tutor Visualization Engine - Educational content packages with misconception corrections, and interactive visualizations for tutoring

Each feature has different requirements, leading to different decisions at each phase.

Phase 1: Scoping - Start with the Business Problem

The scoping phase prioritizes understanding the business problem before any technical decisions. I ask: Is generative AI the right solution? What does success look like? What are the real costs?

How I Scoped Each Feature

AI Tutor (Socratic Engine)

The core question: What happens when the model gives a wrong answer?

This isn’t a simple Q&A chatbot. The Socratic Engine guides students through problems step-by-step, generates MCQs to verify understanding after each step, and tracks concept mastery using EMA scoring. It also provides meta-actions: Visualize (explain with diagrams), Breakdown (decompose problems), and Ask (free-form questions).

A wrong response doesn’t just waste time—it teaches incorrect concepts that the mastery system then reinforces. The cost of hallucination (student learning wrong material, eroded trust, academic harm) far exceeds the cost of using a premium model.

Success metric: < 2% factual error rate in step explanations and MCQ generation

Cost tolerance: High—accuracy trumps cost

Brain Dump & Assignment Extraction

The core question: What are the failure modes?

Two critical requirements emerged:

No invented tasks - The model must never hallucinate tasks that weren’t mentioned
No missed tasks - When all generated subtasks are completed, the original assignment must be done

Students might brain dump multiple times daily, so cost matters. But the real constraint was finding a model with low hallucination AND moderate completeness—cheap models that miss tasks or invent phantom tasks both fail the user.

Success metric: Zero invented tasks, > 95% task coverage

Cost tolerance: Low—but not at the expense of reliability

Task Priority Engine

The core question: Do we even need an LLM for this?

Initially, I planned to use an LLM for nuanced prioritization. But during evaluation, I realized: prioritization is fundamentally about applying consistent rules to structured data. The Eisenhower matrix (urgent/important) combined with custom variables (deadline proximity, estimated effort, current progress) could handle this deterministically—faster, cheaper, and more predictable than an LLM.

Success metric: User follows recommendation > 70% of the time

Cost tolerance: Zero—no LLM means no per-request cost

Professor Fix Package & Tutor Visualization Engine

The core question: What content do professors actually need?

Two distinct but related needs emerged:

Fix Packages - Ready-made educational content to address common student misconceptions: examples that correct wrong mental models, clear explanations, and step-by-step corrections
Tutor Visualizations - Interactive diagrams and visual aids that clarify abstract concepts during tutoring sessions

Both require strong code generation (for interactive visualizations) and pedagogical accuracy. The content must be correct—a visualization that reinforces a misconception is worse than no visualization at all.

Success metric: Generated content is pedagogically accurate and visualizations render correctly > 95%

Cost tolerance: Medium—quality over cost, but not unlimited

Scoping Summary

Feature	Primary Risk	Cost Tolerance	Complexity
AI Tutor	Hallucination harms learning	High	Single model
Brain Dump	Invented or missed tasks	Low-Mid	Single model
Priority Engine	Inconsistent recommendations	Zero	No LLM needed
Fix Package/Viz	Pedagogical accuracy matters	Medium	Single model

Phase 2: Model Selection - Match Capabilities to Requirements

There is no ‘silver-bullet’ model—it’s about matching capabilities to your specific constraints.

My Evaluation Process

I considered a wide range of options: AWS Bedrock models (Claude, Titan), OpenAI’s GPT-4o and GPT-5.2, Google’s Gemini family, and even on-device options like Apple’s Foundation Models.

Bedrock was tempting—easy cost management (I’m already on AWS), seamless integration with Lambda, unified billing. But many Bedrock models disappointed on price-performance ratio for my specific use cases.

Selection criteria:

Price - Cost per token, free tiers, volume discounts
Performance - Benchmarks relevant to my use cases (reasoning, extraction, code generation)
Real user reviews - Community feedback on reliability, edge cases, API stability
Hands-on testing - I ran my own evaluations with real prompts before committing

No amount of benchmark reading replaces actually testing with your data.

How I Selected Models

AI Tutor (Socratic Engine)

Requirements drove me toward higher-tier models:

Strong reasoning for step-by-step explanations
Low hallucination rate on factual content
Ability to admit uncertainty rather than fabricate

I evaluated models on math and science tutoring benchmarks, specifically looking at reasoning trace quality, not just final answer accuracy. The model needed to “show its work” correctly.

The choice came down to Claude Opus 4.5 vs Gemini 3.0 Pro. While Opus 4.5’s pricing had dropped, Gemini 3.0 Pro was still more cost-effective. In practice, 3.0 Pro’s reasoning quality exceeded my expectations for tutoring use cases.

Selected: Gemini 3.0 Pro

Brain Dump & Assignment Extraction

This is fundamentally an extraction task, but with a tricky balance: the model must extract all relevant tasks (completeness) without inventing tasks that weren’t there (no hallucination).

Evaluation criteria:

Hallucination rate - Must be near-zero for phantom tasks
Coverage - Must capture all mentioned tasks
Cost - High-frequency feature needs sustainable pricing

I tested multiple models including on-device options like Apple’s Foundation Models. Some cheap models had good coverage but invented tasks. Others never hallucinated but missed obvious items. Finding the right balance took about two months of iteration. (I’ll share the detailed lessons from evaluating Apple’s Foundation Models in a future post.)

Selected: Gemini 2.5 Flash (no thinking mode)

Professor Fix Package & Tutor Visualization Engine

Model longevity was a key concern. I tracked that Gemini 2.0 Flash deprecation was scheduled for February 2026. When 3.0 Pro was released, I knew 3.0 Flash was imminent. Rather than building on a soon-to-deprecate model, I waited.

When Gemini 3.0 Flash launched, I ran early evaluations on:

Pedagogical accuracy in misconception explanations
Interactive visualization code generation (React components)
Structured concept breakdowns

The results were exceptional—better than models I’d used previously at similar cost points, especially for generating educational content with embedded code for visualizations.

Selected: Gemini 3.0 Flash

Model Selection Summary

Feature	Model Selected	Key Selection Criterion
AI Tutor	Gemini 3.0 Pro	Reasoning quality at reasonable cost
Brain Dump	Gemini 2.5 Flash (no thinking)	Low hallucination + high coverage
Fix Package/Viz	Gemini 3.0 Flash	Educational content + visualization quality

Phase 3: Model Customization - Make It Yours

I haven’t done traditional model customization (fine-tuning, RLHF) for these features yet. Instead, I’m building a modular prompt management system inspired by Claude’s Agent Skills.

The Architecture: Feature Lambdas + Bedrock Orchestrator

Each AI feature has its own Lambda endpoint with tailored prompts and functionality. For the Tutor (the most complex feature), a cheap Bedrock model acts as the orchestrator—selecting the most appropriate skill file from S3 based on the request.

Client Request
    → Feature-specific Lambda
    → (Tutor only) Bedrock orchestrator selects skill from S3
    → Assemble prompt with context + instructions
    → Call AI API (Gemini)
    → Validate response against schema
    → Return structured result

Why this approach?

Feature isolation - Each Lambda has focused responsibility, easier to debug and scale independently
Smart routing (Tutor) - Cheap orchestrator model picks the right skill without hardcoded logic
Hot-swappable skills - Update prompts in S3 without redeploying
Cost visibility - Per-feature usage tracking

(For why I chose Lambda over alternatives like AWS Bedrock AgentCore, see my infrastructure decision analysis.)

How I Customized Each Feature

AI Tutor (Socratic Engine)

The Tutor is the most complex feature, so it gets dedicated S3 skill definitions. The orchestrator routes tutoring requests to skills containing:

Socratic questioning patterns - Never give direct answers; guide through questions
MCQ generation rules - Questions must verify understanding, not just recall
Concept extraction schema - Normalized concept names for consistent mastery tracking

The client sends the full conversation context with each request, keeping the Lambda stateless. Session state (current step, hint level, mastery scores) is managed on-device.

Guardrails (AWS best practice): I’m implementing guardrails to enforce academic integrity:

No direct answers - Reject responses that give solutions without Socratic guidance
Content filtering - Block harmful, inappropriate, or off-topic content
Fallback behavior - When guardrails trigger, return a safe response (“Let’s approach this differently”) rather than failing silently

Brain Dump & Assignment Extraction

Strict schema enforcement was the primary customization:

Extract ONLY tasks explicitly mentioned or clearly implied.
Never invent tasks the user didn't reference.
If text is ambiguous, extract nothing rather than guess.

The prompt includes few-shot examples showing correct extraction (including examples of what NOT to extract). Output schema validation rejects responses that don’t conform.

Professor Fix Package & Tutor Visualization Engine

Prompt templates optimized for educational content generation:

Misconception identification with correct mental model explanations
Visualization specifications (what concept to illustrate, interaction patterns)
Structured concept breakdowns with progressive complexity
Output format for both explanatory text and embedded React visualization code

The key innovation: each “fix package” combines narrative explanation with interactive visualization code, generated together to ensure conceptual alignment.

Customization Summary

Feature	Primary Technique	Core Focus
AI Tutor	S3 skill injection	Socratic patterns + MCQ generation
Brain Dump	Schema enforcement	Strict extraction-only rules
Fix Package/Viz	Template prompts	Paired explanation + visualization

(In Progress) Phase 4: Development & Integration - Build the System

The model is often the smallest component of a production generative AI system. Integration brings everything together.

How I Built Each Feature

AI Tutor (Socratic Engine)

Clear separation between cloud and device:

Lambda → AI inference only (Socratic responses, MCQ generation)
On-device → Session state, mastery scores, history (SQLite)
Visualizations → Custom Electron protocol (flowspace-viz://) for secure sandbox rendering

Brain Dump & Assignment Extraction

Two input modes, same extraction pipeline:

Input (natural language OR uploaded document)
    → Lambda (multimodal LLM handles both text and document parsing)
    → Schema validation (reject if non-conformant)
    → DynamoDB update
    → Return structured tasks

No separate OCR step—the multimodal model handles document parsing directly. No RAG, no complex orchestration.

Professor Fix Package & Tutor Visualization Engine

Two distinct pipelines for different content types:

Fix Packages (static educational content):

Professor selects misconception topic
    → Lambda generates explanation + examples
    → Professor reviews and approves
    → Package uploaded to S3
    → Students access via pre-signed URLs

Tutor Visualizations (interactive diagrams):

Student requests visualization during tutoring
    → Lambda generates HTML/CSS/JS sandbox
    → VisualizationService renders via custom protocol (flowspace-viz://)
    → Interactive content runs securely in isolated frame

Fix Packages live in S3 with pre-signed URL access. Visualizations render on-demand via a custom Electron protocol to bypass CSP restrictions.

Integration Summary

Feature	Architecture Pattern	Key Component
AI Tutor	Facade + services	On-device persistence + cache
Brain Dump	Minimal pipeline	Schema validation
Fix Package	Content generation	S3 + pre-signed URLs
Visualization	On-demand rendering	Custom protocol sandbox

(Planned) Phase 5: Deployment - Ship with Confidence

Deployment for generative AI has unique challenges: external API dependencies, unpredictable scaling, and cost control.

Lambda Scaling

Lambda handles scaling automatically, but AI workloads have quirks:

Cold starts - First request after idle period adds latency. For tutoring, this is acceptable (users expect some thinking time). For Brain Dump, I’m considering provisioned concurrency if it becomes an issue.
Timeout configuration - AI inference can take 10-30 seconds. API Gateway has a default 29-second timeout, but you can request AWS to increase it to 5 minutes. Much simpler than implementing SQS-based async patterns for long-running requests.
Memory allocation - More memory = faster CPU. I’m starting at 512MB and will adjust based on actual performance data.

AI API Unavailability

External AI APIs go down. Gemini has had outages. My mitigation strategy:

Graceful degradation - Show cached/simpler responses rather than error screens. For Tutor: “I’m having trouble thinking right now. Try again in a moment.”
Retry with backoff - 3 retries with exponential backoff before giving up
No automatic failover to different models - Different models behave differently. I’d rather show an error than give inconsistent responses from a backup model I haven’t tested thoroughly.

Cost Control

Per-user rate limits - Prevent runaway costs from heavy users or abuse
CloudWatch alarms - Alert when daily spend exceeds thresholds
Budget caps - AWS Budgets with hard limits as a safety net

Deployment Summary

Concern	Strategy
Lambda scaling	Auto-scale, 60s timeout, monitor cold starts
API unavailability	Graceful degradation, retry, no auto-failover
Cost control	Rate limits, CloudWatch alarms, budget caps

(Planned) Phase 6: Continuous Improvement - The Work Never Ends

Deployment is the starting point for learning from real users.

The Reality: Limited Visibility

I can’t see individual conversations between students and the AI tutor—nor should I, for privacy reasons. This limits traditional evaluation approaches like reviewing model outputs or analyzing conversation patterns.

Phase 1: User Feedback First

Initially, improvement will rely on explicit user feedback:

Bug reports and feature requests
Direct communication with early adopters

This is enough for an early-stage product. Real users will tell you what’s broken.

Phase 2: Telemetry at Scale

As the user base grows, aggregated telemetry becomes viable:

Reuse rates - Are students returning to the tutor? Which features get repeated use?
Rating systems - Star ratings for generated content, professor approval rates for Fix Packages
Cost trends - Per-feature usage patterns to evaluate model upgrades

The key is measuring outcomes (did students improve?) rather than inspecting conversations.

Key Takeaways

Match Solution to Problem

Each feature had different requirements, leading to different choices:

High-stakes accuracy → premium models (Gemini 3.0 Pro)
Balanced extraction (no hallucination + high coverage) → mid-tier models with careful evaluation (Gemini 2.5 Flash)
Educational content generation → evaluate new models early (Gemini 3.0 Flash)

Scoping Prevents Waste

The most important decisions happen before any code is written. Understanding your cost tolerance, failure modes, and success metrics determines everything downstream.

Customization is Mostly Prompts

None of these features required fine-tuning. Prompt engineering, output schemas, and context injection handled all customization needs. Start simple.

Build for Iteration

Every feature includes feedback mechanisms. You can’t improve what you don’t measure, and your first version won’t be your best.

Conclusion

The generative AI lifecycle provides a structured approach to building AI features that deliver real value. By working through scoping, model selection, customization, development, deployment, and continuous improvement for each feature, I made intentional decisions rather than defaulting to “use the biggest model for everything.”

The framework isn’t about finding one perfect approach—it’s about matching your approach to your problem. Different features justify different trade-offs, and that’s exactly how it should be.

Sunho's Dev Blog v3.1.2

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

The Four Features

Phase 1: Scoping - Start with the Business Problem

How I Scoped Each Feature

Scoping Summary

Phase 2: Model Selection - Match Capabilities to Requirements

My Evaluation Process

How I Selected Models

Model Selection Summary

Phase 3: Model Customization - Make It Yours

The Architecture: Feature Lambdas + Bedrock Orchestrator

How I Customized Each Feature

Customization Summary

(In Progress) Phase 4: Development & Integration - Build the System

How I Built Each Feature

Integration Summary

(Planned) Phase 5: Deployment - Ship with Confidence

Lambda Scaling

AI API Unavailability

Cost Control

Deployment Summary

(Planned) Phase 6: Continuous Improvement - The Work Never Ends

The Reality: Limited Visibility

Phase 1: User Feedback First

Phase 2: Telemetry at Scale

Key Takeaways

Match Solution to Problem

Scoping Prevents Waste

Customization is Mostly Prompts

Build for Iteration

Conclusion

Why I Ditched Apple's Foundation Model

Sunho

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

No Fine-Tuning, No Problem: Building 4 AI Features with AWS Well-Architected

The Four Features

Phase 1: Scoping - Start with the Business Problem

How I Scoped Each Feature

Scoping Summary

Phase 2: Model Selection - Match Capabilities to Requirements

My Evaluation Process

How I Selected Models

Model Selection Summary

Phase 3: Model Customization - Make It Yours

The Architecture: Feature Lambdas + Bedrock Orchestrator

How I Customized Each Feature

Customization Summary

(In Progress) Phase 4: Development & Integration - Build the System

How I Built Each Feature

Integration Summary

(Planned) Phase 5: Deployment - Ship with Confidence

Lambda Scaling

AI API Unavailability

Cost Control

Deployment Summary

(Planned) Phase 6: Continuous Improvement - The Work Never Ends

The Reality: Limited Visibility

Phase 1: User Feedback First

Phase 2: Telemetry at Scale

Key Takeaways

Match Solution to Problem

Scoping Prevents Waste

Customization is Mostly Prompts

Build for Iteration

Conclusion

Why I Ditched Apple's Foundation Model

Share

Sunho