Why AI Agents Build Mediocre UI — And How To Fix It

Research report for SecurityV0 team — April 2026

1. Executive Summary

The core insight: LLMs produce mediocre UI because they optimize for correctness (does the component render the right data?) rather than communication (does the page tell a story the user can scan in 3 seconds?). Good UX is fundamentally about editorial decisions: what to emphasize, what to hide, and in what order to present things. LLMs default to "show everything, equally weighted" because that is the safest, most literally correct interpretation of any prompt.

The fix isn't better prompts alone. It's constraining the design space so that the "default" output is already good, and adding a visual feedback loop so agents can see (literally) what they've built.

2. Root Cause Analysis — Why AI Agents Produce Mediocre UX

2.1 Training data is code-heavy, design-light

LLMs are trained on vastly more source code than design rationale. They've seen millions of instances of <div className="flex gap-4"> but almost no explanations like "I chose a 24px gap here because the items are related but distinct, and the surrounding section uses a 48px gap to create visual separation." The what is in the training data; the why is not.

2.2 No visual feedback loop

A human developer renders the page, squints at it, and adjusts. By default, an LLM generates code blind and never sees the rendered output. This is the single biggest gap: you can't do visual design without seeing the result. An experienced developer's "feel" comes from thousands of render-adjust cycles; an LLM has had none.

2.3 Completeness bias

LLMs are trained to be helpful and thorough. When a component has 8 data fields, the default behavior is to render all 8 with equal prominence. A designer would ask "what does the user need to decide here?" and show 3 fields prominently, 2 on hover, and hide 3 behind a detail view. LLMs don't make editorial cuts because omitting data feels like a mistake.

2.4 Flat token-level reasoning vs. spatial reasoning

UI design is inherently 2D-spatial. LLMs reason in 1D token sequences. They can't "see" that a page has too much content above the fold, or that two sections visually compete for attention, or that the eye has no resting point. They reason about DOM structure, not visual weight.

2.5 Missing user model

Good UX requires a model of the user: What are they trying to accomplish? What's their mental model? What have they already seen? LLMs don't maintain this unless explicitly told. They build for an abstract "user" who reads every word linearly — which no real user does.

2.6 Averaging over training data

LLMs tend to produce output close to the statistical average of what they've seen, and the average website is mediocre. Great design is opinionated and specific. LLMs trend toward the mean: medium spacing, medium font sizes, medium everything. The result is technically acceptable but has no point of view.

2.7 Microcopy is undertrained

Button labels, empty states, error messages, and tooltips are rarely the focus of code examples in training data. Labels tend to be technically accurate ("Submit Form") rather than contextually human ("Save changes" or "Get started"). The difference is subtle but compounds across an entire UI.

3. Constraints That Work — Design Systems, Tokens, Templates

3.1 Semantic design tokens (not raw Tailwind)

The problem: Tailwind gives you text-sm, text-base, text-lg, etc. An agent can pick any of them. There's no signal about when to use which.

The fix: Create semantic tokens that encode hierarchy:

// tokens.ts — semantic typography
export const typography = {
  pageTitle: "text-2xl font-semibold tracking-tight text-gray-900",
  sectionTitle: "text-lg font-medium text-gray-900",
  cardTitle: "text-base font-medium text-gray-900",
  label: "text-sm font-medium text-gray-700",
  body: "text-sm text-gray-600",
  caption: "text-xs text-gray-500",
  metric: "text-3xl font-bold tabular-nums text-gray-900",
  metricLabel: "text-xs font-medium uppercase tracking-wide text-gray-500",
} as const;

// tokens.ts — semantic spacing
export const spacing = {
  pagePadding: "px-6 py-8",        // outer page
  sectionGap: "space-y-8",          // between page sections
  cardGap: "gap-4",                 // between cards in a grid (grid/flex containers need gap-*, not space-y-*)
  fieldGap: "space-y-3",            // between form fields
  inlineGap: "gap-2",               // between inline elements
  tightGap: "gap-1",                // between icon and label
} as const;

When an agent uses typography.sectionTitle instead of choosing raw Tailwind classes, the hierarchy is enforced automatically. The agent doesn't need taste — the token system has taste built in.
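
One way to make the token system hard to bypass is to expose it through a typed helper, so an unknown token name fails at compile time. The sketch below is a minimal illustration, not part of the existing codebase: the helper name `text` and its `extra` parameter are assumptions, and the token map is a subset repeated only to keep the example self-contained.

```typescript
// Subset of the typography tokens, repeated so the sketch is self-contained.
const typography = {
  pageTitle: "text-2xl font-semibold tracking-tight text-gray-900",
  sectionTitle: "text-lg font-medium text-gray-900",
  body: "text-sm text-gray-600",
} as const;

// Only names that exist in the token map are valid.
type TypographyToken = keyof typeof typography;

// Hypothetical helper: resolves a token name to its class string and
// appends optional layout-only classes (for example "mt-2").
function text(token: TypographyToken, extra: string = ""): string {
  return [typography[token], extra].filter(Boolean).join(" ");
}

const heading = text("sectionTitle");
const spaced = text("body", "mt-2");
// text("hugeTitle") would not compile: "hugeTitle" is not a TypographyToken.
```

With a helper like this, the agent's choice is reduced to a token name; the class strings live in exactly one place.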

3.2 Layout primitives that enforce reading order

Create page-level layout components that enforce structure:

// PageLayout enforces: title → description → primary content → secondary content
<PageLayout
  title="Vulnerabilities"
  description="Critical issues requiring immediate attention"
  actions={<Button variant="outline">Export</Button>}
  primary={<VulnerabilityTable />}
  sidebar={<FilterPanel />}
/>

The layout primitive decides spacing, max-widths, and responsive behavior. The agent just fills slots. This is the most impactful single change because it takes most layout decisions (roughly 80%, as an estimate) out of the agent's hands.
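
To show what "the agent just fills slots" means mechanically, here is a framework-free sketch that renders the slots to an HTML string in a fixed order. The real PageLayout would be a React component; the function name `renderPageLayout`, the slot types, and the class strings are assumptions for illustration only, and the sketch does no HTML escaping.

```typescript
// Slots the caller may fill. Order, spacing, and widths are fixed by the layout.
interface PageLayoutSlots {
  title: string;
  description?: string;
  actions?: string;  // pre-rendered markup for page-level buttons
  primary: string;   // pre-rendered markup for the main content
  sidebar?: string;  // pre-rendered markup for secondary content
}

// Hypothetical renderer. It assumes trusted strings and does not escape them.
function renderPageLayout(slots: PageLayoutSlots): string {
  const header = [
    `<h1 class="text-2xl font-semibold tracking-tight text-gray-900">${slots.title}</h1>`,
    slots.description ? `<p class="text-sm text-gray-600">${slots.description}</p>` : "",
    slots.actions ? `<div class="flex gap-2">${slots.actions}</div>` : "",
  ].filter(Boolean).join("");

  const body = [
    `<main class="min-w-0 flex-1">${slots.primary}</main>`,
    slots.sidebar ? `<aside class="w-72 shrink-0">${slots.sidebar}</aside>` : "",
  ].filter(Boolean).join("");

  return `<div class="px-6 py-8 space-y-8"><header>${header}</header><div class="flex gap-8">${body}</div></div>`;
}

// The caller can pass slots in any order; the output order never changes.
const page = renderPageLayout({
  sidebar: "<nav>Filters</nav>",
  primary: "<table></table>",
  title: "Vulnerabilities",
});
```

The point of the design is that there is no prop for spacing or ordering, so there is nothing for the agent to get wrong.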

3.3 Component variants with built-in hierarchy

Don't just have <Card>. Have:

<Card variant="hero" />     {/* large, prominent, for the primary insight */}
<Card variant="default" />  {/* standard card */}
<Card variant="compact" />  {/* dense, for secondary info */}
<Card variant="subtle" />   {/* minimal border, background tint, for tertiary info */}

Each variant has pre-set padding, typography, and visual weight. The agent's job becomes choosing the right variant, not tweaking individual styles.
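
A minimal sketch of how the variants might be encoded, assuming a plain lookup table. The class strings are placeholders rather than values from the design system, and `cardClasses` is a hypothetical helper name.

```typescript
type CardVariant = "hero" | "default" | "compact" | "subtle";

// Placeholder class strings; the real values belong to the design system.
// Record<CardVariant, string> forces every variant to be defined.
const cardVariants: Record<CardVariant, string> = {
  hero: "p-8 rounded-xl border shadow-sm",
  default: "p-6 rounded-lg border",
  compact: "p-3 rounded-md border",
  subtle: "p-4 rounded-md border border-gray-100 bg-gray-50",
};

// Hypothetical helper: falls back to the standard card when no variant is given.
function cardClasses(variant: CardVariant = "default"): string {
  return cardVariants[variant];
}
```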

3.4 A page template library

For a security platform, you probably have ~5-7 page archetypes:

Dashboard — metrics up top, activity feed, quick actions
List/Table — filters, data table, bulk actions
Detail — header with status, tabbed sections, related items
Settings — grouped form sections
Empty/Onboarding — illustration, explanation, CTA
Wizard/Flow — multi-step with progress

Create a template for each. When an agent builds a new page, it starts from the closest template, not from scratch. This is how design teams work with Figma templates, and agents should work the same way.

3.5 Content density rules

Encode explicit rules:

Dashboard cards: Max 1 metric + 1 label + 1 trend indicator per card
Table rows: Max 6 visible columns; additional columns via column picker
Detail pages: Max 3 sections above the fold; remaining in tabs
Forms: Max 5 fields visible at once; group with accordions

These rules prevent the "data dump" problem directly.
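
Rules like these are easiest to enforce when they exist as data rather than prose. The sketch below encodes three of the limits above and checks a page summary against them. The `PageSummary` shape and the `checkDensity` name are assumptions; how the summary is produced (lint step, agent self-report) is left open.

```typescript
// Limits copied from the content density rules.
const densityRules = {
  maxTableColumns: 6,
  maxSectionsAboveFold: 3,
  maxVisibleFormFields: 5,
} as const;

// Hypothetical summary of a page, for example produced by a lint step.
interface PageSummary {
  visibleTableColumns?: number;
  sectionsAboveFold?: number;
  visibleFormFields?: number;
}

// Returns one human-readable message per violated rule.
function checkDensity(page: PageSummary): string[] {
  const violations: string[] = [];
  if ((page.visibleTableColumns ?? 0) > densityRules.maxTableColumns) {
    violations.push(
      `Table shows ${page.visibleTableColumns} columns (max ${densityRules.maxTableColumns}); move extras to a column picker.`
    );
  }
  if ((page.sectionsAboveFold ?? 0) > densityRules.maxSectionsAboveFold) {
    violations.push(
      `Page has ${page.sectionsAboveFold} sections above the fold (max ${densityRules.maxSectionsAboveFold}); move the rest into tabs.`
    );
  }
  if ((page.visibleFormFields ?? 0) > densityRules.maxVisibleFormFields) {
    violations.push(
      `Form shows ${page.visibleFormFields} fields at once (max ${densityRules.maxVisibleFormFields}); group with accordions.`
    );
  }
  return violations;
}
```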

4. Skills and Prompts — What Context to Give AI Agents

4.1 A UX System Prompt (for inclusion in agent context)

## UX Principles — Apply to ALL UI work

### Hierarchy First
Every page has ONE primary message. Before writing any JSX, state:
1. What is the user trying to DO on this page?
2. What is the FIRST thing they should see?
3. What can be HIDDEN until needed?

### The 3-Second Rule
If a user glances at this page for 3 seconds, they should understand:
- Where they are
- What the main content is
- What action to take (if any)

### Progressive Disclosure
Default to LESS. Show summary first, details on demand.
- Tables: 5-6 columns max visible. Offer column picker.
- Cards: metric + label + trend. Details on click.
- Forms: group related fields. Use accordions for advanced options.
- Lists: show top 5-10 items. "Show all" link for the rest.

### Visual Hierarchy Checklist
Before finalizing any component:
- [ ] Is there exactly ONE dominant element? (largest text, brightest color)
- [ ] Do related items share proximity and styling?
- [ ] Is there whitespace separating distinct groups?
- [ ] Does the eye flow top→down or left→right naturally?
- [ ] Are secondary actions visually quieter than primary actions?

### Microcopy Rules
- Button labels: verb + noun ("Export report", "Add filter")
- Use sentence case, not Title Case
- Empty states: explain why + what to do ("No vulnerabilities found. Import a scan to get started.")
- Error messages: what happened + how to fix it
- Prefer natural language over technical jargon in user-facing text

### Spacing
Use semantic spacing tokens. When in doubt, add MORE whitespace, not less.
Pages should breathe. Dense UIs are for power users with explicit opt-in.

### Don't
- Don't render all available data fields. Choose the most important ones.
- Don't use the same visual weight for all sections.
- Don't put two competing CTAs at the same prominence level.
- Don't create walls of text. Use bullet points, spacing, and hierarchy.
- Don't forget empty states, loading states, and error states.

4.2 Design Critique Prompt (run as a review step)

After generating UI code, run this as a second pass:

Review this UI component/page for UX quality. Check:

**Squint test:** If you blur this page, do only 1-2 elements dominate visually?
**Information hierarchy:** Is there a clear primary → secondary → tertiary flow?
**Data density:** Is anything shown that could be hidden behind progressive disclosure?
**Reading flow:** Does it follow F-pattern (content) or Z-pattern (landing)?
**Whitespace:** Are there cramped areas? Does the page breathe?
**Microcopy:** Are labels human-natural or robotic?
**Empty/loading/error states:** Are they handled gracefully?
**Action clarity:** Is the primary action obvious? Are destructive actions guarded?

For each issue found, provide a specific fix with code.

4.3 Reference screenshots

One of the most effective pieces of context you can give an agent is a screenshot of a page that "feels right." Vision-capable models can often pick up design patterns from a screenshot that are hard to convey in a text description.

Practical approach:

Curate 5-10 screenshots of well-designed security/SaaS dashboards (Linear, Vercel, Datadog, Tailscale, 1Password)
Store them in the repo under docs/design-references/
When assigning a UI task, include: "Reference the visual style and information density of [reference].png"

4.4 A "UX cheatsheet" per page type

Instead of generic UX guidance, create specific cheatsheets:

## Dashboard Page Cheatsheet
- Top row: 3-4 metric cards (KPI → trend → sparkline)
- Metrics use `typography.metric` for the number, `typography.metricLabel` for the label
- Below metrics: 1 primary content area (table or chart)
- Right sidebar or bottom: secondary content (activity log, quick actions)
- Page actions (Export, Settings) in top-right, `variant="outline"`
- No more than 3 distinct visual "zones" above the fold

5. Workflow Changes — How to Restructure the Development Process

5.1 Add a UX review step (before code review)

Current flow: Task → Code → PR → Code Review → Merge

Proposed flow: Task → Code → Screenshot → UX Review → Fix → PR → Code Review → Merge

The UX review can be automated (see 5.3) or done by a human. The key insight: catching UX issues after a PR has already gone through code review is expensive. Catching them before PR creation is cheap.

5.2 Generate wireframes before code (for complex pages)

For new pages or major redesigns, have the agent describe the layout in structured text before writing JSX:

Page: Vulnerability Detail
Layout: Detail page template

Zone 1 (Header):
  - Severity badge (left) | Title (center-left) | Status dropdown (right)
  - Subtitle: asset name + first detected date

Zone 2 (Tabs):
  - Overview (default) | Affected Assets | Remediation | History

Zone 3 (Overview tab):
  - Left column (2/3): Description, Evidence, CVSS breakdown
  - Right column (1/3): Quick facts card (score, vector, CWE, references)

This takes the agent roughly 30 seconds to generate and a human about 10 seconds to approve or adjust. It keeps the agent from building the wrong structure in the first place.
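
If the wireframe is kept as data rather than free text, the same spec can be printed for human approval and later checked by tooling. A minimal sketch, assuming hypothetical `Zone` and `Wireframe` types and a `formatWireframe` helper that prints the text format shown above:

```typescript
interface Zone {
  name: string;
  items: string[];
}

interface Wireframe {
  page: string;
  template: string;
  zones: Zone[];
}

// Renders the spec in the plain-text wireframe format.
function formatWireframe(w: Wireframe): string {
  const lines: string[] = [`Page: ${w.page}`, `Layout: ${w.template}`];
  w.zones.forEach((zone, i) => {
    lines.push("", `Zone ${i + 1} (${zone.name}):`);
    for (const item of zone.items) {
      lines.push(`  - ${item}`);
    }
  });
  return lines.join("\n");
}

// Shortened version of the Vulnerability Detail example.
const spec: Wireframe = {
  page: "Vulnerability Detail",
  template: "Detail page template",
  zones: [
    { name: "Header", items: ["Severity badge (left) | Title (center-left) | Status dropdown (right)"] },
    { name: "Tabs", items: ["Overview (default) | Affected Assets | Remediation | History"] },
  ],
};
```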

5.3 Screenshot-based feedback loop (vision model review)

This is the highest-ROI workflow change:

Agent generates UI code
Agent (or CI) renders the page in a headless browser and takes a screenshot
A vision model evaluates the screenshot against UX criteria
Issues are fed back to the coding agent for fixes
Repeat until the vision model passes the page

Implementation with OpenClaw:

Use the browser tool to navigate to the dev server and take screenshots
Feed screenshots to a vision model with the design critique prompt
The agent sees its own output and can self-correct

This closes the visual feedback loop that LLMs fundamentally lack. It's the closest thing to giving an LLM "eyes."
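
The control flow of that loop is small enough to sketch. In the example below, all three steps are injected as functions because the real ones (code generation, headless rendering, vision review) depend on tooling this report does not specify; every name is hypothetical. Two simplifications are assumptions of the sketch, not part of the proposal: the steps are synchronous for brevity, and the loop is capped at `maxIterations` so it cannot run forever when the reviewer never passes the page.

```typescript
interface Critique {
  passed: boolean;
  issues: string[];
}

// Injected steps; real implementations would be async and tool-backed.
interface LoopSteps {
  generate: (feedback: string[]) => string;      // returns UI code
  screenshot: (code: string) => string;          // returns a screenshot reference
  critique: (screenshotRef: string) => Critique; // vision-model review
}

interface LoopResult {
  code: string;
  iterations: number;
  passed: boolean;
}

function runFeedbackLoop(steps: LoopSteps, maxIterations: number = 3): LoopResult {
  let feedback: string[] = [];
  let code = "";
  for (let i = 1; i <= maxIterations; i++) {
    code = steps.generate(feedback);
    const review = steps.critique(steps.screenshot(code));
    if (review.passed) {
      return { code, iterations: i, passed: true };
    }
    // Feed the reviewer's issues into the next generation attempt.
    feedback = review.issues;
  }
  return { code, iterations: maxIterations, passed: false };
}
```

When the cap is reached without a pass, the last attempt is returned with `passed: false`, which is a natural point to hand off to a human.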

5.4 Dedicated UX Review Agent

Create a specialized agent (or sub-agent step) that:

Takes a PR with UI changes
Checks out the branch, runs the dev server
Screenshots every changed page/component
Evaluates against the UX checklist
Posts review comments with specific fixes

This agent doesn't write code; it only reviews. It's like having a designer on the team who does design QA.

5.5 How professional design teams work with AI

The pattern emerging in professional teams:

Designer creates high-fidelity mockups in Figma
AI agent implements the mockup (using vision input or Figma → code tools)
Visual diff catches deviations from the mockup
Designer reviews the implementation

For teams without a designer (like SecurityV0), the substitute is:

Reference designs + page templates replace the mockup step
Vision model review replaces the designer review step
Strict design tokens replace the design system the designer would maintain

6. Concrete OpenClaw Skill Proposal — The "UX Skill"

Skill: `ux-review`

# UX Review Skill

## Trigger
Automatically invoked when a PR contains changes to `.tsx` files in `src/components/` or `src/pages/`.

## Process

### Step 1: Identify changed pages
Parse the PR diff to find which pages/components changed.

### Step 2: Screenshot
Start the dev server once (`npm run dev`). Then, for each changed page:
1. Use the `browser` tool to navigate to the page
2. Take a full-page screenshot
3. Take a viewport-only screenshot (1440x900)

### Step 3: Visual analysis
Send each screenshot to a vision model with this prompt:

"Analyze this UI screenshot for a B2B security platform. Evaluate:
1. Information hierarchy (is there a clear visual priority?)
2. Whitespace and breathing room
3. Typography scale (do heading levels create clear hierarchy?)
4. Content density (is it overwhelming or well-paced?)
5. Visual grouping (are related items proximate?)
6. Action clarity (can you identify the primary CTA?)
7. Overall 'feel' — does this look professional and polished?

Rate each 1-5. For anything rated 3 or below, provide a specific fix."

### Step 4: Automated checks
Run static analysis on the code:
- [ ] Uses semantic tokens (not raw Tailwind for typography/spacing)
- [ ] Page uses a layout template
- [ ] Cards have appropriate variant props
- [ ] No more than 6 table columns without column picker
- [ ] Empty states defined for data-dependent components
- [ ] Loading states defined
- [ ] Primary actions use `variant="default"`, secondary use `variant="outline"`

### Step 5: Report
Post a review comment on the PR with:
- Screenshot of each changed page
- Vision model assessment
- Static analysis results
- Specific fix suggestions with code snippets
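
The static checks in Step 4 could start as simple pattern matching. The sketch below covers only the first check, flagging raw Tailwind font-size utilities that should come from typography tokens instead. The function name `findRawTextSizes` and the `Finding` shape are assumptions, the regex is a heuristic rather than a full parser, and the tokens file itself would need to be excluded since it legitimately contains these classes.

```typescript
// Raw Tailwind font-size utilities (text-xs through text-9xl).
// Color utilities such as text-gray-900 are deliberately not matched.
const RAW_TEXT_SIZE = /\btext-(xs|sm|base|lg|xl|[2-9]xl)\b/g;

interface Finding {
  line: number;  // 1-based line number in the source
  match: string; // the offending class name
}

function findRawTextSizes(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    const matches = text.match(RAW_TEXT_SIZE) ?? [];
    for (const match of matches) {
      findings.push({ line: index + 1, match });
    }
  });
  return findings;
}
```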

Skill: `ux-guide`

A context skill that injects UX knowledge into the agent's system prompt when working on UI tasks:

# UX Guide Skill

## Trigger
Loaded into agent context when the task involves UI/frontend work.

## Contents
- UX principles (from Section 4.1 above)
- Semantic token reference
- Page template catalog
- Component variant guide
- Microcopy style guide
- Design reference screenshots (paths)

7. Quick Wins — Things You Can Do THIS WEEK

Win 1: Create semantic typography tokens (2 hours)

Define typography.pageTitle, typography.sectionTitle, etc. in a shared tokens file. Update 2-3 existing pages to use them as proof of concept. This immediately prevents "everything is text-sm" syndrome.

Win 2: Write 5 page templates (3 hours)

Dashboard, List, Detail, Settings, Empty. Each template is a layout component with named slots. Agents fill slots instead of designing layouts from scratch.

Win 3: Add the UX system prompt to agent context (30 minutes)

Copy section 4.1 into AGENTS.md or a skill file. Every UI task now gets UX guidance automatically.

Win 4: Screenshot self-review (1 hour)

Add a step to the agent workflow: after generating UI code, use the browser tool to screenshot the result, then self-critique using the design critique prompt (section 4.2). This is the cheapest way to close the visual feedback loop.

Win 5: Curate 5 reference screenshots (1 hour)

Find 5 well-designed security/SaaS UIs. Save them to the repo. Reference them in UI task prompts. Visual examples are worth 1000 words of design guidance.

Win 6: Define content density rules (30 minutes)

Write explicit rules: "Max N columns in tables", "Max N fields visible in forms", "Max N cards in a dashboard row." Add to the UX system prompt. This directly prevents data dumps.

Win 7: Microcopy style guide (1 hour)

Write 20 examples of bad→good microcopy for your specific domain:

"Submit" → "Save changes"
"No data" → "No vulnerabilities found. Run a scan to get started."
"Error" → "Couldn't load assets. Check your connection and try again."
"Delete" → "Remove permanently"

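Once the bad→good pairs exist, they can double as a lookup that a review step consults. A minimal sketch using the four pairs above; the `suggestMicrocopy` name and the exact-match behavior are assumptions, and a real guide would hold the full set of 20 examples.

```typescript
// Generic labels mapped to domain-specific replacements, taken from the examples above.
const microcopyFixes = new Map<string, string>([
  ["Submit", "Save changes"],
  ["No data", "No vulnerabilities found. Run a scan to get started."],
  ["Error", "Couldn't load assets. Check your connection and try again."],
  ["Delete", "Remove permanently"],
]);

// Returns a suggested replacement for a known generic label, or undefined
// when the label is not in the guide. Matching is exact after trimming.
function suggestMicrocopy(label: string): string | undefined {
  return microcopyFixes.get(label.trim());
}
```
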
Appendix: The Fundamental Insight

The gap between AI-generated UI and human-designed UI is not about code quality. It's about editorial judgment — the ability to decide what matters most, what to show first, what to hide, and how to guide attention.

You can't prompt your way to good editorial judgment (though prompts help). You need to:

Constrain the space so the default output is good (tokens, templates, variants)
Close the feedback loop so the agent can see and self-correct (screenshots + vision)
Encode the rules that experienced designers carry in their heads (UX system prompt)

Do all three and the "many iterations with the CEO" should drop significantly. The remaining iterations will be about product decisions (what to show) rather than design polish (how it looks) — which is where human judgment should be spent anyway.

Next Action

Status: research-complete; input to [[combined-ux-strategy]]
Decision needed from: CTO (Ivan)
See: [[combined-ux-strategy]] for synthesized recommendations and decision options

— Delta (sv0-delta)
