Research: Making AI Coding Agents Produce Better UI/UX Outcomes

1. Executive Summary

The core problem is not that the agents cannot write UI code. It is that they are optimizing for code correctness and local component quality, while good product UI depends on page-level hierarchy, narrative flow, domain language, visual rhythm, and judgment.

That is the "taste gap": LLMs can generate syntactically valid React/Tailwind quickly, but they do not naturally prioritize information the way an experienced designer or product-minded frontend engineer would.

What successful teams do in practice is not "let the model freestyle the UI." They constrain it with:

Design systems and component libraries
Rules/guardrails encoded close to the code
Examples, templates, and reference UIs
Visual review loops using screenshots, Storybook, Chromatic, Playwright, and humans
Workflow separation between generation and critique
Evaluations against explicit criteria, not vibes alone

For your OpenClaw-based agents, the biggest gains will likely come from process changes, not smarter one-shot prompting:

Require a text wireframe / page plan before coding
Create a UX skill that encodes hierarchy, spacing, copywriting, and progressive-disclosure rules
Require rendered screenshots before submission
Add a separate UX critic pass on every UI task/PR
Use page templates, not just reusable components
Review pages against heuristics (scanability, information hierarchy, action clarity, evidence trust)

For a security product specifically, good UX should feel:

Scannable under time pressure
Authoritative, not playful
Evidence-first when showing risk or findings
Clear about urgency and next actions
Trustworthy and accountable in how decisions are presented

The most important operational change: stop treating UI generation as "write components that satisfy the ticket" and start treating it as "design a page experience, then implement it."

2. How teams ship good AI-generated UI today (real examples)

2.1 The common pattern: AI inside a constrained system

The strongest pattern across tools is that teams succeed when AI is used inside a design and review system, not as a replacement for one.

v0 / Vercel pattern

From v0 docs:

v0 positions itself as an AI agent that can create real code and full-stack apps
It explicitly supports creating high-fidelity UIs from wireframes/mockups
It emphasizes templates, design systems, live preview, repo sync, and pull requests
It is used by designers to clone pages from screenshots/Figma and by engineers to scaffold components quickly

Implication: even Vercel’s own framing is not "prompt a beautiful app from nothing." It is:

start from a visual reference, template, or existing system
generate quickly
refine visually
connect to existing code and review flows

That matters for your team: reference inputs and system constraints are first-class, not optional.

2.2 Cursor customer pattern: rules + internal toolkit + standards

Box

Cursor’s Box case study is especially relevant.

Reported outcomes:

85%+ of developers use Cursor daily
30–50% increase in roadmap throughput
React migration completed ~80% faster than expected
large design system migration completed ~90% faster than expected

But the key detail is not speed. It is method:

Box built a standard "AI toolkit" for frontend development using custom Cursor rules
They defined agent guardrails directly in code
They used rules so Cursor could understand exactly how components should be structured

This is highly relevant to OpenClaw skills. The lesson is:

don’t leave taste in a wiki doc
encode it into the agent’s operating context
keep it near the code and near the generation step

Salesforce

Cursor’s Salesforce case study shows a different but useful pattern:

teams measured cycle time, quality, and throughput
adoption started with boring tasks, then expanded as trust grew
quality gains came partly from more generated tests and broader SDLC usage

Implication for UI work:

teams do not adopt agent-generated UI by trusting it immediately
they create trust through measurable guardrails and iterative scope expansion

2.3 Design-system-first orgs win because AI has better constraints

Microsoft’s Fluent design system is a reminder that mature orgs ship coherent UX because they provide:

component standards
content guidance
accessibility resources
cross-platform patterns
design tooling and developer tooling

AI benefits disproportionately from this kind of structure. A human designer can compensate for weak constraints with judgment. An LLM usually cannot.

So the practical takeaway is:

a strong design system is necessary but insufficient
AI also needs page-level composition rules, copy rules, and examples of good hierarchy

2.4 What teams that ship AI UI successfully actually do

Across these examples, the recurring success pattern is:

Use AI to draft, not to decide everything
Anchor output in an existing design language
Encode standards as rules or reusable prompts
Generate from references (wireframes, screenshots, Figma, templates)
Review rendered output visually, not just code
Use CI checks for regressions and accessibility

In other words: good AI-generated UI is usually AI-accelerated systematized UI, not raw model taste.

3. The taste gap — root causes and mitigations

3.1 Root cause: models optimize local validity, not holistic UX

The symptoms you described map well to typical LLM failure modes:

Vertical data dump → the model serializes all requirements into sections
Flat typography → the model knows semantic HTML but lacks judgment about emphasis
Correct but lifeless spacing → it applies utility classes mechanically
Accurate but unnatural labels → it mirrors domain input or ticket language too literally
No progressive disclosure → it fears omission more than overload
No page story → it treats pages as containers, not guided experiences
Components don’t compose → it solves each block independently

This happens because LLMs are usually rewarded for:

completeness
correctness
explicitness
satisfying every requirement mentioned

But good UX often requires:

omission
prioritization
grouping
implied hierarchy
pacing
editorial judgment

3.2 Why "good Tailwind" is not the same as good design

Agents often know:

space-y-6
text-sm text-muted-foreground
rounded-lg border
grid and flex patterns

But good UX depends on:

where visual emphasis should accumulate
what should be visible in the first 5 seconds
what metadata should be hidden or subordinate
which actions deserve primacy
how sections should ladder from summary to evidence to action

Tailwind literacy solves implementation. It does not solve composition.

3.3 Mitigation: encode taste as decision rules, not aspirations

A vague instruction like "make it polished" does not travel well.

A much better approach is to encode taste into explicit rules such as:

Every page must have a primary takeaway above the fold
No more than 3 visually competing sections on initial load
Each page must define: headline, status summary, evidence, recommended action
Metadata should default to secondary styling and/or collapsible containers
Use only one primary action per screen state
Prefer summary first, details on demand
Headings must create a clear size and weight ladder
Avoid walls of equal cards or equal sections

This is the difference between "taste" and "operationalized taste."

3.4 Reference designs and screenshots are one of the strongest fixes

v0 explicitly supports generating from wireframes/mockups/screenshots. That is consistent with a broader reality: models do better when they can imitate visual relationships instead of inventing them from abstract prose.

Practical implications:

Give the agent screenshots of pages whose hierarchy you like
Give before/after examples of improved pages
Keep a small reference set for dashboards, detail pages, tables, risk reviews, and workflows
Ask the model to explain which structural properties it is reusing

Good reference use is not copying aesthetics blindly. It is constraining:

density
grouping
rhythm
heading hierarchy
action placement
disclosure patterns

3.5 Before/after examples are unusually valuable

If you want better page composition, examples of what changed are often better than rules alone.

Why:

they demonstrate what "too flat" looks like
they show how details are demoted without being lost
they expose better wording for headings and labels
they clarify what "scanable" means in practice

For this team, a useful artifact would be a small internal library of:

bad AI page → improved page
with notes on why the improved version works

This can become training data for prompts, review checklists, and future skills.

3.6 Vision models can help, but mostly as critics

Vision-capable models are useful because they can evaluate the rendered result, which is where many UX failures become obvious.

Anthropic’s vision docs reinforce practical constraints:

images work best when clear and legible
image-first prompting often helps
multiple images can be compared in one request

Strong use cases for vision in your workflow:

critique a screenshot of the rendered page
compare current page vs reference screenshot
compare before/after screenshots and explain improvement
detect flat hierarchy, dense blocks, weak CTA emphasis, clutter

Weak use cases:

using a vision model as the sole final arbiter of quality
expecting it to replace human product judgment

Best role: AI as a first-pass critic before human review.

4. Practical prompt patterns that produce better UX

The highest-leverage prompt improvements are not about adding adjectives. They are about forcing the model to make product/design decisions explicitly.

4.1 Prompt pattern: require a page plan before code

Instead of:

Build the findings detail page in React and Tailwind.

Use:

Before writing code, produce a page plan with:

primary user goal

primary takeaway visible in first screen

information hierarchy from most important to least important

what is hidden by default

primary action and secondary actions

section list with one-line purpose for each section Then implement only after the plan is approved or self-checked.

Why it works:

forces prioritization
reduces the tendency to dump all requirements onto the page
creates an artifact a critic agent can review

4.2 Prompt pattern: specify hierarchy rules numerically

Agents respond better to concrete hierarchy constraints than abstract style language.

Example:

Create strong visual hierarchy.

One clear page title

One summary band above the fold

Max 3 primary information groups before scrolling

Metadata must be visually subordinate to interpretation

If everything looks equally important, reduce emphasis until only the top message dominates

For typography:

Use a clear hierarchy:

page title: largest and boldest

section headings: clearly smaller than title but distinctly stronger than body

labels/meta: smaller and muted

avoid using the same font size/weight for heading, value, and explanatory text

4.3 Prompt pattern: write for scanability, not completeness

Example:

Write the page so a security analyst can understand the situation in 5–10 seconds.

lead with outcome, severity, confidence, owner, and next action

prefer short labels and plain language

summarize before listing evidence

do not make the user read every section to understand the issue

This directly addresses your problem that pages have no story.

4.4 Prompt pattern: require progressive disclosure explicitly

Example:

Do not display all details at once. Default to summary-first. Put secondary data in:

collapsible sections

tabs

drawers

"show details" expansions Surface deeper evidence only where it supports a decision.

Without this instruction, many agents assume all available data should be visible.

4.5 Prompt pattern: copy should sound like a human product designer

Example:

Write labels and helper text the way a human would scan them, not the way a schema or backend field would name them. Prefer:

"Needs review" over "Review status: pending analyst disposition"

"Last seen" over "Most recent observation timestamp"

"Why this matters" over "Risk explanation"

This is especially important in security products, where domain language easily becomes stiff or bureaucratic.

4.6 Prompt pattern: include domain-specific UX goals

For your product, prompts should include security-specific constraints such as:

This is a security operations product. The UI should feel:

calm under pressure

evidence-based

trustworthy

accountable

optimized for triage and review

Prioritize:

severity and confidence visibility

ownership clarity

evidence traceability

action history / accountability

minimizing cognitive load under alert fatigue

This is critical. Generic SaaS UI prompts often produce friendly dashboard UIs, not authoritative security workflows.

4.7 Prompt pattern: use positive examples plus anti-patterns

Negative examples are useful when tied to specific failure modes.

Example anti-pattern section in the prompt:

Avoid these common failures:

long vertical stacks of equal cards

headings and metadata with the same visual weight

showing every field just because data exists

tables without summary interpretation

technically correct labels that sound machine-generated

pages where the primary action is unclear

Then pair with positive examples:

Prefer:

summary banner + prioritized sections

one dominant action area

evidence grouped under clear questions

labels that sound like analyst workflow language

In practice, positive examples plus named anti-patterns work better than either alone.

4.8 Prompt pattern: force self-critique before finishing

Example:

Before finalizing, review the UI against this checklist and revise once:

Can a user grasp the page in 5 seconds?

Is the main takeaway obvious above the fold?

Are there more than 3 competing primary sections?

Is metadata visually subordinate?

Is there progressive disclosure?

Do labels sound natural?

Is the next action obvious?

This is cheap and often improves results materially.

5. UX review automation approaches

Automation will not fully solve UX quality, but it can catch a lot of what your current workflow is missing.

5.1 Accessibility and quality tooling: useful but partial proxies

Lighthouse

Lighthouse is an automated tool for performance, accessibility, SEO, and general page quality. It can run in DevTools, CLI, or CI, and Lighthouse CI can prevent regressions.

What it helps with:

accessibility regressions
performance problems
some UX-adjacent quality issues

What it does not solve:

information hierarchy
poor narrative structure
awkward labels
weak visual rhythm

Use it as a floor, not a substitute for UX review.

axe / Deque

Axe helps teams automate accessibility testing and integrate checks into development workflows.

What it helps with:

catching many accessibility issues early
embedding checks in IDE/build/test pipelines

Again, accessibility is necessary, but not sufficient for good UX.

5.2 Visual regression testing is essential for AI-generated UI

Storybook + Chromatic

Storybook visual tests and Chromatic are especially relevant because they create baselines for every story and detect UI regressions automatically.

Why this matters for AI agents:

AI can unintentionally alter spacing, hierarchy, state styling, and interaction affordances
visual diffing catches regressions that code review misses
Chromatic also supports explicit sign-off and shared UI context, which is valuable for multi-agent systems

Chromatic explicitly frames itself as enforcing UI standards even when AI writes code.

This is a strong fit for your setup if you have Storybook or can add it for critical components/pages.

Playwright snapshots

Playwright’s toHaveScreenshot() provides page- and component-level visual comparison.

Best use here:

page-level screenshots for top workflows
golden snapshots for key states: empty, normal, high severity, overloaded, error
same environment for stable rendering

5.3 AI-powered screenshot critique is the missing middle layer

This is probably the most practical automation addition for your current process.

Workflow:

Render the page locally or in preview
Capture screenshot(s)
Send screenshot plus checklist to a vision-capable model
Ask for critique specifically on:
- hierarchy
- scanability
- copy clarity
- spacing rhythm
- CTA emphasis
- progressive disclosure
- trust/authority fit for security domain
Feed the critique back into one revision pass

This is not a pixel-perfect evaluator. It is a heuristic critic.

That’s still valuable, because your current problem is largely heuristic and compositional.

5.4 Heuristic review checklists can be automated well

NN/g’s 10 heuristics are still useful as a base layer:

match between system and real world
consistency and standards
recognition rather than recall
aesthetic and minimalist design
visibility of system status
error prevention
user control and freedom

For your use case, I would adapt these into a security-product review checklist:

Security UX heuristic layer

Can the user identify severity, scope, confidence, owner, and next step quickly?
Is evidence separate from interpretation, but easy to traverse between them?
Are urgent items visually prominent without making the whole screen scream?
Are actions reversible or safely confirmed where appropriate?
Does the page support accountable review (who changed what, why, when)?
Is dense detail available, but not forced into the first scan?

These checks can be run by a critic agent on every page task.

5.5 Suggested automation stack

A practical stack for your team:

Baseline gates

Typecheck / tests
Lighthouse
axe

Visual gates

Playwright screenshot tests for key pages
Chromatic/Storybook for component and state regressions

AI critique gates

Vision-model screenshot review with structured rubric
optional separate UX critic agent review

Human gates

design/product review on major page changes

6. Proposed OpenClaw UX skill design

The biggest opportunity is to convert good design judgment into a reusable operational skill.

6.1 What the skill should do

The UX skill should not just say "make it polished." It should drive a workflow.

Suggested responsibilities:

interpret the UI task in terms of user goal and page story
force a wireframe/structure pass before coding
apply layout and copy rules during implementation
require post-render screenshot review
run a self-critique checklist
output revision suggestions if the page still feels flat

6.2 Suggested skill contents

A. Design principles as actionable rules

Examples:

Lead with the most decision-relevant information
Do not present more than 3 competing primary regions above the fold
Prefer summary → evidence → action
Demote metadata unless it changes user action
Every page needs one obvious primary action
Every section needs a reason to exist
Use consistent heading/value/meta hierarchy
Avoid equal-weight card stacks
Use progressive disclosure for secondary detail

B. Page-level templates

This is crucial. Component libraries are not enough.

Templates should exist for common security product page types:

dashboard / overview
finding detail page
asset/entity detail page
investigation timeline
exception/review workflow page
queue/table triage page
policy/rule detail page

Each template should define:

above-the-fold structure
default section order
what gets summary treatment
what gets collapsed
typical actions
trust/evidence patterns

C. Copywriting rules

Examples:

headings should answer user questions, not mirror backend models
labels should be short and skimmable
status text should be concrete and active
avoid jargon unless users genuinely use it
helper text should explain implications, not restate labels

D. UX self-review checklist

Minimum checklist:

What is the main takeaway?
Is it visible without scrolling?
What is visually dominant, and is that correct?
What can be hidden by default?
Is the next action obvious?
Do section names sound natural?
Are evidence and action clearly connected?
Does the page feel trustworthy and calm?

E. Reference screenshots / examples

The skill should point to a small local library of:

good dashboard examples
good detail page examples
before/after internal refactors
examples of progressive disclosure
examples of strong typography hierarchy

F. Post-render screenshot review step

The skill should require:

render the page
capture desktop screenshot and maybe narrow viewport screenshot
ask a critic prompt to review the rendered output
revise once before considering task complete

6.3 Proposed skill workflow

A good OpenClaw UX skill could enforce this sequence:

Understand task
- identify page type, user, primary action
Plan UX
- create text wireframe and section hierarchy
Implement
- code using approved template/design system
Render
- run app and capture screenshots
Critique
- evaluate against rubric with vision model and/or critic agent
Revise
- apply one focused revision pass
Submit
- include screenshots and checklist results in PR notes

This turns "taste" into a repeatable quality loop.

7. Workflow redesign recommendations

7.1 Yes: require a text wireframe before code

Recommendation: strong yes.

Reason:

page-level mistakes happen before implementation starts
a text wireframe forces information architecture decisions
it gives a reviewable artifact for product/CEO/critic agents

Suggested format:

user role
top task
first-screen takeaway
section order
primary action
what is hidden by default
rationale for information priority

7.2 Yes: use a separate UX critic agent

Recommendation: yes, especially for page work.

Generation and critique are different cognitive modes. One agent asked to both produce and judge often rationalizes its own output.

The UX critic should review:

wireframe before implementation
screenshot after implementation
optional PR diff summary

The critic should not rewrite everything. It should answer:

what feels flat?
what feels overloaded?
what is visually over-emphasized or under-emphasized?
what sounds machine-written?
what should collapse or move below the fold?

7.3 Yes: require screenshots before submitting UI work

Recommendation: mandatory for page-level UI.

If the final review unit is code, you miss the real failure mode. If the final review unit is screenshots plus code, you catch hierarchy and composition errors much earlier.

Required artifacts for UI PRs:

before screenshot
after screenshot
desktop view
important state variations
short note: "main change in hierarchy / copy / action clarity"

7.4 Yes: build page-level templates

Recommendation: high priority.

Your current issue is exactly what happens when teams have component reuse without page composition standards.

Templates should encode:

hero summary area
side vs inline metadata patterns
evidence panel patterns
action rail patterns
table + summary pairings
escalation / urgency conventions

7.5 Yes: create a design language document agents must follow

Recommendation: yes, but make it operational.

Do not produce a purely aspirational brand/design document. Create a short design language document that contains:

visual tone
hierarchy rules
spacing rhythm rules
content principles
domain-specific copy guidance
examples and anti-patterns
page templates

Then reference it from the skill and from prompts.

7.6 Introduce explicit page success criteria

Every UI task should declare success criteria like:

primary action visible in first screen
severity/status/owner visible without reading all sections
no more than 3 primary information groups above fold
details progressively disclosed
labels rewritten in analyst language
screenshot review passes rubric

This aligns with Anthropic/OpenAI guidance that prompt engineering works best when success criteria and evaluations are explicit.

7.7 Separate component quality from page quality in review

Current failure mode likely comes from over-indexing on component correctness.

Use two review layers:

Component review: correctness, accessibility, consistency
Page review: hierarchy, flow, copy, task support, trust

A page can pass the first and still fail the second.

8. Quick wins (actionable this week)

Quick win 1: Add a mandatory pre-code page plan

For any page-level task, require the agent to output:

top user goal
first-screen takeaway
section hierarchy
what is collapsed
primary action

This is low effort and likely to improve page composition immediately.

Quick win 2: Create a one-page UX checklist for agents

Use a short rubric:

Is the main takeaway obvious in 5 seconds?
Is there one clear primary action?
Are there too many equal-weight sections?
Is metadata subordinate?
Is there progressive disclosure?
Do labels sound natural?
Does the page feel authoritative and calm?

Quick win 3: Require screenshots in every UI PR

No screenshot, no UI approval.

Quick win 4: Add a screenshot-critique step with a vision model

Have the agent render the page and ask for critique against the rubric above. One revision pass only.

Quick win 5: Build 3 page templates first

Start with the highest-frequency page types:

dashboard / overview
finding detail
triage table + detail context

Quick win 6: Build an internal before/after example library

Even 5–10 examples will help a lot. For each example, capture:

old screenshot
improved screenshot
notes on hierarchy, copy, disclosure, spacing, action clarity

Quick win 7: Encode anti-patterns in the system/skill prompt

Explicitly ban:

equal-weight card walls
full-data dumps above the fold
vague or machine-like labels
multiple competing primary CTAs
ungrouped metadata blocks

Quick win 8: Create a security-domain copy guide

A short doc with preferred wording for:

severity
confidence
owner
last seen
affected scope
evidence
review state
recommended action
exception / suppression

This will improve "human phrasing" faster than any model change.

Quick win 9: Introduce a UX critic agent for page tasks only

Start small. Do not slow every UI change. Use the critic for:

new pages
major page redesigns
high-visibility flows

Quick win 10: Track UX-specific defects separately

After PR review or CEO feedback, tag the issue type:

hierarchy
copy tone
too much visible detail
action ambiguity
spacing/rhythm
trust/authority mismatch

After 2–3 weeks, use those defect patterns to improve the UX skill.

Final recommendation

If I had to prioritize only three changes, I would do these first:

Pre-code wireframe/page plan
Post-render screenshot critique
OpenClaw UX skill with page templates + self-review checklist

Those three changes directly target the real issue: the agents are producing correct UI code without enough structure for page-level judgment.

The fix is not a more eloquent "make it beautiful" prompt. The fix is a workflow where the agent must:

plan the page,
implement inside explicit design constraints,
look at the rendered result,
and critique it before a human ever sees it.

Sources / evidence used

v0 docs: positioning around generating high-fidelity UI from prompts, screenshots, templates, design systems, repo sync, and PR/deploy flows
Cursor customer stories:
- Box: custom rules, frontend AI toolkit, faster React and design system migrations
- Salesforce: velocity/quality gains and trust-building via workflow adoption
Microsoft Fluent design system site: evidence of mature design-system support and tooling ecosystem
Anthropic prompt engineering overview: explicit success criteria and evals before prompt tuning
Anthropic vision docs: practical use of images and image-first prompting patterns
NN/g usability heuristics: especially match to real world, recognition over recall, aesthetic/minimalist design
Chrome Lighthouse docs: automated quality/accessibility/performance checks and CI support
Deque axe platform materials: accessibility tooling embedded into dev workflow
Storybook visual testing docs and Chromatic materials: snapshot baselines, visual regression testing, UI review workflows, explicit sign-off
Playwright screenshot comparison docs: page-level golden screenshot testing

Next Action

Status: research-complete — input to [[combined-ux-strategy]] Decision needed from: CTO (Ivan) See: [[combined-ux-strategy]] for synthesized recommendations and decision options

1. Executive Summary​

2. How teams ship good AI-generated UI today (real examples)​

2.1 The common pattern: AI inside a constrained system​

v0 / Vercel pattern​

2.2 Cursor customer pattern: rules + internal toolkit + standards​

Box​

Salesforce​

2.3 Design-system-first orgs win because AI has better constraints​

2.4 What teams that ship AI UI successfully actually do​

3. The taste gap — root causes and mitigations​

3.1 Root cause: models optimize local validity, not holistic UX​

3.2 Why "good Tailwind" is not the same as good design​

3.3 Mitigation: encode taste as decision rules, not aspirations​

3.4 Reference designs and screenshots are one of the strongest fixes​

3.5 Before/after examples are unusually valuable​

3.6 Vision models can help, but mostly as critics​

4. Practical prompt patterns that produce better UX​

4.1 Prompt pattern: require a page plan before code​

4.2 Prompt pattern: specify hierarchy rules numerically​

4.3 Prompt pattern: write for scanability, not completeness​

4.4 Prompt pattern: require progressive disclosure explicitly​

4.5 Prompt pattern: copy should sound like a human product designer​

4.6 Prompt pattern: include domain-specific UX goals​

4.7 Prompt pattern: use positive examples plus anti-patterns​

4.8 Prompt pattern: force self-critique before finishing​

5. UX review automation approaches​

5.1 Accessibility and quality tooling: useful but partial proxies​

Lighthouse​

axe / Deque​

5.2 Visual regression testing is essential for AI-generated UI​

Storybook + Chromatic​

Playwright snapshots​

5.3 AI-powered screenshot critique is the missing middle layer​

5.4 Heuristic review checklists can be automated well​

Security UX heuristic layer​

5.5 Suggested automation stack​

6. Proposed OpenClaw UX skill design​

6.1 What the skill should do​

6.2 Suggested skill contents​

A. Design principles as actionable rules​

B. Page-level templates​

C. Copywriting rules​

D. UX self-review checklist​

E. Reference screenshots / examples​

F. Post-render screenshot review step​

6.3 Proposed skill workflow​

7. Workflow redesign recommendations​

7.1 Yes: require a text wireframe before code​

7.2 Yes: use a separate UX critic agent​

7.3 Yes: require screenshots before submitting UI work​

7.4 Yes: build page-level templates​

7.5 Yes: create a design language document agents must follow​

7.6 Introduce explicit page success criteria​

7.7 Separate component quality from page quality in review​

8. Quick wins (actionable this week)​

Quick win 1: Add a mandatory pre-code page plan​

Quick win 2: Create a one-page UX checklist for agents​

Quick win 3: Require screenshots in every UI PR​

Quick win 4: Add a screenshot-critique step with a vision model​

Quick win 5: Build 3 page templates first​

Quick win 6: Build an internal before/after example library​

Quick win 7: Encode anti-patterns in the system/skill prompt​

Quick win 8: Create a security-domain copy guide​

Quick win 9: Introduce a UX critic agent for page tasks only​

Quick win 10: Track UX-specific defects separately​

Final recommendation​

Sources / evidence used​

Next Action​

1. Executive Summary

2. How teams ship good AI-generated UI today (real examples)

2.1 The common pattern: AI inside a constrained system

v0 / Vercel pattern

2.2 Cursor customer pattern: rules + internal toolkit + standards

Box

Salesforce

2.3 Design-system-first orgs win because AI has better constraints

2.4 What teams that ship AI UI successfully actually do

3. The taste gap — root causes and mitigations

3.1 Root cause: models optimize local validity, not holistic UX

3.2 Why "good Tailwind" is not the same as good design

3.3 Mitigation: encode taste as decision rules, not aspirations

3.4 Reference designs and screenshots are one of the strongest fixes

3.5 Before/after examples are unusually valuable

3.6 Vision models can help, but mostly as critics

4. Practical prompt patterns that produce better UX

4.1 Prompt pattern: require a page plan before code

4.2 Prompt pattern: specify hierarchy rules numerically

4.3 Prompt pattern: write for scanability, not completeness

4.4 Prompt pattern: require progressive disclosure explicitly

4.5 Prompt pattern: copy should sound like a human product designer

4.6 Prompt pattern: include domain-specific UX goals

4.7 Prompt pattern: use positive examples plus anti-patterns

4.8 Prompt pattern: force self-critique before finishing

5. UX review automation approaches

5.1 Accessibility and quality tooling: useful but partial proxies

Lighthouse

axe / Deque

5.2 Visual regression testing is essential for AI-generated UI

Storybook + Chromatic

Playwright snapshots

5.3 AI-powered screenshot critique is the missing middle layer

5.4 Heuristic review checklists can be automated well

Security UX heuristic layer

5.5 Suggested automation stack

6. Proposed OpenClaw UX skill design

6.1 What the skill should do

6.2 Suggested skill contents

A. Design principles as actionable rules

B. Page-level templates

C. Copywriting rules

D. UX self-review checklist

E. Reference screenshots / examples

F. Post-render screenshot review step

6.3 Proposed skill workflow

7. Workflow redesign recommendations

7.1 Yes: require a text wireframe before code

7.2 Yes: use a separate UX critic agent

7.3 Yes: require screenshots before submitting UI work

7.4 Yes: build page-level templates

7.5 Yes: create a design language document agents must follow

7.6 Introduce explicit page success criteria

7.7 Separate component quality from page quality in review

8. Quick wins (actionable this week)

Quick win 1: Add a mandatory pre-code page plan

Quick win 2: Create a one-page UX checklist for agents

Quick win 3: Require screenshots in every UI PR

Quick win 4: Add a screenshot-critique step with a vision model

Quick win 5: Build 3 page templates first

Quick win 6: Build an internal before/after example library

Quick win 7: Encode anti-patterns in the system/skill prompt

Quick win 8: Create a security-domain copy guide

Quick win 9: Introduce a UX critic agent for page tasks only

Quick win 10: Track UX-specific defects separately

Final recommendation

Sources / evidence used

Next Action