Research: Making AI Coding Agents Produce Better UI/UX Outcomes
1. Executive Summary
The core problem is not that the agents cannot write UI code. It is that they are optimizing for code correctness and local component quality, while good product UI depends on page-level hierarchy, narrative flow, domain language, visual rhythm, and judgment.
That is the "taste gap": LLMs can generate syntactically valid React/Tailwind quickly, but they do not naturally prioritize information the way an experienced designer or product-minded frontend engineer would.
What successful teams do in practice is not "let the model freestyle the UI." They constrain it with:
- Design systems and component libraries
- Rules/guardrails encoded close to the code
- Examples, templates, and reference UIs
- Visual review loops using screenshots, Storybook, Chromatic, Playwright, and humans
- Workflow separation between generation and critique
- Evaluations against explicit criteria, not vibes alone
For your OpenClaw-based agents, the biggest gains will likely come from process changes, not smarter one-shot prompting:
- Require a text wireframe / page plan before coding
- Create a UX skill that encodes hierarchy, spacing, copywriting, and progressive-disclosure rules
- Require rendered screenshots before submission
- Add a separate UX critic pass on every UI task/PR
- Use page templates, not just reusable components
- Review pages against heuristics (scanability, information hierarchy, action clarity, evidence trust)
For a security product specifically, good UX should feel:
- Scannable under time pressure
- Authoritative, not playful
- Evidence-first when showing risk or findings
- Clear about urgency and next actions
- Trustworthy and accountable in how decisions are presented
The most important operational change: stop treating UI generation as "write components that satisfy the ticket" and start treating it as "design a page experience, then implement it."
2. How teams ship good AI-generated UI today (real examples)
2.1 The common pattern: AI inside a constrained system
The strongest pattern across tools is that teams succeed when AI is used inside a design and review system, not as a replacement for one.
v0 / Vercel pattern
From v0 docs:
- v0 positions itself as an AI agent that can create real code and full-stack apps
- It explicitly supports creating high-fidelity UIs from wireframes/mockups
- It emphasizes templates, design systems, live preview, repo sync, and pull requests
- It is used by designers to clone pages from screenshots/Figma and by engineers to scaffold components quickly
Implication: even Vercel’s own framing is not "prompt a beautiful app from nothing." It is:
- start from a visual reference, template, or existing system
- generate quickly
- refine visually
- connect to existing code and review flows
That matters for your team: reference inputs and system constraints are first-class, not optional.
2.2 Cursor customer pattern: rules + internal toolkit + standards
Box
Cursor’s Box case study is especially relevant.
Reported outcomes:
- 85%+ of developers use Cursor daily
- 30–50% increase in roadmap throughput
- React migration completed ~80% faster than expected
- large design system migration completed ~90% faster than expected
But the key detail is not speed. It is method:
- Box built a standard "AI toolkit" for frontend development using custom Cursor rules
- They defined agent guardrails directly in code
- They used rules so Cursor could understand exactly how components should be structured
This is highly relevant to OpenClaw skills. The lesson is:
- don’t leave taste in a wiki doc
- encode it into the agent’s operating context
- keep it near the code and near the generation step
Salesforce
Cursor’s Salesforce case study shows a different but useful pattern:
- teams measured cycle time, quality, and throughput
- adoption started with boring tasks, then expanded as trust grew
- quality gains came partly from more generated tests and broader SDLC usage
Implication for UI work:
- teams do not adopt agent-generated UI by trusting it immediately
- they create trust through measurable guardrails and iterative scope expansion
2.3 Design-system-first orgs win because AI has better constraints
Microsoft’s Fluent design system is a reminder that mature orgs ship coherent UX because they provide:
- component standards
- content guidance
- accessibility resources
- cross-platform patterns
- design tooling and developer tooling
AI benefits disproportionately from this kind of structure. A human designer can compensate for weak constraints with judgment. An LLM usually cannot.
So the practical takeaway is:
- a strong design system is necessary but insufficient
- AI also needs page-level composition rules, copy rules, and examples of good hierarchy
2.4 What teams that ship AI UI successfully actually do
Across these examples, the recurring success pattern is:
- Use AI to draft, not to decide everything
- Anchor output in an existing design language
- Encode standards as rules or reusable prompts
- Generate from references (wireframes, screenshots, Figma, templates)
- Review rendered output visually, not just code
- Use CI checks for regressions and accessibility
In other words: good AI-generated UI is usually AI-accelerated systematized UI, not raw model taste.
3. The taste gap — root causes and mitigations
3.1 Root cause: models optimize local validity, not holistic UX
The symptoms you described map well to typical LLM failure modes:
- Vertical data dump → the model serializes all requirements into sections
- Flat typography → the model knows semantic HTML but lacks judgment about emphasis
- Correct but lifeless spacing → it applies utility classes mechanically
- Accurate but unnatural labels → it mirrors domain input or ticket language too literally
- No progressive disclosure → it fears omission more than overload
- No page story → it treats pages as containers, not guided experiences
- Components don’t compose → it solves each block independently
This happens because LLMs are usually rewarded for:
- completeness
- correctness
- explicitness
- satisfying every requirement mentioned
But good UX often requires:
- omission
- prioritization
- grouping
- implied hierarchy
- pacing
- editorial judgment
3.2 Why "good Tailwind" is not the same as good design
Agents often know:
space-y-6text-sm text-muted-foregroundrounded-lg border- grid and flex patterns
But good UX depends on:
- where visual emphasis should accumulate
- what should be visible in the first 5 seconds
- what metadata should be hidden or subordinate
- which actions deserve primacy
- how sections should ladder from summary to evidence to action
Tailwind literacy solves implementation. It does not solve composition.
3.3 Mitigation: encode taste as decision rules, not aspirations
A vague instruction like "make it polished" does not travel well.
A much better approach is to encode taste into explicit rules such as:
- Every page must have a primary takeaway above the fold
- No more than 3 visually competing sections on initial load
- Each page must define: headline, status summary, evidence, recommended action
- Metadata should default to secondary styling and/or collapsible containers
- Use only one primary action per screen state
- Prefer summary first, details on demand
- Headings must create a clear size and weight ladder
- Avoid walls of equal cards or equal sections
This is the difference between "taste" and "operationalized taste."
3.4 Reference designs and screenshots are one of the strongest fixes
v0 explicitly supports generating from wireframes/mockups/screenshots. That is consistent with a broader reality: models do better when they can imitate visual relationships instead of inventing them from abstract prose.
Practical implications:
- Give the agent screenshots of pages whose hierarchy you like
- Give before/after examples of improved pages
- Keep a small reference set for dashboards, detail pages, tables, risk reviews, and workflows
- Ask the model to explain which structural properties it is reusing
Good reference use is not copying aesthetics blindly. It is constraining:
- density
- grouping
- rhythm
- heading hierarchy
- action placement
- disclosure patterns
3.5 Before/after examples are unusually valuable
If you want better page composition, examples of what changed are often better than rules alone.
Why:
- they demonstrate what "too flat" looks like
- they show how details are demoted without being lost
- they expose better wording for headings and labels
- they clarify what "scanable" means in practice
For this team, a useful artifact would be a small internal library of:
- bad AI page → improved page
- with notes on why the improved version works
This can become training data for prompts, review checklists, and future skills.
3.6 Vision models can help, but mostly as critics
Vision-capable models are useful because they can evaluate the rendered result, which is where many UX failures become obvious.
Anthropic’s vision docs reinforce practical constraints:
- images work best when clear and legible
- image-first prompting often helps
- multiple images can be compared in one request
Strong use cases for vision in your workflow:
- critique a screenshot of the rendered page
- compare current page vs reference screenshot
- compare before/after screenshots and explain improvement
- detect flat hierarchy, dense blocks, weak CTA emphasis, clutter
Weak use cases:
- using a vision model as the sole final arbiter of quality
- expecting it to replace human product judgment
Best role: AI as a first-pass critic before human review.
4. Practical prompt patterns that produce better UX
The highest-leverage prompt improvements are not about adding adjectives. They are about forcing the model to make product/design decisions explicitly.
4.1 Prompt pattern: require a page plan before code
Instead of:
Build the findings detail page in React and Tailwind.
Use:
Before writing code, produce a page plan with:
- primary user goal
- primary takeaway visible in first screen
- information hierarchy from most important to least important
- what is hidden by default
- primary action and secondary actions
- section list with one-line purpose for each section Then implement only after the plan is approved or self-checked.
Why it works:
- forces prioritization
- reduces the tendency to dump all requirements onto the page
- creates an artifact a critic agent can review
4.2 Prompt pattern: specify hierarchy rules numerically
Agents respond better to concrete hierarchy constraints than abstract style language.
Example:
Create strong visual hierarchy.
- One clear page title
- One summary band above the fold
- Max 3 primary information groups before scrolling
- Metadata must be visually subordinate to interpretation
- If everything looks equally important, reduce emphasis until only the top message dominates
For typography:
Use a clear hierarchy:
- page title: largest and boldest
- section headings: clearly smaller than title but distinctly stronger than body
- labels/meta: smaller and muted
- avoid using the same font size/weight for heading, value, and explanatory text
4.3 Prompt pattern: write for scanability, not completeness
Example:
Write the page so a security analyst can understand the situation in 5–10 seconds.
- lead with outcome, severity, confidence, owner, and next action
- prefer short labels and plain language
- summarize before listing evidence
- do not make the user read every section to understand the issue
This directly addresses your problem that pages have no story.
4.4 Prompt pattern: require progressive disclosure explicitly
Example:
Do not display all details at once. Default to summary-first. Put secondary data in:
- collapsible sections
- tabs
- drawers
- "show details" expansions Surface deeper evidence only where it supports a decision.
Without this instruction, many agents assume all available data should be visible.
4.5 Prompt pattern: copy should sound like a human product designer
Example:
Write labels and helper text the way a human would scan them, not the way a schema or backend field would name them. Prefer:
- "Needs review" over "Review status: pending analyst disposition"
- "Last seen" over "Most recent observation timestamp"
- "Why this matters" over "Risk explanation"
This is especially important in security products, where domain language easily becomes stiff or bureaucratic.
4.6 Prompt pattern: include domain-specific UX goals
For your product, prompts should include security-specific constraints such as:
This is a security operations product. The UI should feel:
- calm under pressure
- evidence-based
- trustworthy
- accountable
- optimized for triage and review
Prioritize:
- severity and confidence visibility
- ownership clarity
- evidence traceability
- action history / accountability
- minimizing cognitive load under alert fatigue
This is critical. Generic SaaS UI prompts often produce friendly dashboard UIs, not authoritative security workflows.
4.7 Prompt pattern: use positive examples plus anti-patterns
Negative examples are useful when tied to specific failure modes.
Example anti-pattern section in the prompt:
Avoid these common failures:
- long vertical stacks of equal cards
- headings and metadata with the same visual weight
- showing every field just because data exists
- tables without summary interpretation
- technically correct labels that sound machine-generated
- pages where the primary action is unclear
Then pair with positive examples:
Prefer:
- summary banner + prioritized sections
- one dominant action area
- evidence grouped under clear questions
- labels that sound like analyst workflow language
In practice, positive examples plus named anti-patterns work better than either alone.
4.8 Prompt pattern: force self-critique before finishing
Example:
Before finalizing, review the UI against this checklist and revise once:
- Can a user grasp the page in 5 seconds?
- Is the main takeaway obvious above the fold?
- Are there more than 3 competing primary sections?
- Is metadata visually subordinate?
- Is there progressive disclosure?
- Do labels sound natural?
- Is the next action obvious?
This is cheap and often improves results materially.
5. UX review automation approaches
Automation will not fully solve UX quality, but it can catch a lot of what your current workflow is missing.
5.1 Accessibility and quality tooling: useful but partial proxies
Lighthouse
Lighthouse is an automated tool for performance, accessibility, SEO, and general page quality. It can run in DevTools, CLI, or CI, and Lighthouse CI can prevent regressions.
What it helps with:
- accessibility regressions
- performance problems
- some UX-adjacent quality issues
What it does not solve:
- information hierarchy
- poor narrative structure
- awkward labels
- weak visual rhythm
Use it as a floor, not a substitute for UX review.
axe / Deque
Axe helps teams automate accessibility testing and integrate checks into development workflows.
What it helps with:
- catching many accessibility issues early
- embedding checks in IDE/build/test pipelines
Again, accessibility is necessary, but not sufficient for good UX.
5.2 Visual regression testing is essential for AI-generated UI
Storybook + Chromatic
Storybook visual tests and Chromatic are especially relevant because they create baselines for every story and detect UI regressions automatically.
Why this matters for AI agents:
- AI can unintentionally alter spacing, hierarchy, state styling, and interaction affordances
- visual diffing catches regressions that code review misses
- Chromatic also supports explicit sign-off and shared UI context, which is valuable for multi-agent systems
Chromatic explicitly frames itself as enforcing UI standards even when AI writes code.
This is a strong fit for your setup if you have Storybook or can add it for critical components/pages.
Playwright snapshots
Playwright’s toHaveScreenshot() provides page- and component-level visual comparison.
Best use here:
- page-level screenshots for top workflows
- golden snapshots for key states: empty, normal, high severity, overloaded, error
- same environment for stable rendering
5.3 AI-powered screenshot critique is the missing middle layer
This is probably the most practical automation addition for your current process.
Workflow:
- Render the page locally or in preview
- Capture screenshot(s)
- Send screenshot plus checklist to a vision-capable model
- Ask for critique specifically on:
- hierarchy
- scanability
- copy clarity
- spacing rhythm
- CTA emphasis
- progressive disclosure
- trust/authority fit for security domain
- Feed the critique back into one revision pass
This is not a pixel-perfect evaluator. It is a heuristic critic.
That’s still valuable, because your current problem is largely heuristic and compositional.
5.4 Heuristic review checklists can be automated well
NN/g’s 10 heuristics are still useful as a base layer:
- match between system and real world
- consistency and standards
- recognition rather than recall
- aesthetic and minimalist design
- visibility of system status
- error prevention
- user control and freedom
For your use case, I would adapt these into a security-product review checklist:
Security UX heuristic layer
- Can the user identify severity, scope, confidence, owner, and next step quickly?
- Is evidence separate from interpretation, but easy to traverse between them?
- Are urgent items visually prominent without making the whole screen scream?
- Are actions reversible or safely confirmed where appropriate?
- Does the page support accountable review (who changed what, why, when)?
- Is dense detail available, but not forced into the first scan?
These checks can be run by a critic agent on every page task.
5.5 Suggested automation stack
A practical stack for your team:
Baseline gates
- Typecheck / tests
- Lighthouse
- axe
Visual gates
- Playwright screenshot tests for key pages
- Chromatic/Storybook for component and state regressions
AI critique gates
- Vision-model screenshot review with structured rubric
- optional separate UX critic agent review
Human gates
- design/product review on major page changes
6. Proposed OpenClaw UX skill design
The biggest opportunity is to convert good design judgment into a reusable operational skill.
6.1 What the skill should do
The UX skill should not just say "make it polished." It should drive a workflow.
Suggested responsibilities:
- interpret the UI task in terms of user goal and page story
- force a wireframe/structure pass before coding
- apply layout and copy rules during implementation
- require post-render screenshot review
- run a self-critique checklist
- output revision suggestions if the page still feels flat
6.2 Suggested skill contents
A. Design principles as actionable rules
Examples:
- Lead with the most decision-relevant information
- Do not present more than 3 competing primary regions above the fold
- Prefer summary → evidence → action
- Demote metadata unless it changes user action
- Every page needs one obvious primary action
- Every section needs a reason to exist
- Use consistent heading/value/meta hierarchy
- Avoid equal-weight card stacks
- Use progressive disclosure for secondary detail
B. Page-level templates
This is crucial. Component libraries are not enough.
Templates should exist for common security product page types:
- dashboard / overview
- finding detail page
- asset/entity detail page
- investigation timeline
- exception/review workflow page
- queue/table triage page
- policy/rule detail page
Each template should define:
- above-the-fold structure
- default section order
- what gets summary treatment
- what gets collapsed
- typical actions
- trust/evidence patterns
C. Copywriting rules
Examples:
- headings should answer user questions, not mirror backend models
- labels should be short and skimmable
- status text should be concrete and active
- avoid jargon unless users genuinely use it
- helper text should explain implications, not restate labels
D. UX self-review checklist
Minimum checklist:
- What is the main takeaway?
- Is it visible without scrolling?
- What is visually dominant, and is that correct?
- What can be hidden by default?
- Is the next action obvious?
- Do section names sound natural?
- Are evidence and action clearly connected?
- Does the page feel trustworthy and calm?
E. Reference screenshots / examples
The skill should point to a small local library of:
- good dashboard examples
- good detail page examples
- before/after internal refactors
- examples of progressive disclosure
- examples of strong typography hierarchy
F. Post-render screenshot review step
The skill should require:
- render the page
- capture desktop screenshot and maybe narrow viewport screenshot
- ask a critic prompt to review the rendered output
- revise once before considering task complete
6.3 Proposed skill workflow
A good OpenClaw UX skill could enforce this sequence:
- Understand task
- identify page type, user, primary action
- Plan UX
- create text wireframe and section hierarchy
- Implement
- code using approved template/design system
- Render
- run app and capture screenshots
- Critique
- evaluate against rubric with vision model and/or critic agent
- Revise
- apply one focused revision pass
- Submit
- include screenshots and checklist results in PR notes
This turns "taste" into a repeatable quality loop.
7. Workflow redesign recommendations
7.1 Yes: require a text wireframe before code
Recommendation: strong yes.
Reason:
- page-level mistakes happen before implementation starts
- a text wireframe forces information architecture decisions
- it gives a reviewable artifact for product/CEO/critic agents
Suggested format:
- user role
- top task
- first-screen takeaway
- section order
- primary action
- what is hidden by default
- rationale for information priority
7.2 Yes: use a separate UX critic agent
Recommendation: yes, especially for page work.
Generation and critique are different cognitive modes. One agent asked to both produce and judge often rationalizes its own output.
The UX critic should review:
- wireframe before implementation
- screenshot after implementation
- optional PR diff summary
The critic should not rewrite everything. It should answer:
- what feels flat?
- what feels overloaded?
- what is visually over-emphasized or under-emphasized?
- what sounds machine-written?
- what should collapse or move below the fold?
7.3 Yes: require screenshots before submitting UI work
Recommendation: mandatory for page-level UI.
If the final review unit is code, you miss the real failure mode. If the final review unit is screenshots plus code, you catch hierarchy and composition errors much earlier.
Required artifacts for UI PRs:
- before screenshot
- after screenshot
- desktop view
- important state variations
- short note: "main change in hierarchy / copy / action clarity"
7.4 Yes: build page-level templates
Recommendation: high priority.
Your current issue is exactly what happens when teams have component reuse without page composition standards.
Templates should encode:
- hero summary area
- side vs inline metadata patterns
- evidence panel patterns
- action rail patterns
- table + summary pairings
- escalation / urgency conventions
7.5 Yes: create a design language document agents must follow
Recommendation: yes, but make it operational.
Do not produce a purely aspirational brand/design document. Create a short design language document that contains:
- visual tone
- hierarchy rules
- spacing rhythm rules
- content principles
- domain-specific copy guidance
- examples and anti-patterns
- page templates
Then reference it from the skill and from prompts.
7.6 Introduce explicit page success criteria
Every UI task should declare success criteria like:
- primary action visible in first screen
- severity/status/owner visible without reading all sections
- no more than 3 primary information groups above fold
- details progressively disclosed
- labels rewritten in analyst language
- screenshot review passes rubric
This aligns with Anthropic/OpenAI guidance that prompt engineering works best when success criteria and evaluations are explicit.
7.7 Separate component quality from page quality in review
Current failure mode likely comes from over-indexing on component correctness.
Use two review layers:
- Component review: correctness, accessibility, consistency
- Page review: hierarchy, flow, copy, task support, trust
A page can pass the first and still fail the second.
8. Quick wins (actionable this week)
Quick win 1: Add a mandatory pre-code page plan
For any page-level task, require the agent to output:
- top user goal
- first-screen takeaway
- section hierarchy
- what is collapsed
- primary action
This is low effort and likely to improve page composition immediately.
Quick win 2: Create a one-page UX checklist for agents
Use a short rubric:
- Is the main takeaway obvious in 5 seconds?
- Is there one clear primary action?
- Are there too many equal-weight sections?
- Is metadata subordinate?
- Is there progressive disclosure?
- Do labels sound natural?
- Does the page feel authoritative and calm?
Quick win 3: Require screenshots in every UI PR
No screenshot, no UI approval.
Quick win 4: Add a screenshot-critique step with a vision model
Have the agent render the page and ask for critique against the rubric above. One revision pass only.
Quick win 5: Build 3 page templates first
Start with the highest-frequency page types:
- dashboard / overview
- finding detail
- triage table + detail context
Quick win 6: Build an internal before/after example library
Even 5–10 examples will help a lot. For each example, capture:
- old screenshot
- improved screenshot
- notes on hierarchy, copy, disclosure, spacing, action clarity
Quick win 7: Encode anti-patterns in the system/skill prompt
Explicitly ban:
- equal-weight card walls
- full-data dumps above the fold
- vague or machine-like labels
- multiple competing primary CTAs
- ungrouped metadata blocks
Quick win 8: Create a security-domain copy guide
A short doc with preferred wording for:
- severity
- confidence
- owner
- last seen
- affected scope
- evidence
- review state
- recommended action
- exception / suppression
This will improve "human phrasing" faster than any model change.
Quick win 9: Introduce a UX critic agent for page tasks only
Start small. Do not slow every UI change. Use the critic for:
- new pages
- major page redesigns
- high-visibility flows
Quick win 10: Track UX-specific defects separately
After PR review or CEO feedback, tag the issue type:
- hierarchy
- copy tone
- too much visible detail
- action ambiguity
- spacing/rhythm
- trust/authority mismatch
After 2–3 weeks, use those defect patterns to improve the UX skill.
Final recommendation
If I had to prioritize only three changes, I would do these first:
- Pre-code wireframe/page plan
- Post-render screenshot critique
- OpenClaw UX skill with page templates + self-review checklist
Those three changes directly target the real issue: the agents are producing correct UI code without enough structure for page-level judgment.
The fix is not a more eloquent "make it beautiful" prompt. The fix is a workflow where the agent must:
- plan the page,
- implement inside explicit design constraints,
- look at the rendered result,
- and critique it before a human ever sees it.
Sources / evidence used
- v0 docs: positioning around generating high-fidelity UI from prompts, screenshots, templates, design systems, repo sync, and PR/deploy flows
- Cursor customer stories:
- Box: custom rules, frontend AI toolkit, faster React and design system migrations
- Salesforce: velocity/quality gains and trust-building via workflow adoption
- Microsoft Fluent design system site: evidence of mature design-system support and tooling ecosystem
- Anthropic prompt engineering overview: explicit success criteria and evals before prompt tuning
- Anthropic vision docs: practical use of images and image-first prompting patterns
- NN/g usability heuristics: especially match to real world, recognition over recall, aesthetic/minimalist design
- Chrome Lighthouse docs: automated quality/accessibility/performance checks and CI support
- Deque axe platform materials: accessibility tooling embedded into dev workflow
- Storybook visual testing docs and Chromatic materials: snapshot baselines, visual regression testing, UI review workflows, explicit sign-off
- Playwright screenshot comparison docs: page-level golden screenshot testing
Next Action
Status: research-complete — input to [[combined-ux-strategy]]
Decision needed from: CTO (Ivan)
See: [[combined-ux-strategy]] for synthesized recommendations and decision options