Browser Automation

Arc opens a real Chromium browser and interacts with pages much as a person would, working from accessibility tree snapshots rather than screenshots.

How It Works

Each page is converted to a numbered list of interactive elements:

[3] textbox "Where from?" value="Delhi"
[5] combobox "Where to?"
[7] textbox "Departure" value=""
[9] button "Search"
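As a rough illustration of how a listing like this could be produced, the sketch below numbers every node in document order and prints only the interactive ones, which would explain the gaps in the numbering. The `Node` class, the `INTERACTIVE_ROLES` set, and `render_snapshot` are hypothetical names for this example, not Arc's actual internals.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: roles treated as "interactive" for this sketch only.
INTERACTIVE_ROLES = {"textbox", "combobox", "button", "link", "checkbox"}


@dataclass
class Node:
    """Hypothetical, flattened accessibility node."""
    role: str
    name: str
    value: Optional[str] = None  # None means "no value attribute to show"


def render_snapshot(nodes: List[Node]) -> str:
    """Number every node in order, but emit only the interactive ones."""
    lines = []
    for index, node in enumerate(nodes, start=1):
        if node.role not in INTERACTIVE_ROLES:
            continue  # headings, labels, etc. keep their number but are hidden
        line = f'[{index}] {node.role} "{node.name}"'
        if node.value is not None:
            line += f' value="{node.value}"'
        lines.append(line)
    return "\n".join(lines)


# A made-up page whose interactive elements land on 3, 5, 7 and 9.
page = [
    Node("heading", "Book a flight"),
    Node("label", "From"),
    Node("textbox", "Where from?", "Delhi"),
    Node("label", "To"),
    Node("combobox", "Where to?"),
    Node("label", "Date"),
    Node("textbox", "Departure", ""),
    Node("separator", ""),
    Node("button", "Search"),
]

print(render_snapshot(page))
```

Keeping the numbers stable across hidden nodes is one possible design choice; it lets an action such as `"target": "[5]"` refer to the same element regardless of how many elements are filtered out of the listing.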

The LLM reads this snapshot, decides what to do, and sends back a batch of actions that reference elements by number:

{"actions": [
  {"type": "fill", "target": "[5]", "value": "Mumbai"},
  {"type": "fill", "target": "[7]", "value": "2026-04-10"},
  {"type": "click", "target": "[9]"}
]}
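To make the action format concrete, here is a minimal sketch of how a batch like the one above might be validated and applied in order. It runs against an in-memory dictionary standing in for the page, so no browser is involved; `parse_target`, `apply_actions`, and the `elements` structure are assumptions made for this example rather than Arc's real API.

```python
import json
import re
from typing import Dict, List

# Targets look like "[5]": a snapshot index in square brackets.
TARGET_RE = re.compile(r"^\[(\d+)\]$")


def parse_target(target: str) -> int:
    """Turn "[5]" into 5, rejecting anything that is not a bracketed index."""
    match = TARGET_RE.match(target)
    if match is None:
        raise ValueError(f"bad target: {target!r}")
    return int(match.group(1))


def apply_actions(payload: str, elements: Dict[int, dict]) -> List[str]:
    """Apply each action in order to a stand-in page; return a log of steps."""
    log = []
    for action in json.loads(payload)["actions"]:
        index = parse_target(action["target"])
        if index not in elements:
            raise KeyError(f"no element [{index}] in the current snapshot")
        kind = action["type"]
        if kind == "fill":
            elements[index]["value"] = action["value"]
            log.append(f"fill [{index}] = {action['value']}")
        elif kind == "click":
            log.append(f"click [{index}]")
        else:
            raise ValueError(f"unsupported action type: {kind!r}")
    return log


payload = """{"actions": [
  {"type": "fill", "target": "[5]", "value": "Mumbai"},
  {"type": "fill", "target": "[7]", "value": "2026-04-10"},
  {"type": "click", "target": "[9]"}
]}"""

# Stand-in for the page state behind the snapshot shown earlier.
elements = {
    3: {"role": "textbox", "value": "Delhi"},
    5: {"role": "combobox", "value": ""},
    7: {"role": "textbox", "value": ""},
    9: {"role": "button"},
}

for step in apply_actions(payload, elements):
    print(step)
```

Checking every target against the current snapshot before acting is a cheap guard against stale indices, for example when the page changed between the snapshot and the action batch.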

Browser Tools

Tool	Description
`browser_go`	Navigate to a URL, get a structured page snapshot
`browser_look`	Re-examine the current page
`browser_act`	Click, fill, scroll, and submit, all in one call

Capabilities

Fills forms (text, dropdowns, comboboxes, date pickers)
Handles autocomplete suggestions (Google Flights, Amazon, etc.)
Navigates calendars, picks dates, closes overlays
Deals with CAPTCHAs by asking you to solve them, then continues
Keeps persistent browser profiles (cookies and sessions survive restarts)
Switches between headless and headed mode when a human needs to assist

Under the Hood

The engine handles complex interactions mechanically:

Autocomplete: types character by character, waits for the dropdown, and picks the best match using word-boundary-aware scoring
Calendars: detects the calendar type (data-iso, aria-label, gridcell), navigates months, and clicks the right day
Overlays: escalates through click fallbacks (normal → force → JS → mouse coordinates)
CAPTCHAs: detects them, escalates to a human, then continues where it left off
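The word-boundary-aware scoring mentioned for autocomplete can be sketched roughly as follows. The tiers and point values are assumptions chosen for illustration, and `score` and `pick_best` are hypothetical names; the real engine may weigh matches differently.

```python
import re
from typing import List, Optional


def score(query: str, candidate: str) -> int:
    """Rank a suggestion: whole-word matches beat word prefixes and substrings."""
    q = query.strip().lower()
    c = candidate.strip().lower()
    if not q:
        return 0
    if c == q:
        return 100  # exact match
    pattern = re.escape(q)
    if re.match(pattern + r"\b", c):
        return 80   # whole word at the very start
    if re.search(r"\b" + pattern + r"\b", c):
        return 60   # whole word somewhere later
    if re.search(r"\b" + pattern, c):
        return 40   # prefix of a longer word
    if q in c:
        return 20   # buried inside a word
    return 0


def pick_best(query: str, suggestions: List[str]) -> Optional[str]:
    """Return the highest-scoring suggestion, or None if nothing matches."""
    best = max(suggestions, key=lambda s: score(query, s), default=None)
    if best is None or score(query, best) == 0:
        return None
    return best


suggestions = ["Navi Mumbai, India", "Mumbaikar Cafe", "Mumbai, India (BOM)"]
print(pick_best("Mumbai", suggestions))
```

With plain substring matching, all three suggestions would tie; anchoring on word boundaries is what separates "Mumbai, India (BOM)" from "Mumbaikar Cafe" in this sketch.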

Why Accessibility Tree?

	Screenshot + Vision	Accessibility Tree
Speed	Slow (image encoding + vision model)	Fast (text only)
Cost	Expensive (vision tokens)	Cheap (text tokens)
Accuracy	Approximate coordinates	Exact element targeting
Works with	Vision-capable models only	Any LLM