Page perception tools
Four tools let the AI read what the visitor is currently looking at. The widget’s bundle exposes these so the model can answer “what does this say?” without preloading every page of your site into the prompt.
read_page — full page scrape
Returns the title, meta description, every heading, navigation links, visible images (alt text + nearest heading), products (if any), CTAs, and up to 8 000 chars of main body text.
When the AI calls it: any time the visitor asks a content question about the current page — “what is this article about?”, “summarize this”, “tell me about this product”, “what’s on this page?”.
Parameters: none.
Returns: a structured text block:

    PAGE: /menu
    TITLE: Dinner Menu — Ember & Oak
    HEADINGS:
    H1: Dinner Menu
    H2: Starters
    H2: Entrees
    H2: Desserts
    H2: Wine list
    SECTIONS:
    [starters] Starters — Roasted carrot soup; Heirloom tomato salad; ...
    [entrees] Entrees — Pan-seared trout; Braised short rib; ...
    NAVIGATION:
    / Home
    /menu Menu
    /reservations Reservations
    CTAs:
    Book a table · Order online
    MAIN: [up to 8 000 chars of body prose]

The AI uses this to ground every spoken answer in your real page content. Without it the model has no idea what page the visitor is on.
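To make the shape of that text block concrete, here is a minimal sketch of how a consumer might split it into labelled parts. The function `parseReadPage` and its return type are hypothetical and not part of the widget. The sketch assumes each label (PAGE, TITLE, HEADINGS, and so on) starts its own line, as in the sample above.

```typescript
// Hypothetical helper, not part of the widget API.
// Labels taken from the sample read_page output.
const LABELS = ["PAGE", "TITLE", "HEADINGS", "SECTIONS", "NAVIGATION", "CTAs", "MAIN"];

type ParsedPage = { [label: string]: string[] };

function parseReadPage(block: string): ParsedPage {
  const parsed: ParsedPage = {};
  let current: string | null = null;
  const lines = block.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "") continue;
    let matched = false;
    for (let j = 0; j < LABELS.length; j++) {
      const prefix = LABELS[j] + ":";
      if (line.indexOf(prefix) === 0) {
        // A known label opens a new group; text after the colon is its first entry.
        current = LABELS[j];
        parsed[current] = [];
        const rest = line.slice(prefix.length).trim();
        if (rest !== "") parsed[current].push(rest);
        matched = true;
        break;
      }
    }
    // Any other line (e.g. "H1: Dinner Menu") belongs to the group opened last.
    if (!matched && current !== null) parsed[current].push(line);
  }
  return parsed;
}

const exampleBlock = [
  "PAGE: /menu",
  "TITLE: Dinner Menu",
  "HEADINGS:",
  "H1: Dinner Menu",
  "H2: Starters",
  "MAIN: body prose",
].join("\n");
const examplePage = parseReadPage(exampleBlock);
console.log(examplePage["HEADINGS"]); // ["H1: Dinner Menu", "H2: Starters"]
```

One limitation of this sketch: a body line inside MAIN that happens to begin with a label such as "PAGE:" would be treated as a new group.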
read_viewport — what’s currently on screen
Returns prose that intersects the visitor’s current viewport — what they’re actually looking at right now. Sections scrolled past or below the fold are not included.
When the AI calls it: visitor says “read this”, “read what’s on screen”, “read it out”.
Parameters: none.
Returns: up to 2 400 chars of viewport-visible prose, joined with newlines. If the visible region exceeds the cap, the response ends with " …" so the AI knows to offer to continue.

    The seasonal prix-fixe menu changes weekly based on what our farms harvest. Each course is paired with a wine from our cellar — see the wine list on page 2. Dietary restrictions can be accommodated with 24h notice. …

read_viewport differs from read_page in scope:
| | read_page | read_viewport |
|---|---|---|
| Scope | Whole document | Currently visible only |
| Char cap | 8 000 | 2 400 |
| Use case | “What’s this page about?” | “Read this.” |
| When called | Once per page navigation | On explicit visitor request |
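The character caps and the trailing " …" marker can be illustrated with a small sketch. The helpers `capProse` and `wasTruncated` are hypothetical names. Whether the real cap counts the marker itself, and whether the widget cuts at word boundaries, is not stated here, so this sketch simply assumes the marker fits inside the cap and cuts at a character offset.

```typescript
// Hypothetical helpers, not the widget's actual implementation.
const ELLIPSIS = " …";

// Join visible text blocks with newlines and enforce a character cap.
// Assumption: the truncation marker counts toward the cap.
function capProse(blocks: string[], cap: number): string {
  const joined = blocks.join("\n");
  if (joined.length <= cap) return joined;
  return joined.slice(0, cap - ELLIPSIS.length) + ELLIPSIS;
}

// The caller can check the marker to decide whether to offer to continue.
function wasTruncated(text: string): boolean {
  return text.slice(-ELLIPSIS.length) === ELLIPSIS;
}

const shortRead = capProse(["First paragraph.", "Second paragraph."], 2400);
console.log(wasTruncated(shortRead)); // false
```

A read_viewport-style call would use a cap of 2 400 and a read_page-style call 8 000, per the table above.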
read_section — read one indexed section aloud
After search_knowledge_base returns a hit with {url, section_id}, the AI can ask to read that specific section in full. Triggers a persistent highlight + auto-scroll on the page while the agent reads it aloud.
When the AI calls it: after a knowledge-base search, when the visitor says “read me that section”, “read the part about X”.
Parameters:

    {
      "url": "/services/water-heater-installation",
      "section_id": "warranty"
    }

Both fields are returned verbatim from the prior search_knowledge_base call.
Returns: up to 1 500 chars of section prose. Ends with " …" if truncated; AI offers to continue reading.
Side effects (browser-side):
- Fetches /v1/<siteId>/read-section to get the full section text
- Finds the element via [data-spelo-section-id="..."] or [data-section-id="..."] or getElementById
- Calls the onSectionRead callback: VoiceWidget mounts a persistent highlight overlay + (if the section is taller than 90% of the viewport) slowly auto-scrolls during the read
The visitor sees the highlight appear right where the agent is reading, even if they scroll away — the highlight tracks the actual section position.
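The element lookup and the auto-scroll condition described above could be sketched as follows. This is an illustration under assumptions, not the widget's code: the names `findSectionElement`, `shouldAutoScroll`, `DocumentLike`, and `ElementLike` are hypothetical, and treating the three lookups as a strict first-match-wins order is an assumption. The small `DocumentLike` interface stands in for the browser `document` so the sketch runs outside a browser.

```typescript
// Minimal stand-ins for the DOM so the sketch is self-contained (hypothetical types).
interface ElementLike { id?: string }
interface DocumentLike {
  querySelector(selector: string): ElementLike | null;
  getElementById(id: string): ElementLike | null;
}

// Try the three lookups in the order the docs list them; first match wins (assumed).
function findSectionElement(doc: DocumentLike, sectionId: string): ElementLike | null {
  // Escape quotes and backslashes so the id is safe inside an attribute selector.
  const escaped = sectionId.replace(/["\\]/g, "\\$&");
  return (
    doc.querySelector('[data-spelo-section-id="' + escaped + '"]') ||
    doc.querySelector('[data-section-id="' + escaped + '"]') ||
    doc.getElementById(sectionId)
  );
}

// Auto-scroll only when the section is taller than 90% of the viewport.
function shouldAutoScroll(sectionHeight: number, viewportHeight: number): boolean {
  return sectionHeight > 0.9 * viewportHeight;
}
```

In a browser, the intended argument for `doc` would be the page's `document`, with heights taken from the section's bounding box and the window's inner height.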
see.snapshot — structured element grid
The keystone of the v2 action protocol. Returns an array of every interactive or textual element on the page with stable ids, roles, names, bounding boxes, and visibility.
When the AI calls it: before any action targeting a specific element — “click the third button”, “fill the email field”, “scroll to the next section”.
Parameters: none.
Returns: JSON array. Each entry:

    {
      "id": "sp-12",
      "role": "button",
      "name": "Book a table",
      "bbox": [620, 480, 140, 44],
      "visible": true
    }

| Field | Meaning |
|---|---|
| id | Stable element id; pass to act.click, act.fill, act.scroll_to. Persists across re-renders if the element keeps its data-spelo-id. |
| role | button · link · checkbox · radio · textbox · select · heading · section · image · other |
| name | Accessible name: aria-label, label[for], placeholder, or visible text |
| bbox | [x, y, width, height] in viewport pixels |
| visible | Whether the element has a non-zero box and is at least partially in the viewport |
| value | (form fields only) current value |
| type | (input fields only) text / email / tel / etc. |
Capped at 150 elements per snapshot to keep the LLM prompt under budget. If your page is denser than that, scrolling triggers a fresh snapshot.
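As a reading aid, the entry shape and the visibility rule from the field table can be written down as types. The names `SnapshotRole`, `SnapshotEntry`, and `isVisible` are hypothetical, and typing `value` and `type` as optional strings is an assumption. The sketch only restates the documented rule: a non-zero box that is at least partially inside the viewport.

```typescript
// Hypothetical types mirroring the documented fields.
type SnapshotRole =
  | "button" | "link" | "checkbox" | "radio" | "textbox"
  | "select" | "heading" | "section" | "image" | "other";

interface SnapshotEntry {
  id: string;                              // e.g. "sp-12"
  role: SnapshotRole;
  name: string;                            // accessible name
  bbox: [number, number, number, number];  // [x, y, width, height] in viewport pixels
  visible: boolean;
  value?: string;                          // form fields only (assumed to be a string)
  type?: string;                           // input fields only
}

// Non-zero box AND at least partially inside the viewport.
function isVisible(
  bbox: [number, number, number, number],
  viewportWidth: number,
  viewportHeight: number
): boolean {
  const x = bbox[0], y = bbox[1], w = bbox[2], h = bbox[3];
  if (w <= 0 || h <= 0) return false;
  return x < viewportWidth && x + w > 0 && y < viewportHeight && y + h > 0;
}

const bookButton: SnapshotEntry = {
  id: "sp-12",
  role: "button",
  name: "Book a table",
  bbox: [620, 480, 140, 44],
  visible: isVisible([620, 480, 140, 44], 1280, 800),
};
console.log(bookButton.visible); // true
```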
Why snapshot-based addressing matters
Compared to the legacy click_element({ text: "Submit" }):
- Icon buttons (heart, ×, ⋯ menu) — legacy can’t find them; snapshot does, because it reads accessible names
- Duplicate labels — three “Add to cart” buttons on a category page — legacy picks the first; snapshot lets the AI pick the right one by bbox / surrounding context
- Dynamic forms — after the AI fills a field that triggers more fields to appear, a fresh snapshot reflects the new state immediately
- Shadow DOM components — snapshot crosses into shadow trees that text-matching can’t reach
For these reasons, the system prompt instructs the AI to prefer the v2 path (see.snapshot → act.*) and use legacy tools only as fallback.
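The duplicate-label case can be made concrete with a small sketch of bbox-based disambiguation. This is one possible heuristic, not the method the AI or the widget actually uses: it picks, among entries sharing a name, the one whose box center is closest to an anchor entry such as a product heading. The names `GridEntry`, `boxCenter`, and `nearestByName` are hypothetical.

```typescript
// Hypothetical, trimmed-down entry shape for this sketch.
interface GridEntry {
  id: string;
  role: string;
  name: string;
  bbox: [number, number, number, number]; // [x, y, width, height]
}

function boxCenter(b: [number, number, number, number]): [number, number] {
  return [b[0] + b[2] / 2, b[1] + b[3] / 2];
}

// Among entries named `label`, return the one closest to `anchor` (by center distance).
function nearestByName(entries: GridEntry[], label: string, anchor: GridEntry): GridEntry | null {
  const a = boxCenter(anchor.bbox);
  let best: GridEntry | null = null;
  let bestDist = Infinity;
  for (let i = 0; i < entries.length; i++) {
    const e = entries[i];
    if (e.name !== label || e.id === anchor.id) continue;
    const c = boxCenter(e.bbox);
    const dx = c[0] - a[0];
    const dy = c[1] - a[1];
    const dist = dx * dx + dy * dy; // squared distance is enough for comparison
    if (dist < bestDist) {
      bestDist = dist;
      best = e;
    }
  }
  return best;
}

const gridExample: GridEntry[] = [
  { id: "sp-1", role: "button", name: "Add to cart", bbox: [40, 300, 120, 40] },
  { id: "sp-2", role: "button", name: "Add to cart", bbox: [340, 300, 120, 40] },
  { id: "sp-3", role: "button", name: "Add to cart", bbox: [640, 300, 120, 40] },
  { id: "sp-9", role: "heading", name: "Blue mug", bbox: [340, 240, 200, 30] },
];
const picked = nearestByName(gridExample, "Add to cart", gridExample[3]);
console.log(picked ? picked.id : null); // "sp-2"
```

A text-only match on "Add to cart" would stop at sp-1, whereas the geometry in the snapshot allows the button under the "Blue mug" heading to be selected.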
Performance and limits
| Tool | Cap | Wall-clock cost |
|---|---|---|
| read_page | 8 000 chars | DOM walk ~20-50 ms |
| read_viewport | 2 400 chars | DOM walk ~10-20 ms |
| read_section | 1 500 chars | + ~80-200 ms HTTP round-trip to read-section endpoint |
| see.snapshot | 150 elements | DOM walk ~15-40 ms |
All four run in the visitor’s browser. Only read_section makes a round-trip to the Spelo server.
See also
- Site intelligence endpoint: the per-session metadata the widget loads on init (different from read_page, which is per-page-view)
- Knowledge & lifecycle tools: search_knowledge_base partners with read_section
- Action tools: what the AI does AFTER it knows what’s on the page