Browser Automation for RAG: When the System You Need Has No API

Every company building serious AI hits the same wall. The data the model needs to be useful is sitting in a SaaS tool with no API, a legacy internal portal, or a vendor system that locked down its API behind enterprise pricing. The model is brilliant. The wrapper is fine. The retrieval layer is empty because the data cannot get out.

Browser automation is the underrated solution. Not for competitor scraping (covered in a separate piece on competitor PDP scraping), but for the much more common operator use case: a company employee with valid credentials needs an AI assistant that can read the data they can read. The model needs the same view. Browser automation gets it that view.

The Specific Problem This Solves

Imagine a property management company. The leasing data, tenant records, and historical notes all live inside a SaaS platform their staff uses daily. The platform has no exposed API. The staff log in, click through screens, read information, and act on it. The company wants an AI assistant that helps staff find information faster, draft communications, and flag anomalies. The assistant needs the tenant data to be useful.

Three options:

Wait for the SaaS vendor to ship an API. Could be never.
Pay for an enterprise tier that exposes an API. Often six figures annually for a mid-size company. Usually not worth it.
Use browser automation with the staff's existing credentials to extract the data the assistant needs. The data was already accessible to humans; now it is accessible to the model.

The third option is what actually ships. The other two are wishful thinking.

The Architecture That Works

1. The Authenticated Browser Session

A headless browser (Playwright or Puppeteer) runs on a server inside the company's infrastructure. It logs into the SaaS tool using credentials stored in a secrets manager. The session persists across runs to avoid re-authenticating constantly.

Critical: the session uses a service account the company controls, not a personal account. The service account has appropriate role-based access: the same scope a junior staff member would have, no more. If the SaaS tool authenticates each account through SSO, the service account goes through the same SSO flow as any other user. The point is to use legitimate access, not to bypass authentication.
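A minimal sketch of the persisted session, in Python with Playwright's sync API. The login URL, the form selectors, the environment variable names, and the state file path are all hypothetical placeholders, and the sketch assumes the secrets manager injects the credentials as environment variables. A real login form will differ.

```python
import os
import time
from pathlib import Path
from typing import Optional

STATE_FILE = Path("session_state.json")        # hypothetical path for saved cookies/storage
LOGIN_URL = "https://saas.example.com/login"   # hypothetical URL
MAX_STATE_AGE_HOURS = 12                       # assumption: log in again twice a day


def state_is_fresh(path: Path, max_age_hours: float, now: Optional[float] = None) -> bool:
    """Return True if a saved session file exists and is younger than max_age_hours."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < max_age_hours * 3600


def open_session(playwright):
    """Return (browser, context), reusing the saved session when it is still fresh."""
    browser = playwright.chromium.launch(headless=True)
    if state_is_fresh(STATE_FILE, MAX_STATE_AGE_HOURS):
        return browser, browser.new_context(storage_state=str(STATE_FILE))

    context = browser.new_context()
    page = context.new_page()
    page.goto(LOGIN_URL)
    # Placeholder selectors; the credentials come from the environment, never from code.
    page.fill("input[name='email']", os.environ["SAAS_SERVICE_USER"])
    page.fill("input[name='password']", os.environ["SAAS_SERVICE_PASSWORD"])
    page.click("button[type='submit']")
    page.wait_for_load_state("networkidle")
    context.storage_state(path=str(STATE_FILE))  # persist cookies for later runs
    page.close()
    return browser, context


def run_once() -> None:
    """Entry point for one scheduled run. Requires `pip install playwright`."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser, context = open_session(p)
        try:
            page = context.new_page()
            page.goto(LOGIN_URL)  # placeholder: navigate to the pages to extract
        finally:
            browser.close()
```

File age is only a proxy for session validity. A sturdier check loads a known page and looks for the login form before trusting the saved state.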

2. The Extraction Layer

The browser navigates to the specific pages where the data lives. For each page, the automation extracts the structured data, not the rendered HTML. This means:

Reading DOM elements with stable selectors (data attributes are best, CSS classes are second, layout positions are last resort)
Downloading attached documents (lease PDFs, statements, reports) to a secure store
Capturing metadata (timestamps, owner attributions) and anything else that helps the model reason about the data later

The output is a normalized record per entity, not a screenshot. Structured records are queryable; screenshots are not.
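To make "a normalized record per entity" concrete, here is a small sketch that turns page markup into records by reading data attributes. It uses Python's standard-library HTML parser so it runs anywhere. The attribute names (data-tenant-id, data-field), the TenantRecord type, and the sample markup are invented for illustration; in a real pipeline the HTML would come from the browser session.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class TenantRecord:
    tenant_id: str
    fields: dict = field(default_factory=dict)


class DataFieldParser(HTMLParser):
    """Group the text of data-field elements under the enclosing data-tenant-id row."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None   # record currently being filled
        self._field = None     # field name whose text we are reading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-tenant-id" in attrs:
            self._current = TenantRecord(tenant_id=attrs["data-tenant-id"])
            self.records.append(self._current)
        elif "data-field" in attrs and self._current is not None:
            self._field = attrs["data-field"]

    def handle_data(self, data):
        if self._field and data.strip():
            self._current.fields[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None  # stop capturing once the element closes


def parse_tenants(html: str) -> list:
    parser = DataFieldParser()
    parser.feed(html)
    return parser.records


# Invented sample markup standing in for a tenant list page.
SAMPLE_HTML = """
<table>
  <tr data-tenant-id="T-100">
    <td data-field="name">Tenant One</td>
    <td data-field="unit">4B</td>
  </tr>
  <tr data-tenant-id="T-101">
    <td data-field="name">Tenant Two</td>
    <td data-field="unit">2A</td>
  </tr>
</table>
"""

records = parse_tenants(SAMPLE_HTML)
```

Each record carries an entity ID and a flat dictionary of fields, which is the shape the storage layer needs.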

3. The Storage and Indexing Layer

Extracted records flow into the company's own data store. Postgres for structured records, object storage for downloaded documents, a vector database for embedded text content. This is your data now. The SaaS vendor cannot pull it back if the relationship ends.
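A sketch of the structured-record half of that store. It uses SQLite from the Python standard library as a stand-in for Postgres so the example is self-contained; the table and column names are hypothetical. The upsert keeps one row per entity, so re-running the extraction updates records instead of duplicating them.

```python
import json
import sqlite3


def init_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that holds one row per tenant."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tenant_records (
               tenant_id    TEXT PRIMARY KEY,
               payload      TEXT NOT NULL,
               extracted_at TEXT NOT NULL
           )"""
    )


def upsert_record(conn: sqlite3.Connection, tenant_id: str, fields: dict, extracted_at: str) -> None:
    """Insert a record, or overwrite the existing row for the same tenant_id."""
    conn.execute(
        """INSERT INTO tenant_records (tenant_id, payload, extracted_at)
           VALUES (?, ?, ?)
           ON CONFLICT(tenant_id) DO UPDATE SET
               payload = excluded.payload,
               extracted_at = excluded.extracted_at""",
        (tenant_id, json.dumps(fields, sort_keys=True), extracted_at),
    )


conn = sqlite3.connect(":memory:")
init_store(conn)
upsert_record(conn, "T-100", {"unit": "4B"}, "2024-01-01T00:00:00Z")
upsert_record(conn, "T-100", {"unit": "5C"}, "2024-01-02T00:00:00Z")  # same tenant, newer run
conn.commit()
```

Postgres accepts the same ON CONFLICT ... DO UPDATE form, though the driver's placeholder syntax will differ from SQLite's question marks.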

From here, the data feeds into an organizational knowledge brain pattern: retrieval over the extracted records, joined to the company's other knowledge sources.

4. The Schedule and Drift Detection

The automation runs on a schedule (daily for most use cases, hourly for high-velocity data). Each run logs how many records were extracted, how that compares to the previous run, and whether the page layout has changed in ways that would break extraction. Schema drift is the silent killer of browser automation pipelines. Alerting on it the moment it happens keeps the pipeline reliable.
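One way to sketch the per-run comparison: keep a small summary of each run and compare it to the previous one. The RunStats shape and the 20% drop threshold are assumptions to tune per pipeline, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    record_count: int
    fields_seen: frozenset  # names of fields that appeared in at least one record


def drift_warnings(previous: RunStats, current: RunStats, max_drop: float = 0.2) -> list:
    """Return human-readable warnings when the current run looks suspicious."""
    warnings = []
    if previous.record_count > 0:
        drop = (previous.record_count - current.record_count) / previous.record_count
        if drop > max_drop:
            warnings.append(
                f"record count fell {drop:.0%} "
                f"({previous.record_count} -> {current.record_count})"
            )
    missing = previous.fields_seen - current.fields_seen
    if missing:
        warnings.append("fields no longer extracted: " + ", ".join(sorted(missing)))
    return warnings


yesterday = RunStats(record_count=100, fields_seen=frozenset({"name", "unit"}))
today = RunStats(record_count=60, fields_seen=frozenset({"name"}))
alerts = drift_warnings(yesterday, today)
```

A non-empty list is the signal to page someone or to hold the run's output back from the index until a human has looked.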

The Legal and Ethical Frame

Browser automation in the use case described here is fundamentally different from competitor scraping. The company is automating its own use of a tool it pays for, with credentials it owns, accessing data it already has the right to access. This is not a CFAA issue. It is a productivity tool.

That said, two caveats are worth raising with your lawyer:

Vendor terms of service. Some SaaS vendors explicitly prohibit automation in their ToS. Enforceability varies. Read the ToS, make a judgment, and document the rationale.
Data export limits. If the vendor's ToS limits exports, automating around that limit is a meaningful contract risk. The right answer is usually to negotiate a higher limit with the vendor, not to ignore it.

For the bulk of internal-tool browser automation, the legal posture is clean. The technical and operational risks are higher than the legal ones.

The Failure Modes

Page Layouts That Change Constantly

The SaaS vendor pushes a UI update and your extraction selectors break. Mitigations: prefer data attributes over CSS classes, prefer text content matching over positional indexing, and run sanity checks on every extraction (record count, expected field presence). When something breaks, you know within hours, not weeks.
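The selector preference can be written down as an ordered fallback list. In this sketch the selectors are hypothetical, and query is any function that takes a selector and returns the matched text or None; with a real browser it would wrap the page lookup. Returning which selector matched matters: a fallback firing is an early warning even when extraction still succeeds.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical selectors for one field, most stable first.
RENT_SELECTORS = [
    "[data-field='monthly-rent']",           # data attribute: most stable
    ".lease-summary .rent-amount",           # CSS class: second choice
    "table tr:nth-child(3) td:nth-child(2)", # position: last resort
]


def first_match(
    query: Callable[[str], Optional[str]],
    selectors: Sequence[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Try selectors in priority order; return (selector, value) for the first hit."""
    for selector in selectors:
        value = query(selector)
        if value is not None and value.strip():
            return selector, value.strip()
    return None, None


# A dict stands in for the page: only the CSS-class selector still matches.
fake_page = {".lease-summary .rent-amount": " $1,850 "}
matched_selector, rent = first_match(fake_page.get, RENT_SELECTORS)
```

Here the data attribute has disappeared and the class selector caught the value. Logging matched_selector on every run makes that kind of silent degradation visible.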

Rate Limits and Bot Detection

The vendor's infrastructure may treat aggressive automation as bot traffic. Throttle requests, randomize timing, and keep the pace close to that of a human session. If the vendor publishes rate limits anywhere, respect them.
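A minimal throttling sketch: a fixed base delay plus random jitter between page visits. The default delays are assumptions, and visit is a stand-in for whatever function loads and extracts a page. The sleep function is injectable so the pacing can be tested without waiting.

```python
import random
import time
from typing import Callable, Iterable, List, Optional


def polite_delays(
    n: int,
    base_seconds: float = 2.0,
    jitter_seconds: float = 1.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return n delays, each base_seconds plus a random amount up to jitter_seconds."""
    rng = rng or random.Random()
    return [base_seconds + rng.uniform(0, jitter_seconds) for _ in range(n)]


def visit_all(
    urls: Iterable[str],
    visit: Callable[[str], object],
    sleep: Callable[[float], None] = time.sleep,
    **delay_kwargs,
) -> list:
    """Visit each URL in turn, pausing after every visit."""
    urls = list(urls)
    results = []
    for url, delay in zip(urls, polite_delays(len(urls), **delay_kwargs)):
        results.append(visit(url))
        sleep(delay)
    return results
```

In production, visit_all(urls, visit=extract_page) would use the real time.sleep, where extract_page is a hypothetical name for the extraction function.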

Credential Risk

The credentials used by the automation are valuable. Rotate them regularly, store them in a secrets manager (not in code, not in a config file), and give the service account they belong to only the minimum permissions needed for the extraction. Compromise of these credentials is a serious incident.
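A small sketch of the "not in code, not in a config file" rule. It assumes the secrets manager injects the credentials into the process environment at runtime; the variable names are hypothetical. The password is excluded from the object's repr so it cannot leak into logs by accident.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping

REQUIRED_VARS = ("SAAS_SERVICE_USER", "SAAS_SERVICE_PASSWORD")  # hypothetical names


@dataclass(frozen=True)
class ServiceCredentials:
    username: str
    password: str = field(repr=False)  # never printed or logged via repr()


def load_credentials(env: Mapping[str, str] = os.environ) -> ServiceCredentials:
    """Read credentials from the environment and fail loudly if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return ServiceCredentials(
        username=env["SAAS_SERVICE_USER"],
        password=env["SAAS_SERVICE_PASSWORD"],
    )
```

Failing at startup when a variable is missing is deliberate: a run that silently proceeds without credentials tends to surface later as a confusing extraction error.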

Treating It as Permanent Infrastructure

The vendor might ship an API next quarter. The right architecture treats browser automation as a temporary bridge. When a real API ships, swap the extraction layer to use the API. The storage, indexing, and retrieval layers stay the same.
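The swap is easy only if the rest of the pipeline never learns where records came from. A sketch of that boundary follows; the class and function names are hypothetical, and the scrape and API calls are passed in as plain functions so the example stays self-contained.

```python
from typing import Callable, Iterable, List, Protocol


class Extractor(Protocol):
    """Anything that can produce normalized records."""

    def fetch_records(self) -> Iterable[dict]: ...


class BrowserExtractor:
    """Today's bridge: records come from the automated browser session."""

    def __init__(self, scrape: Callable[[], Iterable[dict]]):
        self._scrape = scrape

    def fetch_records(self) -> Iterable[dict]:
        return list(self._scrape())


class ApiExtractor:
    """Tomorrow's replacement: records come from the vendor API, once it exists."""

    def __init__(self, call_api: Callable[[], Iterable[dict]]):
        self._call_api = call_api

    def fetch_records(self) -> Iterable[dict]:
        return list(self._call_api())


def run_pipeline(extractor: Extractor, store: List[dict]) -> int:
    """Load records into the store; unaware of which extractor is behind it."""
    count = 0
    for record in extractor.fetch_records():
        store.append(record)
        count += 1
    return count


sample = [{"tenant_id": "T-100", "unit": "4B"}]
browser_store: List[dict] = []
api_store: List[dict] = []
run_pipeline(BrowserExtractor(lambda: sample), browser_store)
run_pipeline(ApiExtractor(lambda: sample), api_store)
```

Both stores end up with identical contents, which is the property that lets the browser automation be deleted later without touching storage, indexing, or retrieval.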

When Browser Automation Is the Wrong Answer

If the vendor has a real API that does what you need, use the API. If the data you need is purely public web data, that is an external scraping problem and calls for a different pattern. If the volume of extraction is small enough that an employee could realistically do it manually, a workflow tool with a human in the loop is simpler.

Browser automation is for the case where the data is genuinely needed, the vendor has no API, the volume is high enough that humans cannot do it, and the company has clean legal grounds to automate its own access.

Frequently Asked Questions

Is browser automation against the ToS of most SaaS tools?

It varies. Some tools explicitly prohibit it, some are silent, some explicitly allow it. Read the specific ToS. The bigger question is whether the vendor enforces it; in practice, most vendors care about competitive scraping by outsiders, not internal automation by paying customers.

What tools do you use for browser automation?

Playwright is the current default. Puppeteer is an older alternative that still works. Both are mature, well-supported, and run headless reliably. The choice between them is preference more than capability at this point.

How do you handle authentication when the tool uses SSO?

The automation goes through the SSO flow programmatically. For most providers, this means having a real human complete the SSO login and consent screen once, capturing the resulting session state, and reusing that state in the automation. For some providers, programmatic OAuth flows are cleaner.
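A sketch of the one-time capture step, in Python with Playwright's sync API. It opens a visible browser, waits for a human to finish the SSO login, and saves the session state to a file that headless runs can load later. The function names are hypothetical, and the start URL and state path are supplied by the caller.

```python
import json
from pathlib import Path


def capture_state_interactively(start_url: str, state_path: Path) -> None:
    """Open a visible browser, let a human log in, then save the session state.

    Requires `pip install playwright`; imported lazily so the helper below
    works without it.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible window for the human
        context = browser.new_context()
        page = context.new_page()
        page.goto(start_url)
        input("Complete the SSO login in the browser window, then press Enter here...")
        context.storage_state(path=str(state_path))  # cookies and local storage
        browser.close()


def state_summary(state_path: Path) -> dict:
    """Count cookies and origins in a saved state file, as a quick sanity check."""
    state = json.loads(state_path.read_text())
    return {
        "cookies": len(state.get("cookies", [])),
        "origins": len(state.get("origins", [])),
    }
```

A saved state with zero cookies usually means the capture ran before the login finished. The state file is as sensitive as the credentials themselves and belongs in the same secrets store.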

What happens when the vendor finally ships an API?

You swap the extraction layer to use the API and remove the browser automation. The storage, indexing, and retrieval layers stay. The work was not wasted; the data was useful in the interim.

Can this be done at consumer scale?

Mostly no. Browser automation has unit economics that work for internal extraction (one company's data, scheduled, low cost per extraction). It does not work for consumer products that need to extract data from arbitrary third-party accounts at runtime.

If the AI you want to build is bottlenecked on data trapped in a SaaS tool, browser automation is probably the answer.
