
ADR: Universal Adapter for Municipal Data Extraction

Status: Accepted
Date: 2026-03-13

Decision

Introduce a declarative extraction config interpreted by a generic UniversalExtractor client. An LLM generates the config at onboard time by inspecting sample HTML; extraction itself is deterministic — no LLM calls at runtime. Platform-specific clients remain for platforms with stable APIs (Legistar, CivicClerk, eScribe). The universal adapter targets HTML-rendered platforms that lack APIs: custom municipal sites, Simbli variants, and the long tail.

Context

CivicOS has 6 platform-specific extraction clients covering the major meeting platforms:

Client       Approach                           Stability
Legistar     REST API (OData)                   High — structured API
CivicClerk   REST API (OData)                   High — structured API
eScribe      JSON API (AJAX)                    High — structured API
Granicus     HTML scraping + LLM column map     Medium — LLM generates config once
Simbli       Playwright + hardcoded regex/CSS   Low — brittle to DOM changes
ProudCity    HTML scraping (bespoke)            Medium — custom selectors

The problem: Adding a new platform requires writing a new client class. Major cities like NYC run custom platforms. Simbli's regex-based parsing breaks across districts. We need a way to onboard arbitrary HTML-based meeting pages without writing bespoke code each time.

Existing proof of concept: Granicus's generate_column_map() already uses an LLM to infer table structure from sample HTML, then extracts deterministically using the inferred mapping. The universal adapter generalizes this pattern beyond tables.

Architecture

Overview

┌─────────────────────────────────────────────────┐
│                   Onboard Time                   │
│                                                  │
│  URL → fetch HTML → LLM infers config → validate │
│         (Playwright)    (structured output)       │
│                          │                        │
│                          ▼                        │
│               ExtractionConfig.metadata           │
│               {                                   │
│                 "adapter": { ... }                │
│               }                                   │
└─────────────────────────────────────────────────┘
                       │  saved to data/extraction/{jurisdiction}.json
┌─────────────────────────────────────────────────┐
│                 Extraction Time                   │
│                                                  │
│  URL → fetch HTML → apply config → Meeting[]     │
│         (Playwright    (CSS selectors,            │
│          or requests)   date parsing,             │
│                         pagination)               │
│                                                  │
│  No LLM calls. Deterministic. Auditable.         │
└─────────────────────────────────────────────────┘

Adapter Config Schema

The LLM generates a declarative config stored in ExtractionConfig.metadata["adapter"]:

{
  "adapter": {
    "version": 1,
    "page_type": "table" | "list" | "card",
    "listing": {
      "url_template": "https://example.gov/meetings?page={page}",
      "container": "table.meetings-list",
      "row": "tbody tr",
      "fields": {
        "title": { "selector": "td:nth-child(1)", "extract": "text" },
        "date": { "selector": "td:nth-child(2)", "extract": "text", "date_format": "%B %d, %Y" },
        "time": { "selector": "td:nth-child(3)", "extract": "text" },
        "agenda_url": { "selector": "td:nth-child(4) a", "extract": "href" },
        "minutes_url": { "selector": "td:nth-child(5) a", "extract": "href" },
        "video_url": { "selector": "td:nth-child(6) a", "extract": "href" }
      }
    },
    "pagination": {
      "type": "none" | "next_link" | "page_param" | "load_more",
      "next_selector": "a.next-page",
      "max_pages": 10
    },
    "detail": {
      "url_field": "agenda_url",
      "fields": {
        "title": { "selector": "h1", "extract": "text" },
        "time": { "selector": "time.datetime", "extract": "text" },
        "location": { "selector": "strong", "extract": "text" },
        "video_url": { "selector": "a[href*='youtube.com']", "extract": "href" }
      }
    },
    "requires_javascript": false,
    "provenance": {
      "sample_url": "https://example.gov/meetings",
      "sample_html_hash": "sha256:abc123...",
      "generated_at": "2026-03-13T10:00:00Z",
      "prompt_version": "universal_adapter/v2"
    }
  }
}

Required fields: title and date in the listing. All others optional.
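For illustration, a minimal jsonschema fragment (jsonschema is already a dependency, per Implementation Notes) that the schema-validation step could use to enforce this requirement on the listing block; LISTING_SCHEMA and check_listing are hypothetical names:

import jsonschema

# Minimal, illustrative schema for the "listing" block: "title" and
# "date" field specs must be present; other fields are optional.
LISTING_SCHEMA = {
    "type": "object",
    "required": ["container", "row", "fields"],
    "properties": {
        "container": {"type": "string"},
        "row": {"type": "string"},
        "fields": {
            "type": "object",
            "required": ["title", "date"],
        },
    },
}

def check_listing(adapter: dict) -> None:
    """Raise jsonschema.ValidationError if the listing block is malformed."""
    jsonschema.validate(adapter["listing"], LISTING_SCHEMA)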

Two-level extraction: Many municipal sites have thin listing pages (just links) with rich detail pages. The optional detail section tells the adapter to follow each listing link and extract additional fields (time, location, video). The LLM generates both configs at onboard time by sampling one listing page and one detail page. Extraction remains deterministic.

extract modes:

  • "text" — .get_text(strip=True)
  • "href" — ["href"], resolved to an absolute URL
  • "attr:NAME" — arbitrary attribute
  • "html" — inner HTML (for rich content)

page_type variants:

  • "table" — Standard HTML table (container is <table>, row is <tr>)
  • "list" — <ul>/<ol> or <div> list (container is the wrapper, row is the list item)
  • "card" — Repeated <div> cards (common in modern CMS platforms)
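The four extract modes map naturally onto BeautifulSoup calls (the .get_text and ["href"] idioms named above). A minimal sketch of a field-extraction helper, assuming bs4; extract_field is a hypothetical name:

from urllib.parse import urljoin
from bs4 import Tag

def extract_field(row: Tag, spec: dict, base_url: str) -> str | None:
    """Apply one field spec ({"selector": ..., "extract": ...}) to a row."""
    el = row.select_one(spec["selector"])
    if el is None:
        return None
    mode = spec["extract"]
    if mode == "text":
        return el.get_text(strip=True)
    if mode == "href":
        # Relative links are resolved against the listing URL.
        return urljoin(base_url, el.get("href", ""))
    if mode.startswith("attr:"):
        return el.get(mode.split(":", 1)[1])
    if mode == "html":
        return el.decode_contents()
    raise ValueError(f"Unknown extract mode: {mode}")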

UniversalExtractor Class

class UniversalExtractor(BaseExtractor):
    """
    Generic extractor driven by a declarative adapter config.

    Does not contain platform-specific logic. All extraction behavior
    comes from the config generated at onboard time.
    """

    def __init__(self, jurisdiction_id: str, config: dict):
        super().__init__(jurisdiction_id)
        self.adapter = config  # The "adapter" dict from ExtractionConfig

    @property
    def platform_name(self) -> str:
        return "universal"

    def get_events(self, days_ahead=90, days_past=0):
        # 1. Fetch page(s) using url_template + pagination config
        # 2. Select container → rows using CSS selectors
        # 3. Extract fields per row using selector + extract mode
        # 4. Parse dates using date_format
        # 5. Filter by date range
        # 6. Return raw dicts
        ...

    def normalize_event(self, event):
        # Map extracted fields to Meeting dataclass
        ...
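For concreteness, a minimal sketch of the listing pass those numbered comments describe, assuming BeautifulSoup and the extract_field helper sketched earlier; pagination and the detail-page follow are omitted:

from datetime import datetime
from bs4 import BeautifulSoup

class ExtractionError(Exception):
    """Stand-in for the project's extraction exception (see Failure Modes)."""

def extract_listing(adapter: dict, html: str, base_url: str) -> list[dict]:
    listing = adapter["listing"]
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(listing["container"])
    if container is None:
        # Fail explicitly rather than silently returning [].
        raise ExtractionError(
            f"Container selector {listing['container']!r} matched 0 elements"
        )
    events = []
    for row in container.select(listing["row"]):
        event = {
            name: extract_field(row, spec, base_url)
            for name, spec in listing["fields"].items()
        }
        # Parse the date with the configured format; an unparseable date
        # is kept raw and flagged downstream (see Failure Modes).
        date_spec = listing["fields"]["date"]
        if event.get("date") and "date_format" in date_spec:
            try:
                event["date"] = datetime.strptime(event["date"], date_spec["date_format"])
            except ValueError:
                event["date_parse_failed"] = True
        events.append(event)
    return events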

LLM Config Generation (Onboard Time)

Extends the Granicus generate_column_map() pattern:

def generate_adapter_config(url: str) -> dict:
    """
    Fetch a municipal meeting page and use LLM to infer extraction config.

    Returns adapter config dict with provenance.
    Raises RuntimeError if LLM cannot produce a valid config.
    """
    # 1. Fetch page (Playwright if JS-heavy, requests otherwise)
    html = fetch_page(url)

    # 2. Truncate to relevant section (largest table/list)
    sample = extract_sample(html, max_tokens=4000)

    # 3. Ask LLM to produce adapter config
    config = llm_infer_config(sample)

    # 4. Validate: required fields present, selectors parse, dates extract
    validate_adapter_config(config, html)

    # 5. Test extraction: run config against the sample page
    test_results = test_extract(config, html)
    if len(test_results) == 0:
        raise RuntimeError("Config produced 0 results on sample page")

    return config

Validation steps (critical):

  1. JSON schema validation — all required fields present, correct types
  2. Selector validation — each CSS selector parses without error
  3. Smoke extraction — run against the sample page, require ≥1 result
  4. Date parsing — at least one extracted date parses with the given format
  5. Title check — extracted titles are non-empty strings
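A sketch of steps 2 and 3, assuming soupsieve (the selector engine bundled with bs4) and the extract_listing helper sketched earlier; the error messages are illustrative:

import soupsieve

def validate_selectors(adapter: dict) -> None:
    """Step 2: every selector in the config must parse."""
    listing = adapter["listing"]
    selectors = [listing["container"], listing["row"]]
    selectors += [spec["selector"] for spec in listing["fields"].values()]
    for spec in (adapter.get("detail") or {}).get("fields", {}).values():
        selectors.append(spec["selector"])
    for sel in selectors:
        soupsieve.compile(sel)  # raises SelectorSyntaxError on bad syntax

def smoke_extract(adapter: dict, sample_html: str, sample_url: str) -> None:
    """Steps 3 and 5: the config must pull ≥1 row with a non-empty title."""
    rows = extract_listing(adapter, sample_html, sample_url)
    if not rows:
        raise RuntimeError("Config produced 0 results on sample page")
    if not any(r.get("title") for r in rows):
        raise RuntimeError("No non-empty titles extracted from sample page")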

If validation fails, the LLM is re-prompted once with the error. If it fails again, onboarding falls back to manual config or raises an error for human review.
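A sketch of that retry policy, reusing the helper names from generate_adapter_config above; the error_feedback argument to llm_infer_config is a hypothetical extension:

def generate_with_retry(url: str) -> dict:
    html = fetch_page(url)
    sample = extract_sample(html, max_tokens=4000)
    try:
        config = llm_infer_config(sample)
        validate_adapter_config(config, html)
        return config
    except Exception as first_error:
        # One re-prompt, feeding the validation error back to the LLM.
        config = llm_infer_config(sample, error_feedback=str(first_error))
        # A second failure propagates for human review.
        validate_adapter_config(config, html)
        return config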

Integration with Factory

# In factory.py
def create_source(config: ExtractionConfig) -> DataSource:
    if config.source_type == "legistar":
        return LegistarClient(...)
    elif config.source_type == "granicus":
        return GranicusSource(...)
    # ... existing platforms ...
    elif config.source_type == "universal":
        return UniversalExtractor(
            config.jurisdiction_id,
            config.metadata["adapter"]
        )
    else:
        # Fail explicitly rather than fall through and return None
        raise ValueError(f"Unknown source_type: {config.source_type}")

Integration with Platform Detection

When existing platform detectors all return negative, the discovery chain falls through to:

def _detect_universal(url: str) -> Optional[PlatformDetection]:
    """
    Last-resort detection: check if page has meeting-like content.

    Returns a detection with platform='universal' and lower confidence
    than specific platform detections.
    """
    # Heuristic: page contains date patterns + meeting-related keywords
    # Confidence: 0.30-0.50 (always lower than specific platform detections)

This ensures specific platforms are always preferred, with the universal adapter as a fallback.
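A sketch of that heuristic; the keyword list, thresholds, and confidence scaling below are illustrative, not the production detector:

import re

MEETING_KEYWORDS = ("agenda", "minutes", "council", "meeting", "hearing")
DATE_PATTERN = re.compile(
    r"\b(?:january|february|march|april|may|june|july|august|"
    r"september|october|november|december)\s+\d{1,2},?\s+\d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    re.IGNORECASE,
)

def looks_like_meeting_page(page_text: str) -> float:
    """Return a confidence in [0.30, 0.50], or 0.0 when there is no signal."""
    dates = len(DATE_PATTERN.findall(page_text))
    keywords = sum(page_text.lower().count(k) for k in MEETING_KEYWORDS)
    if dates >= 3 and keywords >= 2:
        # Scale with signal strength, capped below specific detectors.
        return min(0.50, 0.30 + 0.02 * min(dates, 10))
    return 0.0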

Failure Modes

The universal adapter must fail explicitly, never silently return 0 results.

Detection: Config Drift

Meeting pages change over time — selectors break. Detection strategy:

Signal                              Meaning                      Response
0 rows extracted                    Selector completely broken   Raise ExtractionError, log alert
Row count drops >50% vs. last run   Partial breakage             Warn, return partial results with flag
Date parsing failures >30%          Format changed               Warn, return what parses, flag the rest
HTTP error on listing URL           Page moved/removed           Raise ExtractionError

Mitigation: Health Checks

UniversalExtractor.health() runs the config against the live page and checks:

  • Container selector matches ≥1 element
  • Row selector matches ≥1 element within the container
  • At least 1 title and 1 date extract successfully
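A minimal sketch of that check, reusing extract_field from earlier; HealthResult is a hypothetical return type:

from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class HealthResult:
    ok: bool
    detail: str = ""

def check_health(adapter: dict, live_html: str, base_url: str) -> HealthResult:
    listing = adapter["listing"]
    soup = BeautifulSoup(live_html, "html.parser")
    container = soup.select_one(listing["container"])
    if container is None:
        return HealthResult(False, f"container {listing['container']!r} matched 0 elements")
    rows = container.select(listing["row"])
    if not rows:
        return HealthResult(False, f"row {listing['row']!r} matched 0 elements")
    titles = [extract_field(r, listing["fields"]["title"], base_url) for r in rows]
    dates = [extract_field(r, listing["fields"]["date"], base_url) for r in rows]
    if not any(titles) or not any(dates):
        return HealthResult(False, "no row yielded a title and a date")
    return HealthResult(True)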

Failed health checks trigger re-generation of the adapter config (with human approval).

Versus Current Silent Failures

Current Simbli behavior when selectors break:

# Returns [] silently — indistinguishable from "no meetings scheduled"
meetings = simbli.get_events()  # len(meetings) == 0

Universal adapter behavior:

# Raises with diagnostics
ExtractionError(
    "Container selector 'table.meetings-list' matched 0 elements. "
    "Page may have changed. Last successful: 2026-03-10. "
    "Re-run config generation or inspect page manually."
)

Migration Path

Phase 1: New Platforms (immediate)

Use the universal adapter for cities with no existing client. Portland, OR is the first test case — it uses a custom layout at portland.gov/council/agenda/all.

Phase 2: Simbli Migration (after validation)

Simbli is the first existing client to migrate:

  1. Generate adapter config for the SRCS Simbli instance
  2. Run both old and new extractors in parallel, compare outputs
  3. When output matches for 2+ weeks, switch to the universal adapter
  4. Delete the Simbli-specific regex/selector code
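For step 2, a sketch of the parallel-run comparison; compare_extractors and the (date, title) identity key are illustrative:

def compare_extractors(old_events: list[dict], new_events: list[dict]) -> list[str]:
    """Return human-readable diffs between old and new extractor output."""
    def key(e: dict) -> tuple:
        # Assumed identity: a meeting is its (date, title) pair.
        return (e.get("date"), e.get("title"))

    old_keys = {key(e) for e in old_events}
    new_keys = {key(e) for e in new_events}
    diffs = []
    for missing in old_keys - new_keys:
        diffs.append(f"universal adapter missed: {missing}")
    for extra in new_keys - old_keys:
        diffs.append(f"universal adapter added: {extra}")
    return diffs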

Phase 3: Other HTML Clients (as needed)

ProudCity and any other HTML-scraping clients can migrate if the universal adapter proves reliable. API-based clients (Legistar, CivicClerk, eScribe) stay as-is — they have stable, structured interfaces that don't benefit from this pattern.

Rationale

Why onboard-time LLM, not extraction-time?

Factor         Onboard-time                            Extraction-time
Cost           1 LLM call per platform                 1 LLM call per extraction run
Determinism    Config is fixed; results reproducible   Results vary with LLM nondeterminism
Latency        Extraction is pure HTML parsing         Each run waits on the LLM
Auditability   Config is inspectable JSON              Must log every LLM interaction
Failure mode   Config drift is detectable              Failures are intermittent

Onboard-time wins on every axis. This matches the proven Granicus pattern.

Why declarative config, not generated code?

Generated code (e.g., LLM writes a Python scraper) is harder to validate, harder to sandbox, and creates maintenance burden. A declarative config is:

  • Inspectable — humans can read and fix it
  • Validatable — JSON schema + smoke test
  • Sandboxed — no arbitrary code execution
  • Versionable — config changes are diffable

Why keep platform-specific clients?

API-based platforms (Legistar, CivicClerk, eScribe) have stable, documented interfaces. A universal HTML scraper adds complexity without benefit for these. The universal adapter specifically targets:

  • Platforms with no API (HTML-only)
  • Platforms where the HTML structure varies across instances (Simbli)
  • Custom municipal sites with no shared platform

Alternatives Considered

1. LLM at Extraction Time (Adaptive Parsing)

Send each page to the LLM for parsing on every run. Rejected: expensive ($0.01-0.05 per page × hundreds of jurisdictions × daily runs), non-deterministic, high latency.

2. Generated Python Scrapers

Have the LLM write a Python scraper class per platform. Rejected: arbitrary code execution risk, harder to validate than declarative config, maintenance burden when pages change.

3. Third-party Scraping Services (Firecrawl, Jina Reader)

Use external services for HTML-to-structured-data. Rejected: adds external dependency, cost per request, data leaves our infrastructure, less control over extraction quality.

4. One Bespoke Client Per City

Continue writing platform-specific clients. Rejected for the long tail: works for 6 platforms, does not scale to hundreds of custom municipal sites. Still the right choice for stable API platforms.

5. Community-Contributed Configs

Publish the config schema and let civic tech volunteers contribute configs for their cities. Not rejected — this is a future possibility enabled by the declarative config approach, but not part of initial implementation.

Implementation Notes

  • UniversalExtractor lives in clients/universal.py
  • Config generation lives in clients/universal_config.py
  • Factory dispatch: source_type == "universal" → UniversalExtractor
  • Playwright is used for JS-heavy pages (requires_javascript: true), requests otherwise
  • Config schema validation uses jsonschema (already a dependency)
  • Provenance tracking follows the Granicus pattern: sample HTML hash, prompt version, raw LLM response
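For illustration, a sketch of how the provenance record from the adapter schema could be built; build_provenance is a hypothetical helper:

import hashlib
from datetime import datetime, timezone

PROMPT_VERSION = "universal_adapter/v2"  # matches the schema example above

def build_provenance(sample_url: str, sample_html: str) -> dict:
    """Build the "provenance" block stored alongside the adapter config."""
    digest = hashlib.sha256(sample_html.encode("utf-8")).hexdigest()
    return {
        "sample_url": sample_url,
        "sample_html_hash": f"sha256:{digest}",
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "prompt_version": PROMPT_VERSION,
    }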

References

  • clients/granicus.py:108-221 — Existing LLM config generation pattern
  • clients/simbli.py — Brittle extraction to migrate
  • clients/base.py:413-468 — BaseExtractor interface
  • Granicus column map prompt — Lines 161-175