# ADR: Universal Adapter for Municipal Data Extraction

Status: Accepted
Date: 2026-03-13
## Decision
Introduce a declarative extraction config interpreted by a generic UniversalExtractor client. An LLM generates the config at onboard time by inspecting sample HTML; extraction itself is deterministic — no LLM calls at runtime. Platform-specific clients remain for platforms with stable APIs (Legistar, CivicClerk, eScribe). The universal adapter targets HTML-rendered platforms that lack APIs: custom municipal sites, Simbli variants, and the long tail.
## Context
CivicOS has 6 platform-specific extraction clients covering the major meeting platforms:
| Client | Approach | Stability |
|---|---|---|
| Legistar | REST API (OData) | High — structured API |
| CivicClerk | REST API (OData) | High — structured API |
| eScribe | JSON API (AJAX) | High — structured API |
| Granicus | HTML scraping + LLM column map | Medium — LLM generates config once |
| Simbli | Playwright + hardcoded regex/CSS | Low — brittle to DOM changes |
| ProudCity | HTML scraping (bespoke) | Medium — custom selectors |
The problem: Adding a new platform requires writing a new client class. Major cities like NYC run custom platforms. Simbli's regex-based parsing breaks across districts. We need a way to onboard arbitrary HTML-based meeting pages without writing bespoke code each time.
Existing proof of concept: Granicus's `generate_column_map()` already uses an LLM to infer table structure from sample HTML, then extracts deterministically using the inferred mapping. The universal adapter generalizes this pattern beyond tables.
## Architecture

### Overview
```
┌──────────────────────────────────────────────────┐
│                   Onboard Time                   │
│                                                  │
│  URL → fetch HTML → LLM infers config → validate │
│       (Playwright)  (structured output)          │
│                        │                         │
│                        ▼                         │
│            ExtractionConfig.metadata             │
│            {                                     │
│              "adapter": { ... }                  │
│            }                                     │
└──────────────────────────────────────────────────┘
                         │
                         │ saved to data/extraction/{jurisdiction}.json
                         ▼
┌──────────────────────────────────────────────────┐
│                 Extraction Time                  │
│                                                  │
│  URL → fetch HTML → apply config → Meeting[]     │
│       (Playwright   (CSS selectors,              │
│        or requests)  date parsing,               │
│                      pagination)                 │
│                                                  │
│  No LLM calls. Deterministic. Auditable.         │
└──────────────────────────────────────────────────┘
```
### Adapter Config Schema

The LLM generates a declarative config stored in `ExtractionConfig.metadata["adapter"]`:
```json
{
  "adapter": {
    "version": 1,
    "page_type": "table" | "list" | "card",
    "listing": {
      "url_template": "https://example.gov/meetings?page={page}",
      "container": "table.meetings-list",
      "row": "tbody tr",
      "fields": {
        "title": { "selector": "td:nth-child(1)", "extract": "text" },
        "date": { "selector": "td:nth-child(2)", "extract": "text", "date_format": "%B %d, %Y" },
        "time": { "selector": "td:nth-child(3)", "extract": "text" },
        "agenda_url": { "selector": "td:nth-child(4) a", "extract": "href" },
        "minutes_url": { "selector": "td:nth-child(5) a", "extract": "href" },
        "video_url": { "selector": "td:nth-child(6) a", "extract": "href" }
      }
    },
    "pagination": {
      "type": "none" | "next_link" | "page_param" | "load_more",
      "next_selector": "a.next-page",
      "max_pages": 10
    },
    "detail": {
      "url_field": "agenda_url",
      "fields": {
        "title": { "selector": "h1", "extract": "text" },
        "time": { "selector": "time.datetime", "extract": "text" },
        "location": { "selector": "strong", "extract": "text" },
        "video_url": { "selector": "a[href*='youtube.com']", "extract": "href" }
      }
    },
    "requires_javascript": false,
    "provenance": {
      "sample_url": "https://example.gov/meetings",
      "sample_html_hash": "sha256:abc123...",
      "generated_at": "2026-03-13T10:00:00Z",
      "prompt_version": "universal_adapter/v2"
    }
  }
}
```
Required fields: `title` and `date` in the listing. All others are optional.
Two-level extraction: Many municipal sites have thin listing pages (just links) with rich detail pages. The optional detail section tells the adapter to follow each listing link and extract additional fields (time, location, video). The LLM generates both configs at onboard time by sampling one listing page and one detail page. Extraction remains deterministic.
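The listing-then-detail flow described above can be sketched as follows. This is illustrative only: `extract_with_detail` is a hypothetical name, and the `fetch` and `extract_fields` callables stand in for the real page-fetch and field-extraction machinery.

```python
# Sketch only: assumes listing rows have already been extracted as dicts.
def extract_with_detail(listing_rows, adapter, fetch, extract_fields):
    detail_cfg = adapter.get("detail")
    meetings = []
    for row in listing_rows:
        record = dict(row)
        if detail_cfg:
            detail_url = record.get(detail_cfg["url_field"])
            if detail_url:
                detail_html = fetch(detail_url)
                extra = extract_fields(detail_html, detail_cfg["fields"])
                for key, value in extra.items():
                    # Fill gaps only; listing values win (a policy choice)
                    record.setdefault(key, value)
        meetings.append(record)
    return meetings
```

Whether detail fields may overwrite listing fields is a policy choice; this sketch only fills keys the listing page did not provide.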
`extract` modes:

- `"text"` — `.get_text(strip=True)`
- `"href"` — `["href"]`, resolved to an absolute URL
- `"attr:NAME"` — arbitrary attribute
- `"html"` — inner HTML (for rich content)
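The four modes can be illustrated with a small dispatcher. This is a sketch using the standard library's `ElementTree` as a stand-in for the real HTML parser, and `apply_extract_mode` is a hypothetical name:

```python
from urllib.parse import urljoin
from xml.etree import ElementTree as ET

def apply_extract_mode(el: ET.Element, mode: str, base_url: str) -> str:
    if mode == "text":
        return "".join(el.itertext()).strip()
    if mode == "href":
        # Relative links are resolved against the page URL
        return urljoin(base_url, el.get("href", ""))
    if mode.startswith("attr:"):
        return el.get(mode.split(":", 1)[1], "")
    if mode == "html":
        # Inner HTML: leading text plus serialized children
        inner = el.text or ""
        return inner + "".join(ET.tostring(c, encoding="unicode") for c in el)
    raise ValueError(f"Unknown extract mode: {mode}")

cell = ET.fromstring('<td><a href="/agendas/42.pdf" data-id="42">Agenda</a></td>')
link = cell.find("a")
print(apply_extract_mode(link, "text", "https://example.gov/meetings"))   # Agenda
print(apply_extract_mode(link, "href", "https://example.gov/meetings"))   # https://example.gov/agendas/42.pdf
print(apply_extract_mode(link, "attr:data-id", "https://example.gov/meetings"))  # 42
```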
`page_type` variants:

- `"table"` — standard HTML table (`container` is the `<table>`, `row` is `<tr>`)
- `"list"` — `<ul>`/`<ol>` or `<div>` list (`container` is the wrapper, `row` is the list item)
- `"card"` — repeated `<div>` cards (common in modern CMS platforms)
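The `pagination` section of the config drives page iteration. A sketch of the non-interactive types (`iter_pages`, `fetch`, and `find_next` are illustrative stand-ins; `"load_more"` is omitted because it requires Playwright interaction):

```python
def iter_pages(url_template, pagination, fetch, find_next):
    """Yield raw HTML for each listing page, per the pagination config."""
    ptype = pagination.get("type", "none")
    max_pages = pagination.get("max_pages", 10)
    if ptype == "none":
        yield fetch(url_template)
    elif ptype == "page_param":
        # url_template contains a {page} placeholder
        for page in range(1, max_pages + 1):
            yield fetch(url_template.format(page=page))
    elif ptype == "next_link":
        url = url_template
        for _ in range(max_pages):
            html = fetch(url)
            yield html
            url = find_next(html, pagination["next_selector"])
            if not url:
                break
```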
### `UniversalExtractor` Class
```python
class UniversalExtractor(BaseExtractor):
    """
    Generic extractor driven by a declarative adapter config.

    Does not contain platform-specific logic. All extraction behavior
    comes from the config generated at onboard time.
    """

    def __init__(self, jurisdiction_id: str, config: dict):
        super().__init__(jurisdiction_id)
        self.adapter = config  # The "adapter" dict from ExtractionConfig

    @property
    def platform_name(self) -> str:
        return "universal"

    def get_events(self, days_ahead=90, days_past=0):
        # 1. Fetch page(s) using url_template + pagination config
        # 2. Select container → rows using CSS selectors
        # 3. Extract fields per row using selector + extract mode
        # 4. Parse dates using date_format
        # 5. Filter by date range
        # 6. Return raw dicts
        ...

    def normalize_event(self, event):
        # Map extracted fields to Meeting dataclass
        ...
```
### LLM Config Generation (Onboard Time)

Extends the Granicus `generate_column_map()` pattern:
```python
def generate_adapter_config(url: str) -> dict:
    """
    Fetch a municipal meeting page and use LLM to infer extraction config.

    Returns adapter config dict with provenance.
    Raises RuntimeError if LLM cannot produce a valid config.
    """
    # 1. Fetch page (Playwright if JS-heavy, requests otherwise)
    html = fetch_page(url)

    # 2. Truncate to relevant section (largest table/list)
    sample = extract_sample(html, max_tokens=4000)

    # 3. Ask LLM to produce adapter config
    config = llm_infer_config(sample)

    # 4. Validate: required fields present, selectors parse, dates extract
    validate_adapter_config(config, html)

    # 5. Test extraction: run config against the sample page
    test_results = test_extract(config, html)
    if len(test_results) == 0:
        raise RuntimeError("Config produced 0 results on sample page")

    return config
```
Validation steps (critical):

1. JSON schema validation — all required fields, correct types
2. Selector validation — each CSS selector parses without error
3. Smoke extraction — run against the sample page, require ≥1 result
4. Date parsing — at least one extracted date parses with the given format
5. Title check — extracted titles are non-empty strings
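Step 1 uses `jsonschema` in the codebase; the core required-field rules can be sketched in plain Python for illustration (`check_required` is a hypothetical name):

```python
def check_required(adapter: dict) -> list[str]:
    """Return a list of schema errors; an empty list means step 1 passes."""
    errors = []
    listing = adapter.get("listing", {})
    fields = listing.get("fields", {})
    for key in ("container", "row"):
        if not listing.get(key):
            errors.append(f"listing.{key} is missing")
    for name in ("title", "date"):  # the only required fields
        if name not in fields:
            errors.append(f"required field '{name}' missing from listing.fields")
        elif not fields[name].get("selector"):
            errors.append(f"listing.fields.{name}.selector is missing")
    return errors
```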
If validation fails, the LLM is re-prompted once with the error. If it fails again, onboarding falls back to manual config or raises an error for human review.
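The one-retry policy can be sketched as follows; the `infer` and `validate` callables stand in for `llm_infer_config` and `validate_adapter_config`, and the function name is illustrative:

```python
def generate_with_retry(sample, html, infer, validate):
    config = infer(sample, error=None)
    try:
        validate(config, html)
        return config
    except ValueError as err:
        # Re-prompt once, attaching the validation error to the prompt
        config = infer(sample, error=str(err))
        validate(config, html)  # a second failure propagates for human review
        return config
```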
### Integration with Factory
```python
# In factory.py
def create_source(config: ExtractionConfig) -> DataSource:
    if config.source_type == "legistar":
        return LegistarClient(...)
    elif config.source_type == "granicus":
        return GranicusSource(...)
    # ... existing platforms ...
    elif config.source_type == "universal":
        return UniversalExtractor(
            config.jurisdiction_id,
            config.metadata["adapter"],
        )
```
### Integration with Platform Detection
When existing platform detectors all return negative, the discovery chain falls through to:
```python
def _detect_universal(url: str) -> Optional[PlatformDetection]:
    """
    Last-resort detection: check if page has meeting-like content.

    Returns a detection with platform='universal' and lower confidence
    than specific platform detections.
    """
    # Heuristic: page contains date patterns + meeting-related keywords
    # Confidence: 0.30-0.50 (always lower than specific platform detections)
```
This ensures specific platforms are always preferred, with the universal adapter as a fallback.
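One possible shape for the heuristic, as a sketch: the keyword list and the confidence formula are assumptions; only the 0.30-0.50 range comes from the design above.

```python
import re

KEYWORDS = re.compile(r"\b(agenda|minutes|meeting|council|commission)\b", re.IGNORECASE)
DATES = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"
    r"|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}\b"
)

def universal_confidence(page_text: str) -> float:
    """Return 0.0 for 'not meeting-like'; otherwise a value in 0.30-0.50."""
    distinct = {m.group().lower() for m in KEYWORDS.finditer(page_text)}
    if not distinct or not DATES.search(page_text):
        return 0.0
    return min(0.50, 0.30 + 0.05 * len(distinct))
```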
## Failure Modes
The universal adapter must fail explicitly, never silently return 0 results.
### Detection: Config Drift
Meeting pages change over time — selectors break. Detection strategy:
| Signal | Meaning | Response |
|---|---|---|
| 0 rows extracted | Selector completely broken | Raise ExtractionError, log alert |
| Row count drops >50% vs. last run | Partial breakage | Warn, return partial results with flag |
| Date parsing failures >30% | Format changed | Warn, return what parses, flag rest |
| HTTP error on listing URL | Page moved/removed | Raise ExtractionError |
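The signal table above (excluding the HTTP-error case, which fires before extraction) can be captured in one classification function. A sketch; `classify_drift` and its return labels are illustrative, while the 50% and 30% thresholds are the ones stated in the table:

```python
def classify_drift(rows_now: int, rows_last: int, date_failures: int) -> str:
    if rows_now == 0:
        return "error"          # raise ExtractionError, log alert
    if rows_last and rows_now < rows_last * 0.5:
        return "warn_partial"   # row count dropped >50% vs. last run
    if date_failures / rows_now > 0.3:
        return "warn_dates"     # date format likely changed
    return "ok"
```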
### Mitigation: Health Checks
`UniversalExtractor.health()` runs the config against the live page and checks:

- Container selector matches ≥1 element
- Row selector matches ≥1 element within the container
- At least 1 title and 1 date extract successfully
Failed health checks trigger re-generation of the adapter config (with human approval).
### Versus Current Silent Failures
Current Simbli behavior when selectors break:
```python
# Returns [] silently — indistinguishable from "no meetings scheduled"
meetings = simbli.get_events()  # len(meetings) == 0
```
Universal adapter behavior:
```python
# Raises with diagnostics
ExtractionError(
    "Container selector 'table.meetings-list' matched 0 elements. "
    "Page may have changed. Last successful: 2026-03-10. "
    "Re-run config generation or inspect page manually."
)
```
## Migration Path
### Phase 1: New Platforms (immediate)

Use the universal adapter for cities with no existing client. Portland, OR is the first test case — it uses a custom layout at `portland.gov/council/agenda/all`.
### Phase 2: Simbli Migration (after validation)
Simbli is the first existing client to migrate:

1. Generate adapter config for the SRCS Simbli instance
2. Run both old and new extractors in parallel, compare outputs
3. When outputs match for 2+ weeks, switch to the universal adapter
4. Delete the Simbli-specific regex/selector code
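The parallel-run comparison in step 2 could be as simple as a set diff on (date, title) keys. A sketch; `compare_runs` and the key choice are assumptions, not from the codebase:

```python
def compare_runs(old: list[dict], new: list[dict]) -> dict:
    """Diff two extractor outputs by (date, title) identity."""
    key = lambda m: (m.get("date"), m.get("title"))
    old_keys = {key(m) for m in old}
    new_keys = {key(m) for m in new}
    return {
        "match": old_keys == new_keys,
        "missing_in_new": sorted(old_keys - new_keys),
        "extra_in_new": sorted(new_keys - old_keys),
    }
```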
### Phase 3: Other HTML Clients (as needed)
ProudCity and any other HTML-scraping clients can migrate if the universal adapter proves reliable. API-based clients (Legistar, CivicClerk, eScribe) stay as-is — they have stable, structured interfaces that don't benefit from this pattern.
## Rationale
### Why onboard-time LLM, not extraction-time?
| Factor | Onboard-time | Extraction-time |
|---|---|---|
| Cost | 1 LLM call per platform | 1 LLM call per extraction run |
| Determinism | Config is fixed, results reproducible | Results vary run to run |
| Latency | Extraction is pure HTML parsing | Each run waits for LLM |
| Auditability | Config is inspectable JSON | Must log every LLM interaction |
| Failure mode | Config drift is detectable | Failures are intermittent |
Onboard-time wins on every axis. This matches the proven Granicus pattern.
### Why declarative config, not generated code?
Generated code (e.g., the LLM writes a Python scraper) is harder to validate, harder to sandbox, and creates maintenance burden. A declarative config is:

- Inspectable — humans can read and fix it
- Validatable — JSON schema + smoke test
- Sandboxed — no arbitrary code execution
- Versionable — config changes are diffable
### Why keep platform-specific clients?
API-based platforms (Legistar, CivicClerk, eScribe) have stable, documented interfaces. A universal HTML scraper adds complexity without benefit for these. The universal adapter specifically targets:

- Platforms with no API (HTML-only)
- Platforms where the HTML structure varies across instances (Simbli)
- Custom municipal sites with no shared platform
## Alternatives Considered
### 1. LLM at Extraction Time (Adaptive Parsing)
Send each page to the LLM for parsing on every run. Rejected: expensive ($0.01-0.05 per page × hundreds of jurisdictions × daily runs), non-deterministic, high latency.
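To make the cost comparison concrete, rough numbers (jurisdiction count, pages per run, and the per-page cost midpoint are assumptions within the ranges stated above):

```python
jurisdictions = 300    # assumed fleet size ("hundreds of jurisdictions")
cost_per_page = 0.03   # midpoint of the $0.01-0.05 range above
pages_per_run = 3      # assumed listing + detail pages per jurisdiction
runs_per_year = 365    # daily runs

extraction_time = jurisdictions * pages_per_run * cost_per_page * runs_per_year
onboard_time = jurisdictions * cost_per_page  # one config-generation call each

print(f"extraction-time LLM: ~${extraction_time:,.0f}/year")
print(f"onboard-time LLM:    ~${onboard_time:,.2f} one-time")
```

Under these assumptions, extraction-time inference costs roughly three orders of magnitude more per year than the one-time onboard cost.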
### 2. Generated Python Scrapers
Have the LLM write a Python scraper class per platform. Rejected: arbitrary code execution risk, harder to validate than declarative config, maintenance burden when pages change.
### 3. Third-party Scraping Services (Firecrawl, Jina Reader)
Use external services for HTML-to-structured-data. Rejected: adds external dependency, cost per request, data leaves our infrastructure, less control over extraction quality.
### 4. One Bespoke Client Per City
Continue writing platform-specific clients. Rejected for the long tail: works for 6 platforms, does not scale to hundreds of custom municipal sites. Still the right choice for stable API platforms.
### 5. Community-Contributed Configs
Publish the config schema and let civic tech volunteers contribute configs for their cities. Not rejected — this is a future possibility enabled by the declarative config approach, but not part of initial implementation.
## Implementation Notes
- `UniversalExtractor` lives in `clients/universal.py`
- Config generation lives in `clients/universal_config.py`
- Factory dispatch: `source_type == "universal"` → `UniversalExtractor`
- Playwright is used for JS-heavy pages (`requires_javascript: true`), `requests` otherwise
- Config schema validation uses `jsonschema` (already a dependency)
- Provenance tracking follows the Granicus pattern: sample HTML hash, prompt version, raw LLM response
## References
- `clients/granicus.py:108-221` — Existing LLM config generation pattern
- `clients/simbli.py` — Brittle extraction to migrate
- `clients/base.py:413-468` — `BaseExtractor` interface
- Granicus column map prompt — lines 161-175