civicos-extraction¶
Platform parsers for extracting civic data from municipal websites and government APIs.
Location: packages/civicos-extraction/
Supported Platforms¶
Municipal Meeting Platforms¶
| Platform | Type | Jurisdictions |
|---|---|---|
| ProudCity | Web scraper | San Rafael (primary pilot) |
| Granicus | API | Marin County |
| Legistar | API | Berkeley, Oakland, SF, Richmond, Hayward, San Pablo |
| CivicClerk | API + OData | El Cerrito, Hayward, San Pablo, Richmond, Vallejo, Antioch |
| BoardDocs | POST/HTML | Ross Valley SD, MCOE, Larkspur-Corte Madera SD, Sausalito-Marin City SD, College of Marin |
| Simbli | Playwright | San Rafael City Schools, Novato USD, Tamalpais Union HSD |
| eScribe | HTML | (available, no current jurisdictions) |
Community Issues¶
| Platform | Type | Coverage |
|---|---|---|
| SeeClickFix | API | San Rafael (311 reports) |
Legislation & Federal Data¶
| Platform | Type | Coverage |
|---|---|---|
| LegiScan | API | State + federal bills |
| Federal Register | API | Executive orders, regulations |
| FAC v2 | API | Federal Audit Clearinghouse (grants) |
| USAspending | API | Federal spending |
| SAM Assistance | API | Federal assistance programs |
| HUD Exchange | API | CDBG data |
Financial¶
| Platform | Type | Coverage |
|---|---|---|
| CA State Controller | Web | ACFR financial data |
| CA Grants | API | State grant programs |
Elections & Results¶
| Platform | Type | Coverage |
|---|---|---|
| Marin Registrar | GraphQL | County election results (46 elections, 2010–present) |
| CA Secretary of State | REST | Statewide races, ballot measures, county breakdowns |
| Google Civic | API | Voter info, polling locations (reps endpoint deprecated) |
| San Rafael Clerk | Web | City candidates, local measures |
Media¶
| Platform | Type | Coverage |
|---|---|---|
| YouTube Boards | Web/API | Meeting video extraction |
Extraction Pipeline¶
All extractors follow a 4-stage pattern:
- FETCH: Pull raw data from platform API or website
- NORMALIZE: Convert to CivicOS schema (meetings, decisions, etc.)
- VALIDATE: Check against JSON schema
- STORE: Persist to storage backend
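The four stages above can be sketched as a function that takes each stage as a callable. This is a minimal illustration under stated assumptions, not the package's actual pipeline: `run_pipeline`, the toy records, and the simple predicate standing in for JSON schema validation are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def run_pipeline(
    fetch: Callable[[], List[Record]],
    normalize: Callable[[Record], Record],
    validate: Callable[[Record], bool],
    store: Callable[[List[Record]], None],
) -> Dict[str, int]:
    """Run FETCH -> NORMALIZE -> VALIDATE -> STORE and return simple counts."""
    raw = fetch()                                   # FETCH
    normalized = [normalize(r) for r in raw]        # NORMALIZE
    valid = [m for m in normalized if validate(m)]  # VALIDATE
    store(valid)                                    # STORE
    return {
        "fetched": len(raw),
        "stored": len(valid),
        "rejected": len(raw) - len(valid),
    }


# Toy stand-ins for a real platform client and storage backend (hypothetical).
saved: List[Record] = []
stats = run_pipeline(
    fetch=lambda: [
        {"Title": "City Council", "Date": "2025-01-06"},
        {"Title": "", "Date": ""},
    ],
    normalize=lambda r: {"title": r["Title"].strip(), "date": r["Date"]},
    validate=lambda m: bool(m["title"] and m["date"]),
    store=saved.extend,
)
```

Keeping the stages as separate callables means a new platform only has to supply its own fetch and normalize steps, while validation and storage stay shared.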
Features:
- Checkpoint system for crash recovery (JSON files)
- Request throttling and exponential backoff
- Schema validation against MEETING_SCHEMA
- Manifest tracking per extraction run
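As a rough sketch of how exponential backoff and JSON checkpoints can fit together, the helpers below retry a failing call with doubling delays and persist progress to a JSON file. The function names, the retry defaults, and the write-then-rename strategy are assumptions made for illustration; the package's own implementation may differ.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, doubling the delay after each failure (plus a little jitter)."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise AssertionError("unreachable")


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None when no checkpoint exists yet."""
    return json.loads(path.read_text()) if path.exists() else None
```

On restart, a run could call `load_checkpoint()` first and resume from the recorded position instead of refetching everything.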
Adding a New Extractor¶
Implement the base interface:
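from __future__ import annotations  # lets the class load before Meeting/HealthStatus are imported
from typing import Dict, List

class MyExtractor:
    """Base interface. Meeting and HealthStatus are CivicOS types."""
    def get_events(self, days_ahead=30, days_past=0) -> List[Dict]:
        """Raw platform data"""
    def get_meetings(self, days_ahead=30) -> List[Meeting]:
        """Normalized to CivicOS schema"""
    def health(self) -> HealthStatus:
        """Platform availability check"""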
Register in the factory (clients/factory.py) and add to SUPPORTED_MEETING_SOURCES in clients/__init__.py so the standard pipeline dispatches to it.
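The contents of clients/factory.py are not shown here, so the following is only a sketch of one common way such a registration step can work: a dictionary keyed by `source_type`, filled by a decorator. `_REGISTRY`, `register`, `create_client`, and the `"myplatform"` key are hypothetical names, and the tuple at the end is a stand-in for the real `SUPPORTED_MEETING_SOURCES`.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry; the real factory lives in clients/factory.py.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(source_type: str) -> Callable[[type], type]:
    """Class decorator that maps a source_type string to an extractor class."""
    def decorator(cls: type) -> type:
        _REGISTRY[source_type] = cls
        return cls
    return decorator


def create_client(source_type: str, **kwargs) -> object:
    """Instantiate the extractor registered for source_type."""
    try:
        factory = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unsupported source_type: {source_type!r}") from None
    return factory(**kwargs)


@register("myplatform")
class MyExtractor:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url


# Stand-in for the collection exported from clients/__init__.py.
SUPPORTED_MEETING_SOURCES: Tuple[str, ...] = tuple(sorted(_REGISTRY))

client = create_client("myplatform", base_url="https://example.org")
```

Failing loudly on an unknown `source_type` makes a missing registration visible at dispatch time rather than as an empty extraction run.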
Config-Driven Ingestion¶
Each jurisdiction has an extraction config JSON in data/extraction/:
{
  "source_id": "boarddocs-ca-rova",
  "source_type": "boarddocs",
  "jurisdiction_id": "school-ross-valley",
  "base_url": "https://go.boarddocs.com/ca/rova/Board.nsf",
  "metadata": {
    "app_path": "ca/rova",
    "committee_id": "AB9A2R259AF0"
  }
}
The ingestion pipeline reads source_type to dispatch to the correct client. All platforms in SUPPORTED_MEETING_SOURCES work with the standard fetch_meetings() dispatcher in scripts/modal_ingest.py.
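To illustrate how a config like the one above might be read before dispatch, here is a small sketch that parses the JSON into a typed object and fails early on missing keys. `ExtractionConfig`, `parse_config`, and the choice of which keys are required are assumptions for illustration; they are not taken from scripts/modal_ingest.py.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Assumed to be mandatory; the real pipeline may treat these differently.
REQUIRED_KEYS = ("source_id", "source_type", "jurisdiction_id", "base_url")


@dataclass(frozen=True)
class ExtractionConfig:
    source_id: str
    source_type: str
    jurisdiction_id: str
    base_url: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_config(text: str) -> ExtractionConfig:
    """Parse one extraction config, failing early on missing keys."""
    raw = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in raw]
    if missing:
        raise ValueError(f"config missing keys: {missing}")
    return ExtractionConfig(
        **{k: raw[k] for k in REQUIRED_KEYS},
        metadata=raw.get("metadata", {}),
    )


config = parse_config("""
{
  "source_id": "boarddocs-ca-rova",
  "source_type": "boarddocs",
  "jurisdiction_id": "school-ross-valley",
  "base_url": "https://go.boarddocs.com/ca/rova/Board.nsf",
  "metadata": {"app_path": "ca/rova", "committee_id": "AB9A2R259AF0"}
}
""")
```

The value of `config.source_type` is what a dispatcher would then use to pick the client.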
Platform Discovery¶
The onboarding system auto-detects platforms from URLs or city names:
| Platform | Auto-discovery method |
|---|---|
| Legistar | Probe {slug}.legistar.com API |
| CivicClerk | Probe {slug}.civicclerk.com OData |
| Granicus | Probe {slug}.granicus.com + view IDs |
| eScribe | Probe pub-{slug}.escribemeetings.com |
| Simbli | Probe {slug}.simbli.com subdomains |
| BoardDocs | URL detection (go.boarddocs.com/{state}/{site}) + committee auto-discovery |
| ProudCity | Scrape city website /meetings/ page |
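The hostname patterns in the table can be turned into a list of candidate URLs for a discovery pass to try. This sketch only builds the URLs and makes no network calls; the `https` scheme, the `slugify` rule, and the function names are assumptions, and the onboarding system may derive slugs differently.

```python
import re
from typing import Dict

# Hostname patterns from the table above; the https scheme is an assumption.
PROBE_HOSTS: Dict[str, str] = {
    "legistar": "{slug}.legistar.com",
    "civicclerk": "{slug}.civicclerk.com",
    "granicus": "{slug}.granicus.com",
    "escribe": "pub-{slug}.escribemeetings.com",
    "simbli": "{slug}.simbli.com",
}


def slugify(city_name: str) -> str:
    """Hypothetical slug rule: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", city_name.lower())


def probe_urls(city_name: str) -> Dict[str, str]:
    """Return one candidate URL per platform for a discovery pass to try."""
    slug = slugify(city_name)
    return {
        name: "https://" + host.format(slug=slug)
        for name, host in PROBE_HOSTS.items()
    }


candidates = probe_urls("El Cerrito")
```

BoardDocs and ProudCity are left out because the table describes them as detected from a known URL or a scraped city page rather than from a slug-based hostname.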
BoardDocs note: Site codes can't be guessed from district names. During agentic onboarding, use WebSearch with the query site:go.boarddocs.com "{district name}" to find the district's URL, then pass that URL to onboard_jurisdiction(), which handles the rest of onboarding automatically.
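The two BoardDocs steps in the note above can be sketched as a query builder and a URL parser that recovers the `{state}/{site}` pair. The regular expression assumes a two-letter state code and an alphanumeric site code, and both function names are hypothetical; `onboard_jurisdiction()` itself is not reproduced here.

```python
import re
from typing import Optional, Tuple

# Assumes a two-letter state code followed by an alphanumeric site code.
BOARDDOCS_URL = re.compile(
    r"^https?://go\.boarddocs\.com/([a-z]{2})/([a-z0-9]+)(?:/|$)",
    re.IGNORECASE,
)


def search_query(district_name: str) -> str:
    """Build the site-restricted search query described in the note above."""
    return f'site:go.boarddocs.com "{district_name}"'


def parse_boarddocs_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (state, site) for a BoardDocs URL, or None if it does not match."""
    match = BOARDDOCS_URL.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


parts = parse_boarddocs_url("https://go.boarddocs.com/ca/rova/Board.nsf")
```

Joining the two parts with a slash gives the same shape as the `app_path` value in the extraction config shown earlier (`ca/rova`).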