AI Hybrid Search of Legislative Data

/ Project Overview

This project turns bill data from the U.S. Congress into a fast, filterable research tool by combining Congress.gov ingestion, Typesense hybrid search, and vector AI retrieval. It layers an AI assistant on top of the search system to answer plain-English questions with clickable source bills. The same stack also powers individual bill pages, where an AI chat can drill into bill text, actions, and metadata.


Challenges & Objectives

/ What This Delivers

Convert raw Congress.gov records into clean, searchable, filterable documents
Use vector embeddings to match real-world questions to structured legislative data
Keep answers grounded in retrieved bills, with UI-rendered source links for trust
Enable bill-level deep dives via AI chat on individual bill pages

/ Challenges

Normalize inconsistent data across endpoints and fill taxonomy gaps (especially “status of legislation”)
Scale indexing to thousands of bills while controlling memory use, retries and partial failures
Ship a fast UI with filters that do not break layout, plus an assistant that stays useful without hallucinating
Support deep per-bill Q&A that references bill content and the action timeline, not just a generic summary

/ Objectives

Deliver hybrid search (keyword + vector) with robust facets (chamber, policy area, committees, sponsor party, updated date and status)
Generate and store AI summaries + embeddings during crawling so the UI stays fast
Add AI Q&A on the search page that synthesizes retrieved results and supports follow-ups
Add AI chat on bill pages that explains the bill, interprets actions and answers “what does this mean?” style questions
Build a resilient pipeline with state tracking + a failed queue for automatic retries and targeted backfills
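
As an illustration of the state tracking + failed queue idea, here is a minimal sketch. The state shape (lastRunAt, failed) and every function name are assumptions made for this example, not the project's actual code. It shows the intended behavior: earlier failures are retried first, one bad bill never stops the run, and whatever fails is carried into the next run.

```typescript
// Hypothetical persisted crawler state; field names are assumptions.
interface CrawlState {
  lastRunAt: string | null; // ISO timestamp of the last finished run
  failed: string[];         // bill IDs that should be retried next run
}

interface Outcome {
  billId: string;
  ok: boolean;
}

// Retry earlier failures first, then newly updated bills, without duplicates.
function planQueue(state: CrawlState, updatedBillIds: string[]): string[] {
  return Array.from(new Set([...state.failed, ...updatedBillIds]));
}

// Fold one run's outcomes into the state that gets saved for the next run.
function recordOutcomes(outcomes: Outcome[], finishedAt: Date): CrawlState {
  return {
    lastRunAt: finishedAt.toISOString(),
    failed: outcomes.filter((o) => !o.ok).map((o) => o.billId),
  };
}

// Soft-fail loop: a failing bill is recorded and the run continues.
async function runIncremental(
  state: CrawlState,
  updatedBillIds: string[],
  processBill: (billId: string) => Promise<void>,
): Promise<CrawlState> {
  const outcomes: Outcome[] = [];
  for (const billId of planQueue(state, updatedBillIds)) {
    try {
      await processBill(billId);
      outcomes.push({ billId, ok: true });
    } catch {
      outcomes.push({ billId, ok: false });
    }
  }
  return recordOutcomes(outcomes, new Date());
}
```

Keeping the queue planning and the outcome bookkeeping as pure functions makes the retry behavior easy to check without calling any API.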

Build & Stack Overview

/ Data + Crawling

Congress.gov API → fetch bill metadata, summaries, committees and actions
GitHub Actions (Node.js crawler) → normalize, enrich and upsert into search index
Derive a status taxonomy from the /bill/{congress}/{type}/{number}/actions endpoint (with AI fallback only when necessary)
State + failure queue → incremental runs, soft-fail per bill, retry next run
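
The status derivation above could look roughly like the sketch below. The status labels, the regex patterns and the BillAction shape are illustrative assumptions, and the patterns are far from exhaustive. The function returns the furthest milestone that any action text matches, or null when nothing matches, which is the point where a caller might hand off to the AI fallback.

```typescript
// Assumed minimal shape of one action record; real responses carry more fields.
interface BillAction {
  actionDate: string;
  text: string;
}

// Hypothetical status labels, ordered from furthest milestone to earliest.
type BillStatus =
  | "Became Law"
  | "Vetoed"
  | "Sent to President"
  | "Passed Senate"
  | "Passed House"
  | "In Committee"
  | "Introduced";

// Illustrative patterns only; the first matching rule wins.
const STATUS_RULES: Array<[RegExp, BillStatus]> = [
  [/became public law/i, "Became Law"],
  [/vetoed/i, "Vetoed"],
  [/presented to president/i, "Sent to President"],
  [/passed senate/i, "Passed Senate"],
  [/passed house/i, "Passed House"],
  [/referred to/i, "In Committee"],
  [/introduced in/i, "Introduced"],
];

// Returns the furthest milestone found in the timeline, or null if no rule applies.
function deriveStatus(actions: BillAction[]): BillStatus | null {
  for (const [pattern, status] of STATUS_RULES) {
    if (actions.some((action) => pattern.test(action.text))) {
      return status;
    }
  }
  return null;
}
```

Checking rules in milestone order, rather than reading only the latest action, means a bill that passed the House and was then referred to a Senate committee still reports "Passed House".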

/ Vector AI Data Layer

OpenAI embeddings generated from title + key metadata + AI summary
Store vectors alongside docs in Typesense to enable semantic retrieval
Run hybrid ranking so exact queries (“H.R. 123”) and concept queries (“wildfire mitigation grants”) both work
For enterprise-scale deployments, optionally host the embeddings in Pinecone to improve performance
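
To make the embedding input and the hybrid query concrete, here is a hedged sketch. The field names (title, ai_summary, embedding, chamber and so on) are assumptions about the schema, and the functions only build strings and parameter objects, they do not call OpenAI or Typesense. The vector_query string follows the format Typesense documents for hybrid search with a caller-supplied vector, where alpha is the weight given to the vector ranking.

```typescript
// Hypothetical document fields; the real schema is not shown here.
interface BillDoc {
  title: string;
  policyArea?: string;
  sponsorParty?: string;
  aiSummary?: string;
}

// Text sent to the embedding model: title + key metadata + AI summary.
function buildEmbeddingInput(doc: BillDoc): string {
  const parts = [doc.title, doc.policyArea, doc.sponsorParty, doc.aiSummary];
  return parts
    .filter((p): p is string => typeof p === "string" && p.trim() !== "")
    .join("\n");
}

interface HybridSearchParams {
  q: string;
  query_by: string;
  vector_query: string;
  facet_by: string;
  filter_by?: string;
}

// Builds search parameters that combine keyword matching with vector similarity.
function buildHybridSearch(
  query: string,
  queryVector: number[],
  filters: Record<string, string> = {},
  alpha = 0.3, // assumed starting weight for the vector ranking
): HybridSearchParams {
  const params: HybridSearchParams = {
    q: query,
    query_by: "title,ai_summary",
    vector_query: "embedding:([" + queryVector.join(",") + "], alpha: " + alpha + ")",
    facet_by: "chamber,policy_area,sponsor_party,status",
  };
  // Backticks escape filter values that contain spaces or commas.
  const clauses = Object.keys(filters).map((f) => f + ":=`" + filters[f] + "`");
  if (clauses.length > 0) {
    params.filter_by = clauses.join(" && ");
  }
  return params;
}
```

An exact query such as "H.R. 123" is carried mostly by the keyword side, while a concept query such as "wildfire mitigation grants" leans on the vector side; tuning alpha shifts the balance between the two.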

/ Search + Retrieval + AI Interactions

Typesense Cloud → fast keyword search, filters, facets, hybrid vector search
Shared retrieval layer feeds both the search UI and the bill detail experience
Search-page assistant → answers questions using retrieved bills only; UI prints the sources + links
Bill-page AI chat → uses the bill’s stored text/details to answer deeper questions (actions, intent, implications)
Guardrails to keep responses concise, grounded and traceable
Cloudflare Worker → secure proxy for keys, per-bill lookups and API orchestration
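
One way to keep the search-page assistant grounded is to build its prompt only from the retrieved bills and return the source links separately so the UI can render them. The sketch below is an assumption about how that might be wired, with hypothetical type names, prompt wording and a hypothetical maxBills cap. It only assembles messages and does not call any model.

```typescript
// Hypothetical shape of a bill returned by the shared retrieval layer.
interface RetrievedBill {
  id: string;
  title: string;
  summary: string;
  url: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface SourceLink {
  id: string;
  title: string;
  url: string;
}

// Builds model messages from retrieved bills only, plus the links the UI prints.
function buildGroundedMessages(
  question: string,
  bills: RetrievedBill[],
  maxBills = 5,
): { messages: ChatMessage[]; sources: SourceLink[] } {
  const used = bills.slice(0, maxBills);
  const context = used
    .map((b) => "[" + b.id + "] " + b.title + "\n" + b.summary)
    .join("\n\n");
  const system = [
    "Answer using only the bills listed below.",
    "Cite bills by their bracketed ID and keep the answer concise.",
    "If the bills do not answer the question, say so.",
    "",
    context,
  ].join("\n");
  return {
    messages: [
      { role: "system", content: system },
      { role: "user", content: question },
    ],
    sources: used.map((b) => ({ id: b.id, title: b.title, url: b.url })),
  };
}
```

Because the sources come from the retrieval step rather than from the model's output, the links shown in the UI stay accurate even if the model paraphrases or omits a citation.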