AI Hybrid Search of Legislative Data

/ Project Overview

This project turns bill data from the U.S. Congress into a fast, filterable research tool by combining Congress.gov ingestion, Typesense hybrid search, and vector AI retrieval. It layers an AI assistant on top of the search system to answer plain-English questions with clickable source bills. The same stack also powers individual bill pages, where an AI chat can drill into bill text, actions, and metadata.

Challenges & Objectives

/ What This Delivers

  • Convert raw Congress.gov records into clean, searchable, filterable documents

  • Use vector embeddings to match real-world questions to structured legislative data

  • Keep answers grounded in retrieved bills, with UI-rendered source links for trust

  • Enable bill-level deep dives via AI chat on individual bill pages

/ Challenges

  • Normalize inconsistent data across endpoints and fill taxonomy gaps (especially “status of legislation”)

  • Scale indexing efficiently (thousands of bills) while controlling memory, retries and partial failures

  • Ship a fast UI with filters that do not break layout, plus an assistant that stays useful without hallucinating

  • Support deep per-bill Q&A that references bill content and the action timeline, not just a generic summary

/ Objectives

  • Deliver hybrid search (keyword + vector) with robust facets (chamber, policy area, committees, sponsor party, updated date and status)

  • Generate and store AI summaries + embeddings during crawling so the UI stays fast

  • Add AI Q&A on the search page that synthesizes retrieved results and supports follow-ups

  • Add AI chat on bill pages that explains the bill, interprets actions and answers “what does this mean?” style questions

  • Build a resilient pipeline with state tracking + a failed queue for automatic retries and targeted backfills

Build & Stack Overview

AI_search_iphone_nobkg

/ Data + Crawling

  • Congress.gov API → fetch bill metadata, summaries, committees and actions

  • GitHub Actions (Node.js crawler) → normalize, enrich and upsert into search index

  • Derived status taxonomy from /bill/{congress}/{type}/{number}/actions (with AI fallback only when necessary)

  • State + failure queue → incremental runs, soft-fail per bill, retry next run

/ Vector AI Data Layer

  • OpenAI embeddings generated from title + key metadata + AI summary

  • Store vectors alongside docs in Typesense to enable semantic retrieval

  • Run hybrid ranking so exact queries (“H.R. 123”) and concept queries (“wildfire mitigation grants”) both work

  • For enterprise level, option to host AI-embeddings on Pinecone to improve performance

/ Search + Retrieval + AI Interactions

  • Typesense Cloud → fast keyword search, filters, facets, hybrid vector search

  • Shared retrieval layer feeds both the search UI and the bill detail experience

  • Search-page assistant → answers questions using retrieved bills only; UI prints the sources + links

  • Bill-page AI chat → uses the bill’s stored text/details to answer deeper questions (actions, intent, implications)

  • Guardrails to keep responses concise, grounded and traceable

  • Cloudflare Worker → secure proxy for keys, per-bill lookups and API orchestration