About This Project
The Járókelő RAG System builds a comprehensive Retrieval-Augmented Generation (RAG) pipeline for data from Járókelő.hu, a Hungarian civic platform for reporting and tracking public issues across Budapest and other Hungarian cities.
This project came from a personal passion for civic engagement. As someone who regularly bikes around Budapest, I've witnessed the disrepair, vandalism, and neglect of our public spaces. Instead of falling into apathy, I began using the Járókelő platform and discovered hope that even in challenging political and financial climates, issues can and will be resolved when properly reported and tracked.
Key Features
- Automated Data Collection: Daily scraping of civic reports using Selenium and BeautifulSoup
- Advanced Preprocessing: Text cleaning, normalization, and intelligent chunking for optimal retrieval
- Vector Store & Embeddings: FAISS and Chroma integration with multiple embedding model support
- RAG Pipeline: Semantic search and LLM-powered response generation using OpenAI's ChatGPT
- Interactive UI: Streamlit-based interface for querying and debugging
- MLOps Automation: GitHub Actions workflows for continuous data processing and evaluation
- Comprehensive Analytics: Power BI dashboards and experimental evaluations
The system enables users to browse issues by district, category, and status while generating summaries and insights from reports using AI-powered semantic search and pattern detection.
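As an illustration of the chunking step listed under Advanced Preprocessing, here is a minimal sketch of overlapping word-window chunking. The function name `chunk_text` and the `max_words` and `overlap` defaults are assumptions made for this example; the project's actual chunking strategy may differ.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most `max_words` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    (Hypothetical helper; the defaults are illustrative, not the project's.)
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    report = " ".join(f"w{i}" for i in range(120))  # stand-in for a report body
    for chunk in chunk_text(report):
        print(len(chunk.split()), chunk[:30])
```

Overlap trades a little index size for fewer answers lost at chunk boundaries.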
My Experiments & Tools
Embedding Comparison
Compare different embedding models to see how they represent issue descriptions, allowing better semantic search and clustering of similar reports. Benchmarks Hungarian, English, and multilingual models on real civic data.
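To make the comparison concrete, the sketch below ranks reports by cosine similarity to a query vector. Running the same ranking with vectors from two different embedding models shows how each model orders similar reports. The vectors and report IDs are made-up toy data, and `rank_by_similarity` is a hypothetical helper rather than part of the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
    docs = {
        "pothole": [0.9, 0.1],
        "graffiti": [0.0, 1.0],
        "streetlight": [0.5, 0.5],
    }
    for doc_id, score in rank_by_similarity([1.0, 0.0], docs):
        print(doc_id, round(score, 3))
```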
View Embeddings
RAG Evaluation
Comprehensive automated evaluation of our Retrieval-Augmented Generation pipeline using standard IR metrics (hit rate, recall@k, precision@k). Tracks performance across different models, top-k values, and languages.
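For readers unfamiliar with these metrics, the following sketch shows one common way to compute them for a single query. The function names and the toy document IDs are assumptions made for illustration; the evaluation code in this project may define edge cases differently.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant item appears in the top k results, else 0.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def precision_at_k(retrieved, relevant, k):
    """Fraction of the k result slots that hold a relevant item."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here; some evaluations skip such queries
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    retrieved = ["a", "b", "c", "d"]  # ranked results for one query
    relevant = {"b", "d", "z"}        # ground-truth relevant documents
    for k in (1, 3, 4):
        print(k,
              hit_rate_at_k(retrieved, relevant, k),
              precision_at_k(retrieved, relevant, k),
              recall_at_k(retrieved, relevant, k))
```

Averaging each value over all evaluation queries gives the per-model, per-k numbers that such a report would typically track.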
View RAG Eval
Streamlit UI
Interactive web interface for querying the RAG system with debug capabilities. Features query input, real-time processing logs, and detailed response analysis for civic issue exploration.
Learn More
Embeddings Visualizations
Interactive 2D scatter plots of civic issues mapped by content similarity using t-SNE dimensionality reduction. Explore patterns and clusters colored by district, status, category, or institution.
View Visualizations
Power BI Dashboard
Comprehensive interactive dashboard visualizing civic issues across Budapest by district, category, status, and temporal trends. Currently in development, with an automated CSV export pipeline.
Coming Soon
Performance Optimizations
Performance engineering that made startup 17x faster through URL index caching. Includes a benchmark comparison of async and sync approaches, real-world performance testing, and scalability projections to 1M+ records.
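The general idea behind a cached URL index can be sketched as follows: build the set of already-known report URLs once, persist it, and load it from disk on later startups instead of rescanning all data. The file format (a JSON list), the function name `load_or_build_url_index`, and the `build_index` callback are assumptions for this example, not the project's actual implementation.

```python
import json
from pathlib import Path


def load_or_build_url_index(cache_path, build_index):
    """Return the set of known URLs, reading a cache file when one exists.

    `build_index` is a hypothetical callable that performs the slow full scan
    and returns an iterable of URLs. It only runs when no cache is present.
    """
    path = Path(cache_path)
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            return set(json.load(fh))

    urls = set(build_index())
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(sorted(urls), fh)  # sorted so the file is stable across runs
    return urls
```

A sketch like this leaves out cache invalidation; a real pipeline would also need to append newly scraped URLs to the cache or rebuild it periodically.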
View Performance Report
Optimized Status Pipeline
A 24x speedup for status updates: a 6-hour, timeout-prone process became a 4-job pipeline that completes in 10-15 minutes. Features targeted URL detection and a parallel processing architecture.
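The two ideas named here can be sketched in a few lines, under stated assumptions. "Targeted URL detection" is interpreted as re-checking only reports whose status can still change, and parallelism is shown with a thread pool inside one process, whereas the pipeline described above splits work across separate jobs. The status names, the record layout, and the `fetch_status` callback are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed terminal statuses; the platform's real status names may differ.
CLOSED_STATUSES = {"resolved", "closed"}


def select_urls_to_refresh(records):
    """Keep only URLs of reports whose status may still change."""
    return [r["url"] for r in records
            if r["status"].lower() not in CLOSED_STATUSES]


def refresh_statuses(urls, fetch_status, max_workers=4):
    """Fetch the current status of each URL concurrently.

    `fetch_status` is a hypothetical callable taking a URL and returning a
    status string. `max_workers=4` is an arbitrary choice for this sketch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch_status, urls))  # preserves input order
    return dict(zip(urls, statuses))


if __name__ == "__main__":
    records = [
        {"url": "https://example.org/report/1", "status": "Resolved"},
        {"url": "https://example.org/report/2", "status": "In progress"},
        {"url": "https://example.org/report/3", "status": "Waiting for reply"},
    ]
    todo = select_urls_to_refresh(records)
    # Stand-in fetcher; a real one would request and parse the report page.
    print(refresh_statuses(todo, lambda url: "In progress"))
```

Filtering first shrinks the workload, and the remaining network-bound requests benefit from running concurrently.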
View Pipeline Optimization