How it works
The problem
The same product is sold across Amazon, Flipkart, and Nykaa under three subtly different titles, often at three different prices. Existing comparison tools rely on brittle exact matches and miss obvious near-duplicates. PromoSensei is a four-phase build of a cross-platform deal engine that fixes this — you type “noise cancelling headphones”, it returns the canonical product with a per-platform price ladder.
The pipeline
- Scrape — three Playwright + BeautifulSoup scrapers, one per platform, funnel into a normalised `ScrapedListing` shape.
- Match — a brand + model number + pack-size matcher with hard guards against bundle / refurbished / size-mismatch merges, falling back to RapidFuzz token sets and embedding cosine for rephrased titles.
- Embed + index — pluggable embedding provider (hashing default, sentence-transformers or OpenAI swap-in), vectors stored per-canonical-product.
- Search — keyword / semantic / hybrid modes. The query parser lifts “under 2000” out of the free text into a listing-level filter; the ranker blends cosine similarity, normalised discount, and rating.
- Cache + observe — pluggable cache (in-process LRU+TTL, Redis-ready) for hot queries; structured JSON logs and a Prometheus `/metrics` endpoint for latency, cache hit-rate, scrape outcomes, and breaker state.
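The matching cascade in the Match step can be sketched roughly as follows: hard guards first, then an exact brand + model-number match, then a fuzzy-title fallback. The `Listing` shape, guard-word list, and threshold are illustrative stand-ins, not the repo's real names, and stdlib `difflib` substitutes for RapidFuzz's `token_set_ratio` so the sketch runs dependency-free.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher  # stand-in for RapidFuzz token_set_ratio

# Hypothetical guard list — bundles and refurbs must never merge with new units.
GUARD_WORDS = {"bundle", "combo", "refurbished", "renewed"}

@dataclass
class Listing:
    title: str
    brand: str
    model: str = ""       # e.g. "WH-1000XM5"
    pack_size: int = 1

def guard_blocks_merge(a: Listing, b: Listing) -> bool:
    """Hard guards: block merges where one side is a bundle/refurb
    or the pack sizes disagree."""
    differing = set(a.title.lower().split()) ^ set(b.title.lower().split())
    return bool(differing & GUARD_WORDS) or a.pack_size != b.pack_size

def token_set_score(a: str, b: str) -> float:
    """Order-insensitive title similarity (token-set matching in spirit)."""
    ta = " ".join(sorted(set(a.lower().split())))
    tb = " ".join(sorted(set(b.lower().split())))
    return SequenceMatcher(None, ta, tb).ratio()

def same_product(a: Listing, b: Listing, fuzz_threshold: float = 0.85) -> bool:
    if guard_blocks_merge(a, b):
        return False
    if a.brand.lower() == b.brand.lower() and a.model and a.model == b.model:
        return True  # exact brand + model number wins outright
    return token_set_score(a.title, b.title) >= fuzz_threshold
```

Note the ordering: the guards run before the exact match, so even two listings with identical model numbers stay separate when one of them is a refurb or a multi-pack.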
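The Search step's two moving parts can be sketched as well: a parser that lifts "under 2000"-style caps out of the free text, and a weighted blend over similarity, discount, and rating. The regex, function names, and blend weights here are illustrative assumptions, not the tuned values from the repo.

```python
import re
from typing import Optional, Tuple

# Matches "under 2000", "below ₹1,500", "less than 999" style price caps.
PRICE_CAP = re.compile(r"\b(?:under|below|less than)\s*₹?\s*([\d,]+)\b", re.I)

def parse_query(q: str) -> Tuple[str, Optional[int]]:
    """Lift a price cap out of the free text; return (cleaned_text, cap)."""
    m = PRICE_CAP.search(q)
    if not m:
        return q.strip(), None
    cap = int(m.group(1).replace(",", ""))
    cleaned = re.sub(r"\s{2,}", " ", (q[:m.start()] + q[m.end():]).strip())
    return cleaned, cap

def blend_score(cosine_sim: float, discount_pct: float, rating: float,
                weights=(0.6, 0.25, 0.15)) -> float:
    """Rank by a weighted sum of semantic similarity, normalised
    discount, and normalised rating (all mapped into [0, 1])."""
    return (weights[0] * cosine_sim
            + weights[1] * min(discount_pct / 100.0, 1.0)
            + weights[2] * rating / 5.0)
```

Stripping the cap from the query text matters: leaving "under 2000" in the string would pollute both the keyword match and the embedding.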
Engineering choices worth flagging
- Filters apply at the listing level, not the product level. So “earbuds under ₹2000” still surfaces a product that's overpriced on Amazon if its Flipkart listing fits — the user gets one card with the cheap listing highlighted.
- Per-platform circuit breakers + retry with jitter. A Flipkart outage doesn't take Amazon and Nykaa down with it, and the breaker stops us from hammering a dead platform.
- CI ranking-quality gate. 15 hand-labeled queries with NDCG@5 and Precision@3 thresholds; merges fail if either headline metric regresses. The harness is hand-rolled (no scikit-learn just to compute three numbers).
- No vendor lock-in on observability. Logs are line-delimited JSON; `/metrics` is Prometheus text format. Drop in any backend without code changes.
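The listing-level filter semantics described above reduce to a simple rule: a canonical product survives a price cap if any of its platform listings fits, and the cheapest fitting listing is what gets highlighted on the card. A minimal sketch, with hypothetical dict shapes rather than the repo's real schema:

```python
# A product passes the cap if ANY listing fits (listing-level, not
# product-level); the cheapest fitting listing is flagged for the UI.
def filter_products(products, max_price):
    results = []
    for product in products:
        fits = [l for l in product["listings"] if l["price"] <= max_price]
        if fits:
            best = min(fits, key=lambda l: l["price"])
            results.append({**product, "highlight": best["platform"]})
    return results
```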
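The per-platform breaker + jittered retry combination can be sketched like this. Thresholds, cooldowns, and names are illustrative assumptions; the point is that each platform owns one breaker instance, so a dead Flipkart never blocks Amazon or Nykaa fetches.

```python
import random
import time

class CircuitBreaker:
    """One instance per platform: opens after repeated failures,
    half-opens again after a cooldown."""
    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures, self.cooldown = max_failures, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0  # half-open: probe again
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def fetch_with_retry(breaker, fetch, attempts=3, base_delay=0.5):
    """Exponential backoff with full jitter; bail out once the breaker opens."""
    for i in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: platform marked unhealthy")
        try:
            result = fetch()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            time.sleep(random.uniform(0, base_delay * 2 ** i))  # full jitter
    raise RuntimeError("all retries failed")
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retries out so a recovering platform isn't hit by a synchronised thundering herd.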
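The three headline numbers the CI gate checks really are small enough to hand-roll. A sketch of NDCG@k and Precision@k over graded relevance labels (higher = more relevant), with the threshold parameter as an illustrative assumption:

```python
import math

def dcg(rels):
    """Discounted cumulative gain: relevance discounted by log2(rank + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=5):
    """DCG of the top-k ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal else 0.0

def precision_at_k(ranked_rels, k=3, threshold=1):
    """Fraction of the top k whose label meets the relevance threshold."""
    return sum(1 for r in ranked_rels[:k] if r >= threshold) / k
```

A perfectly ordered ranking scores NDCG 1.0; any swap that moves a more-relevant result below a less-relevant one pulls it under, which is exactly the regression the merge gate is watching for.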
What's real vs. demoed
The scrapers, matcher, embedding pipeline, search service, cache, scheduler, breakers, eval harness, and metrics endpoint are all production code with 181 passing tests. The live deploy on this URL runs against a curated 120-product catalogue rather than scraping Amazon / Flipkart / Nykaa continuously, because all three forbid scraping in their ToS and would IP-ban the demo within hours. The scrapers in the repo are exercised against captured HTML fixtures so their parsing logic stays under test.
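Fixture-based scraper testing of the kind described above can be sketched with a captured HTML snippet fed to a parser, with no network involved. Here stdlib `html.parser` stands in for BeautifulSoup so the sketch is dependency-free, and the fixture markup, class names, and selectors are all hypothetical, not the real platforms' markup.

```python
from html.parser import HTMLParser

# A captured HTML fragment standing in for a saved fixture file.
FIXTURE = """
<div class="product"><span class="title">Sony WH-1000XM5</span>
<span class="price">₹24,990</span></div>
"""

class ProductParser(HTMLParser):
    """Extracts title and price text from the fixture's class-tagged spans."""
    def __init__(self):
        super().__init__()
        self.fields, self._current = {}, None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("title", "price"):
            self._current = cls  # remember which field the next text fills

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = ProductParser()
parser.feed(FIXTURE)
```

Because the fixture is frozen, a parse regression shows up as a deterministic test failure rather than a flaky live-scrape error, and the suite runs without ever touching the platforms' servers.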
Stack
- Backend: FastAPI · SQLAlchemy · Pydantic v2 · APScheduler · RapidFuzz · BeautifulSoup · Playwright · pytest
- Frontend: Next.js 14 (App Router) · React · TypeScript · Tailwind
- Data: PostgreSQL (Neon) · in-process JSON-vector index (pgvector swap-in documented)
- Deploy: Vercel (frontend) · Render (backend, Docker) · GitHub Actions CI
Read the code
The repo is organised by phase — each directory is a self-contained snapshot showing the system at that maturity level, so you can see the build-out without git-archaeology. phase4/ is what runs here.