How it works

The problem

The same product is sold across Amazon, Flipkart, and Nykaa under three subtly different titles, often at three different prices. Existing comparison tools rely on brittle exact title matches and miss obvious near-duplicates. PromoSensei is a cross-platform deal engine, built in four phases, that fixes this: you type “noise cancelling headphones” and it returns the canonical product with a per-platform price ladder.

The pipeline

  1. Scrape — three Playwright + BeautifulSoup scrapers, one per platform, funnel into a normalised ScrapedListing shape.
  2. Match — a brand + model number + pack-size matcher with hard guards against bundle / refurbished / size-mismatch merges, falling back to RapidFuzz token sets and embedding cosine for rephrased titles.
  3. Embed + index — pluggable embedding provider (hashing default, sentence-transformers or OpenAI swap-in), vectors stored per-canonical-product.
  4. Search — keyword / semantic / hybrid modes. The query parser lifts “under 2000” out of the free text into a listing-level filter; the ranker blends cosine similarity, normalised discount, and rating.
  5. Cache + observe — pluggable cache (in-process LRU+TTL, Redis-ready) for hot queries; structured JSON logs and a Prometheus /metrics endpoint for latency, cache hit-rate, scrape outcomes, and breaker state.
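
To make the Search step concrete, here is a minimal Python sketch of lifting a price cap out of the free text and blending the three ranking signals. It is an illustration under stated assumptions, not the code in the repo: the regex, the names parse_query, Candidate, score and rank, and the weights 0.6 / 0.25 / 0.15 are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed pattern: "under 2000", "below 1,500", "under ₹999".
PRICE_CAP = re.compile(
    r"\b(?:under|below)\s+(?:rs\.?\s*|₹\s*)?(\d[\d,]*)\b", re.IGNORECASE
)


def parse_query(raw: str) -> Tuple[str, Optional[int]]:
    """Split a raw query into (search text, max price or None)."""
    match = PRICE_CAP.search(raw)
    if match is None:
        return " ".join(raw.split()), None
    max_price = int(match.group(1).replace(",", ""))
    # Cut the price phrase out so it does not pollute the text query.
    text = raw[:match.start()] + " " + raw[match.end():]
    return " ".join(text.split()), max_price


@dataclass
class Candidate:
    title: str
    price: int           # listing price, same currency as the cap
    cosine: float        # query/product similarity, assumed in [0, 1]
    discount_pct: float  # 0..100
    rating: float        # 0..5


def score(c: Candidate, w_cos: float = 0.6, w_disc: float = 0.25,
          w_rating: float = 0.15) -> float:
    """Weighted blend of similarity, normalised discount, and rating."""
    discount = min(max(c.discount_pct, 0.0), 100.0) / 100.0
    rating = min(max(c.rating, 0.0), 5.0) / 5.0
    return w_cos * c.cosine + w_disc * discount + w_rating * rating


def rank(candidates: List[Candidate],
         max_price: Optional[int] = None) -> List[Candidate]:
    """Apply the listing-level price filter, then sort by blended score."""
    kept = [c for c in candidates if max_price is None or c.price <= max_price]
    return sorted(kept, key=score, reverse=True)


if __name__ == "__main__":
    text, cap = parse_query("noise cancelling headphones under 2000")
    print(text, cap)
```

Normalising discount and rating to [0, 1] before weighting keeps any one signal from dominating purely because of its scale; the weights themselves would need tuning against an evaluation set.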

Engineering choices worth flagging

What's real vs. demoed

The scrapers, matcher, embedding pipeline, search service, cache, scheduler, breakers, eval harness, and metrics endpoint are all production code with 181 passing tests. The live deployment at this URL runs against a curated 120-product catalogue rather than scraping Amazon / Flipkart / Nykaa continuously, because all three forbid scraping in their ToS and would IP-ban the demo within hours. The scrapers in the repo are instead exercised against captured HTML fixtures, so their parsing logic stays under test.
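
As a rough illustration of fixture-based testing, the Python sketch below parses a saved HTML snippet and checks the extracted fields without touching the network. Everything in it is an assumption: the fixture markup, the class names, and the parse_listing helper are hypothetical, and it uses the standard-library html.parser in place of BeautifulSoup to stay dependency-free.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Union

# Hypothetical captured fixture; real platform markup differs.
FIXTURE_HTML = """
<div class="product">
  <span class="title">Acme NC-200 Headphones</span>
  <span class="price">₹1,999</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collects the text of <span class="title"> and <span class="price">."""

    def __init__(self) -> None:
        super().__init__()
        self._field: Optional[str] = None
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] = self.fields.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listing(html: str) -> Dict[str, Union[str, int]]:
    """Return the title and integer price found in one listing snippet."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    digits = "".join(ch for ch in parser.fields["price"] if ch.isdigit())
    return {"title": parser.fields["title"].strip(), "price": int(digits)}


if __name__ == "__main__":
    print(parse_listing(FIXTURE_HTML))
```

Because the fixture is a fixed string, a test like this fails only when the parsing logic changes, not when a live site is slow or blocks the request.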

Stack

Read the code

The repo is organised by phase: each directory is a self-contained snapshot of the system at that maturity level, so you can follow the build-out without digging through git history. phase4/ is what runs here.