Data pipeline

From Walmart URL to curator-ready signal

This is what happens to every item from the moment an agent spots it on walmart.com to the moment it shows up in your list with a relative rank, composite score, and trending badge. Every step is observable in the Scrape Activity → Pipeline timeline.

1. Scrape → 2. Store → 3. Enrich → 4. Map → 5. Rank → 6. Leaderboard → 7. Signals

Vocabulary

The Eko hierarchy has two levels. Both were renamed recently, so keep that in mind when reading older docs and commits:

Eko Dept (40 total)

Top-level grouping. 24 Walmart-contract depts + 16 non-contract. Example: Consumer Electronics & Computing.

Eko Category (272 total)

Leaf grouping under a Dept. Examples: Smart Home & IoT, Headphones & Audio, Televisions & Displays.

The authoritative taxonomy lives in Eko_Taxonomy.md and walmart_to_eko_mapping.csv — the CSV has 51,558 rows mapping every Walmart L0/L1/L2/L3/L4 path to its Dept and Category.
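As a rough illustration of how that CSV turns into a lookup, here is a minimal in-memory loader. The column order and header names below are assumptions, not the real walmart_to_eko_mapping.csv schema, and a production loader would need quoted-field handling.

```typescript
// Sketch: read the mapping CSV into a Map keyed on the full Walmart path.
// ASSUMED column order: l0,l1,l2,l3,l4,eko_dept,eko_category
function parseMappingCsv(text: string): Map<string, { dept: string; category: string }> {
  const map = new Map<string, { dept: string; category: string }>();
  const lines = text.trim().split("\n").slice(1); // skip header row
  for (const line of lines) {
    const cols = line.split(","); // real loader needs quoted-field handling
    const path = cols.slice(0, 5).filter(Boolean).join("/"); // drop empty L3/L4 slots
    map.set(path, { dept: cols[5], category: cols[6] });
  }
  return map;
}
```

With 51,558 rows this fits comfortably in memory, so a single Map lookup per item is cheap.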

The seven steps

1

Scrape

· Browserbase + Playwright

Agent workers spin up a cloud Chromium session, navigate to a Walmart URL, parse __NEXT_DATA__, and extract item IDs + names. One agent runs per (Walmart category, mode) combo, with concurrency capped at 15 to stay under bot-detection thresholds.

Scrape modes

| Mode | URL shape | Used by |
| --- | --- | --- |
| best_sellers | /browse/best-sellers?cat_id=X | Every 6h cron |
| price_low / price_high | same, &sort=price_low | Nightly deep scrape |
| top_rated | same, &sort=rating_high | Nightly deep scrape |
| new | /shop?...&sort=new_arrival&affinityOverride=default | Nightly deep scrape |
| price_bracket_0-25 … 250+ | /shop?...&min_price=X&max_price=Y | Coverage mode — 5 brackets |
| brand_facet | /shop?...&facet=brand:X | Coverage mode — top 10 brands per category |
| search / custom_search | /search?q=X | Custom scrape tab |

Blocked pages (CAPTCHA / verify prompts) are detected by page title, fallback URLs are rotated, and a warmup visit to walmart.com/ primes anti-bot cookies before hitting /search.
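The extraction step can be sketched as a pure function over the parsed __NEXT_DATA__ payload. The nesting used here (props.pageProps.initialData.searchResult.itemStacks) is an assumption about the page shape, which Walmart changes periodically, so treat the paths as illustrative:

```typescript
// Sketch: pull item IDs + names out of a parsed __NEXT_DATA__ object.
// The JSON shape below is ASSUMED, not guaranteed by Walmart.
interface ScrapedItem { id: string; name: string; rank: number }

function extractItems(nextData: any): ScrapedItem[] {
  const stacks = nextData?.props?.pageProps?.initialData?.searchResult?.itemStacks ?? [];
  const out: ScrapedItem[] = [];
  for (const stack of stacks) {
    for (const item of stack.items ?? []) {
      if (!item.usItemId) continue; // skip ad tiles / null entries with no item ID
      out.push({ id: String(item.usItemId), name: item.name ?? "", rank: out.length + 1 });
    }
  }
  return out;
}
```

Keeping the parse step pure makes it trivially unit-testable without a browser session.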

2

Store

· batch upsert

Scraped items are chunked into groups of 100 and written with a single multi-row INSERT ... ON CONFLICT DO UPDATE per table. One 200-item scrape drops from ~600 serial queries to 3–6 batched ones.

Three tables get written per chunk:

  • items — upsert on walmart_item_id; merges the discovery_modes array.
  • rankings + item_category_ranks — per-item rank observation for this (Walmart category, mode). item_category_ranks uses scraped_day as part of the PK so multiple scrapes in one day collapse to one row with the best (lowest) rank.
  • scrape_run_items — per-run discovery log; lets the dashboard show "items discovered in this run" + CSV export.

Per-item errors inside the chunk get surfaced as a sample into scrape_runs.errors — no more silent "200 items found, 0 stored" mystery states.
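A minimal sketch of the chunk-and-upsert pattern, assuming a node-postgres style `$n` parameter syntax; the table and column set are abbreviated here, not the real schema:

```typescript
// Sketch: build one multi-row INSERT ... ON CONFLICT DO UPDATE per chunk.
function buildItemUpsert(items: { id: string; name: string }[]): { sql: string; params: string[] } {
  const placeholders = items
    .map((_, i) => `($${i * 2 + 1}, $${i * 2 + 2})`) // two params per row
    .join(", ");
  const sql =
    `INSERT INTO items (walmart_item_id, name) VALUES ${placeholders} ` +
    `ON CONFLICT (walmart_item_id) DO UPDATE SET name = EXCLUDED.name`;
  return { sql, params: items.flatMap((it) => [it.id, it.name]) };
}

function chunk<T>(arr: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
  return out;
}
// 200 items → chunk(items, 100) → 2 statements per table instead of 200 serial ones.
```

The real upsert also merges the discovery_modes array; that merge would live in the DO UPDATE clause.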

3

Enrich

· Eko Walmart API

Batched HTTP calls to srv.eko.com/products/get?ids=…. 50 item IDs per request, 5 concurrent requests per wave. A 200-item targeted enrichment dropped from ~4 minutes (serial) to ~2 seconds.

Response populates these columns on each item row:

  • name, brand, upc, msrp, sale_price, image_url
  • customer_rating, num_reviews, seller, marketplace
  • category_path — Walmart's internal L0/L1/L2/L3/L4 path
  • walmart_dept — the L0 value (APPAREL, CONSUMABLES, FOOD, HOME, HARDLINES, ENTERTAINMENT TOYS AND SEASONAL, HEALTH AND WELLNESS, WALMART SERVICES)
  • metadata — the full Eko API record as JSONB

After this step every item has the signals it needs to be mapped deterministically.
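The batching arithmetic behind those numbers can be sketched as a planning helper. `planWaves` is hypothetical; the real worker would fire each wave's batches in parallel (e.g. with Promise.all) against srv.eko.com/products/get?ids=…:

```typescript
// Sketch: group item IDs into request batches of 50, then into waves of
// 5 concurrent requests.
const BATCH_SIZE = 50;
const WAVE_WIDTH = 5;

function planWaves(ids: string[]): string[][][] {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += BATCH_SIZE) {
    batches.push(ids.slice(i, i + BATCH_SIZE));
  }
  const waves: string[][][] = [];
  for (let i = 0; i < batches.length; i += WAVE_WIDTH) {
    waves.push(batches.slice(i, i + WAVE_WIDTH));
  }
  return waves;
}
// 200 IDs → 4 batches → a single wave of 4 parallel requests,
// which is why the 200-item case finishes in seconds.
```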

4

Map

· multi-phase, CSV-first

Items are routed to their Dept + Category through a series of phases. Each phase only touches items the previous phases didn't resolve. The method tag each item ends up with surfaces in the UI as a confidence badge.

| Phase | Rule | Confidence |
| --- | --- | --- |
| 0. Data quality | Block grey-market, test listings, ghost sellers, digital/service items. | filter |
| 10.55. CSV-path match | JOIN walmart_eko_mapping on walmart_dept = L0 + progressive L1-L4 match. Deepest match wins. | csv_path_match |
| 10.6. Rank-category rescue | Items scraped from a specific Walmart seed category (e.g. "TVs", "Headphones") inherit the matching Eko Category deterministically. | rank_category_rescue |
| 10.7. Path-segment rescue | Regex-style match on category_path segments (e.g. /TV & Home Theater/). | path_rescue |
| 11. Keyword match | Match category_path / item name against keyword phrases per Category. 150+ EPTs with name-based exclusions. | ept_keyword_path / name |
| 11.5. Brand rescue | Private labels (e.g. onn., Great Value) routed by brand + keyword. | brand_rescue |
| 12. Catchall | Anything still unmatched lands in the Other / Uncategorized Category of its Dept. | catchall_other |
| 13. Dept fallback | Items with no category path at all get placed by their scrape's dept code. | dept_fallback |

CSV-path is the source of truth. Every enriched item with a walmart_dept and a non-empty category_path gets a 100% deterministic match from the 51k-row CSV. The keyword-based phases 11+ only fire when the CSV didn't cover it.
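The "deepest match wins" join can be sketched in memory. The row shape here is illustrative, not the real CSV columns; the production version does this as a SQL JOIN:

```typescript
// Sketch: find the deepest mapping row whose full path is a prefix of the
// item's category_path, constrained to the same L0 dept.
interface MappingRow { path: string[]; dept: string; category: string }

function csvPathMatch(
  deptL0: string,
  itemPath: string[],
  rows: MappingRow[],
): { dept: string; category: string; depth: number } | null {
  let best: { dept: string; category: string; depth: number } | null = null;
  for (const row of rows) {
    if (row.path[0] !== deptL0) continue; // L0 must agree
    let depth = 0;
    while (depth < row.path.length && depth < itemPath.length
        && row.path[depth] === itemPath[depth]) depth++;
    // Only a full-row prefix counts; deeper rows beat shallower ones.
    if (depth === row.path.length && (!best || depth > best.depth)) {
      best = { dept: row.dept, category: row.category, depth };
    }
  }
  return best; // null → item falls through to phases 10.6+
}
```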
5

Rank refresh

· denormalize

Scanning rankings for every item list query would be slow. Instead, one pass writes summary columns onto each items row:

  • current_rank / previous_rank / rank_change / rank_direction
  • rank_history_json — last 10 ranks as a JSONB array (drives the sparklines in the item list)

Fires after every scrape. Uses window functions to compute "latest rank per item per Walmart category" in one query.
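Per item, the denormalized columns reduce to a small amount of math over the rank history. A sketch of that per-item math (the production version computes it in one SQL pass with window functions, and field names here are placeholders):

```typescript
// Sketch: derive the summary columns from a rank history, oldest first.
// Lower rank is better, so previous - current > 0 means the item moved up.
function rankSummary(history: number[]) {
  const current = history[history.length - 1] ?? null;
  const previous = history.length > 1 ? history[history.length - 2] : null;
  const change = current !== null && previous !== null ? previous - current : null;
  const direction =
    change === null ? "new" : change > 0 ? "up" : change < 0 ? "down" : "flat";
  return { current, previous, change, direction, sparkline: history.slice(-10) };
}
```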

6

Leaderboard

· per-Category ranking

Each item's rank within its Eko Category (the lowest level) and its Eko Dept, stored in the eko_leaderboard table:

  • rank_in_ept — 1 = best item in this Category right now
  • rank_pct_in_ept — percentile within the Category, 0..1 (e.g. 0.006 = top 0.6%, 0.14 = top 14%)
  • rank_in_eko_cat / rank_pct_in_eko_cat — same but scoped to the Dept
  • best_rank + best_mode — which scrape mode (best-sellers / top-rated / price-low) saw the best rank

Rebuilt after every rank refresh via a CTE that takes each item's best (lowest) rank over a 7-day window and assigns per-partition row numbers with a window function. Caps at the top 5000 per Category.
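An in-memory sketch of that pass, to make the shape concrete (the real version is a SQL CTE with ROW_NUMBER() OVER (PARTITION BY category ORDER BY rank); names here are illustrative):

```typescript
// Sketch: best rank per item over the window, then per-Category row numbers.
interface Obs { itemId: string; category: string; rank: number }

function buildLeaderboard(obs: Obs[]) {
  const best = new Map<string, Obs>();
  for (const o of obs) { // keep best (lowest) rank per item
    const prev = best.get(o.itemId);
    if (!prev || o.rank < prev.rank) best.set(o.itemId, o);
  }
  const byCat = new Map<string, Obs[]>();
  for (const o of best.values()) {
    if (!byCat.has(o.category)) byCat.set(o.category, []);
    byCat.get(o.category)!.push(o);
  }
  const rows: { itemId: string; rankInEpt: number; rankPct: number }[] = [];
  for (const group of byCat.values()) {
    group.sort((a, b) => a.rank - b.rank);
    group.slice(0, 5000).forEach((o, i) => // cap at top 5000 per Category
      rows.push({ itemId: o.itemId, rankInEpt: i + 1, rankPct: (i + 1) / group.length }));
  }
  return rows;
}
```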

7

Signals

· curation-ready metrics

Per-item computed metrics so the UI can show Top · Trending · Breakouts · Stable leaders without recomputing on each request.

| Signal | Formula |
| --- | --- |
| composite_score (0–100) | 40 × (1 − rank_pct_in_ept) + 20 × rating/5 + 15 × log(reviews+1)/log(1001) + 10 × in_scope + 15 × mapping confidence |
| trending_7d | Rank delta between earliest and latest observation in the last 7 days. Positive = rank improved. |
| volatility_14d | σ of rank over 14 days. Low + top = stable leader. High = CAPTCHA-flappy. |
| is_breakout (new) | First seen within 7 days AND already in top 10% of its Category. |
| brand_dom_pct | Share of this Category's top-20 held by this item's brand. |
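The composite_score formula translates directly into code. Field names below are placeholders for the item_signals inputs, and in_scope / mapping confidence are assumed to be normalized to 0..1:

```typescript
// Sketch: composite_score as stated in the table above.
function compositeScore(s: {
  rankPct: number;       // rank_pct_in_ept, 0..1
  rating: number;        // customer_rating, 0..5
  reviews: number;       // num_reviews
  inScope: boolean;
  mapConfidence: number; // 0..1, derived from the mapping method badge (assumption)
}): number {
  return 40 * (1 - s.rankPct)
       + 20 * (s.rating / 5)
       + 15 * (Math.log(s.reviews + 1) / Math.log(1001)) // saturates at 1000 reviews
       + 10 * (s.inScope ? 1 : 0)
       + 15 * s.mapConfidence;
}
```

A #1 item (rankPct ≈ 0) with a 5-star rating, 1000 reviews, in scope, and a deterministic CSV mapping scores the full 100.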

When things run

| Cron | Schedule | What |
| --- | --- | --- |
| best_sellers scrape | Every 6h (00:00, 06:00, 12:00, 18:00) | Best-seller page 1-25 per category. Fast refresh. |
| deep scrape | Nightly at 03:00 | best + price_low + price_high + top_rated + new. Broader coverage. |
| stale-item re-rank | Every 12h (at :30) | 5000 items not seen in >3 days get a targeted Eko API re-query to refresh rank/rating. |
| weekly enrichment | Sundays at 03:00 | 50,000 items >7 days old get fully re-enriched. |
| TTL cleanup | Daily at 04:30 | Drop scrape_task_log rows >30 days + item_category_ranks rows >90 days. |

Manual triggers

Available to operators without any auth. Call from anywhere:

POST /api/scrape?mode=best_sellers      # fast best-seller refresh
POST /api/scrape?mode=deep              # standard nightly scrape
POST /api/scrape?mode=coverage          # best + top_rated + new + 5 price brackets + brand facets
POST /api/enrich                        # enrich all unenriched items
POST /api/eko-import                    # re-map items (incremental)
POST /api/eko-import?reset=true         # re-map ALL items (force)
POST /api/leaderboard/refresh           # rebuild eko_leaderboard
POST /api/signals/refresh               # rebuild item_signals
POST /api/full-sweep                    # one-shot: catchall → enrich → remap → refresh everything
POST /api/custom-scrape                 # search walmart for a query (body: { query, max_items })
POST /api/discover-leaves?depth=2       # BFS-walk walmart taxonomy for new subcategories

GET  /api/self-test                     # end-to-end health check + auto-heal
GET  /api/diagnose?key=X&q=leaderboard-coverage
GET  /api/diagnose?key=X&q=refactor-health
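From a script, firing a trigger is just an HTTP POST. As a sketch, a tiny helper that builds the trigger URL (the base URL below is a placeholder, not the real deployment host):

```typescript
// Sketch: build a manual-trigger URL from the endpoint list above.
function triggerUrl(base: string, path: string, params: Record<string, string> = {}): string {
  const u = new URL(path, base);
  for (const [k, v] of Object.entries(params)) u.searchParams.set(k, v);
  return u.toString();
}

// Usage (hypothetical host):
//   fetch(triggerUrl("https://example.internal", "/api/scrape", { mode: "deep" }),
//         { method: "POST" })
```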

Observability

Every refresh step is wrapped in withJob() which writes a row into the jobs table (kind, status, started_at, finished_at, duration_seconds, result, error). The Scrape Activity → Pipeline timeline tab reads from there so you see refresh_stats, refresh_leaderboard, refresh_signals land after each scrape with their durations inline.

If something failed, the run detail view surfaces the exact error in a red banner so you don't have to tail logs.
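A minimal sketch of what a withJob()-style wrapper does, with an in-memory array standing in for the jobs table; the real signature and row columns may differ from this assumption:

```typescript
// Sketch: record kind, status, duration, and error for each wrapped step.
interface JobRow {
  kind: string;
  status: "ok" | "error";
  durationSeconds: number;
  error?: string;
}
const jobs: JobRow[] = []; // stand-in for the jobs table

async function withJob<T>(kind: string, fn: () => Promise<T>): Promise<T | undefined> {
  const started = Date.now();
  try {
    const result = await fn();
    jobs.push({ kind, status: "ok", durationSeconds: (Date.now() - started) / 1000 });
    return result;
  } catch (e) {
    jobs.push({
      kind, status: "error",
      durationSeconds: (Date.now() - started) / 1000,
      error: String(e), // surfaced by the run detail view's red banner
    });
    return undefined; // a failed step doesn't abort the whole refresh chain
  }
}
```

Swallowing the error (rather than rethrowing) matches the described behavior of later steps still landing in the timeline even when one fails; a stricter wrapper could rethrow instead.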

Walmart Watch Pipeline spec — updated with every deploy.