Data pipeline

From Walmart URL to curator-ready signal

This is what happens to every item from the moment an agent spots it on walmart.com to the moment it shows up in your list with a relative rank, composite score, and trending badge. Every step is observable in the Scrape Activity → Pipeline timeline.

1. Scrape → 2. Store → 3. Enrich → 4. Map → 5. Rank → 6. Leaderboard → 7. Signals

Vocabulary

The Eko hierarchy has two levels. Both were renamed recently, so keep that in mind when reading older docs and commits:

Eko Dept (40 total)

Top-level grouping. 24 Walmart-contract depts + 16 non-contract. Example: Consumer Electronics & Computing.

Eko Category (272 total)

Leaf grouping under a Dept. Examples: Smart Home & IoT, Headphones & Audio, Televisions & Displays.

The authoritative taxonomy lives in Eko_Taxonomy.md and walmart_to_eko_mapping.csv — the CSV has 51,558 rows mapping every Walmart L0/L1/L2/L3/L4 path to its Dept and Category.
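As a rough illustration of how that CSV turns into a lookup, here is a minimal in-memory loader. The column order and header names below are assumptions, not the real walmart_to_eko_mapping.csv schema, and a production loader would need quoted-field handling.

```typescript
// Sketch: read the mapping CSV into a Map keyed on the full Walmart path.
// ASSUMED column order: l0,l1,l2,l3,l4,eko_dept,eko_category
function parseMappingCsv(text: string): Map<string, { dept: string; category: string }> {
  const map = new Map<string, { dept: string; category: string }>();
  const lines = text.trim().split("\n").slice(1); // skip header row
  for (const line of lines) {
    const cols = line.split(","); // real loader needs quoted-field handling
    const path = cols.slice(0, 5).filter(Boolean).join("/"); // drop empty L3/L4 slots
    map.set(path, { dept: cols[5], category: cols[6] });
  }
  return map;
}
```

With 51,558 rows this fits comfortably in memory, so a single Map lookup per item is cheap.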

The seven steps

1

Scrape

· Browserbase + Playwright

Agent workers spin up a cloud Chromium session, navigate to a Walmart URL, parse __NEXT_DATA__, and extract item IDs + names. One agent runs per (Walmart category, mode) combo, with concurrency capped at 15 to stay under bot-detection thresholds.

Scrape modes

| Mode | URL shape | Used by |
| --- | --- | --- |
| best_sellers | /browse/best-sellers?cat_id=X | Every 6h cron |
| price_low / price_high | same, &sort=price_low | Nightly deep scrape |
| top_rated | same, &sort=rating_high | Nightly deep scrape |
| new | /shop?...&sort=new_arrival&affinityOverride=default | Nightly deep scrape |
| price_bracket_0-25 … 250+ | /shop?...&min_price=X&max_price=Y | Coverage mode — 5 brackets |
| brand_facet | /shop?...&facet=brand:X | Coverage mode — top 10 brands per category |
| search / custom_search | /search?q=X | Custom scrape tab |

Blocked pages (CAPTCHA / verify prompts) are detected by page title, fallback URLs are rotated, and a warmup visit to walmart.com/ primes anti-bot cookies before hitting /search.
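The extraction step can be sketched as a pure function over the parsed __NEXT_DATA__ payload. The nesting used here (props.pageProps.initialData.searchResult.itemStacks) is an assumption about the page shape, which Walmart changes periodically, so treat the paths as illustrative:

```typescript
// Sketch: pull item IDs + names out of a parsed __NEXT_DATA__ object.
// The JSON shape below is ASSUMED, not guaranteed by Walmart.
interface ScrapedItem { id: string; name: string; rank: number }

function extractItems(nextData: any): ScrapedItem[] {
  const stacks = nextData?.props?.pageProps?.initialData?.searchResult?.itemStacks ?? [];
  const out: ScrapedItem[] = [];
  for (const stack of stacks) {
    for (const item of stack.items ?? []) {
      if (!item.usItemId) continue; // skip ad tiles / null entries with no item ID
      out.push({ id: String(item.usItemId), name: item.name ?? "", rank: out.length + 1 });
    }
  }
  return out;
}
```

Keeping the parse step pure makes it trivially unit-testable without a browser session.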

2

Store

· batch upsert

Scraped items are chunked into groups of 100 and written with a single multi-row INSERT ... ON CONFLICT DO UPDATE per table. One 200-item scrape drops from ~600 serial queries to 3–6 batched ones.

Three tables get written per chunk:

  • items — upsert on walmart_item_id; merges the discovery_modes array.
  • rankings + item_category_ranks — per-item rank observation for this (Walmart category, mode). item_category_ranks uses scraped_day as part of the PK so multiple scrapes in one day collapse to one row with the best (lowest) rank.
  • scrape_run_items — per-run discovery log; lets the dashboard show "items discovered in this run" + CSV export.

Per-item errors inside the chunk get surfaced as a sample into scrape_runs.errors — no more silent "200 items found, 0 stored" mystery states.
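A minimal sketch of the chunk-and-upsert pattern, assuming a node-postgres style `$n` parameter syntax; the table and column set are abbreviated here, not the real schema:

```typescript
// Sketch: build one multi-row INSERT ... ON CONFLICT DO UPDATE per chunk.
function buildItemUpsert(items: { id: string; name: string }[]): { sql: string; params: string[] } {
  const placeholders = items
    .map((_, i) => `($${i * 2 + 1}, $${i * 2 + 2})`) // two params per row
    .join(", ");
  const sql =
    `INSERT INTO items (walmart_item_id, name) VALUES ${placeholders} ` +
    `ON CONFLICT (walmart_item_id) DO UPDATE SET name = EXCLUDED.name`;
  return { sql, params: items.flatMap((it) => [it.id, it.name]) };
}

function chunk<T>(arr: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
  return out;
}
// 200 items → chunk(items, 100) → 2 statements per table instead of 200 serial ones.
```

The real upsert also merges the discovery_modes array; that merge would live in the DO UPDATE clause.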

3

Enrich

· Eko Walmart API

Batched HTTP calls to srv.eko.com/products/get?ids=…. 50 item IDs per request, 5 concurrent requests per wave. A 200-item targeted enrichment dropped from ~4 minutes (serial) to ~2 seconds.

Response populates these columns on each item row:

  • name, brand, upc, msrp, sale_price, image_url
  • customer_rating, num_reviews, seller, marketplace
  • category_path — Walmart's internal L0/L1/L2/L3/L4 path
  • walmart_dept — the L0 value (APPAREL, CONSUMABLES, FOOD, HOME, HARDLINES, ENTERTAINMENT TOYS AND SEASONAL, HEALTH AND WELLNESS, WALMART SERVICES)
  • metadata — the full Eko API record as JSONB

After this step every item has the signals it needs to be mapped deterministically.
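The batching arithmetic behind those numbers can be sketched as a planning helper. `planWaves` is hypothetical; the real worker would fire each wave's batches in parallel (e.g. with Promise.all) against srv.eko.com/products/get?ids=…:

```typescript
// Sketch: group item IDs into request batches of 50, then into waves of
// 5 concurrent requests.
const BATCH_SIZE = 50;
const WAVE_WIDTH = 5;

function planWaves(ids: string[]): string[][][] {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += BATCH_SIZE) {
    batches.push(ids.slice(i, i + BATCH_SIZE));
  }
  const waves: string[][][] = [];
  for (let i = 0; i < batches.length; i += WAVE_WIDTH) {
    waves.push(batches.slice(i, i + WAVE_WIDTH));
  }
  return waves;
}
// 200 IDs → 4 batches → a single wave of 4 parallel requests,
// which is why the 200-item case finishes in seconds.
```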

4

Map

· multi-phase, CSV-first

Items are routed to their Dept + Category through a series of phases. Each phase only touches items the previous phases didn't resolve. The method tag each item ends up with surfaces in the UI as a confidence badge.

| Phase | Rule | Confidence |
| --- | --- | --- |
| 0. Data quality | Block grey-market, test listings, ghost sellers, digital/service items. | filter |
| 10.55. CSV-path match | JOIN walmart_eko_mapping on walmart_dept = L0 + progressive L1-L4 match. Deepest match wins. | csv_path_match |
| 10.6. Rank-category rescue | Items scraped from a specific Walmart seed category (e.g. "TVs", "Headphones") inherit the matching Eko Category deterministically. | rank_category_rescue |
| 10.7. Path-segment rescue | Regex-style match on category_path segments (e.g. /TV & Home Theater/). | path_rescue |
| 11. Keyword match | Match category_path / item name against keyword phrases per Category. 150+ EPTs with name-based exclusions. | ept_keyword_path / name |
| 11.5. Brand rescue | Private labels (e.g. onn., Great Value) routed by brand + keyword. | brand_rescue |
| 12. Catchall | Anything still unmatched lands in the Other / Uncategorized Category of its Dept. | catchall_other |
| 13. Dept fallback | Items with no category path at all get placed by their scrape's dept code. | dept_fallback |

CSV-path is the source of truth. Every enriched item with a walmart_dept and a non-empty category_path gets a 100% deterministic match from the 51k-row CSV. The keyword-based phases 11+ only fire when the CSV didn't cover it.
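The "deepest match wins" join can be sketched in memory. The row shape here is illustrative, not the real CSV columns; the production version does this as a SQL JOIN:

```typescript
// Sketch: find the deepest mapping row whose full path is a prefix of the
// item's category_path, constrained to the same L0 dept.
interface MappingRow { path: string[]; dept: string; category: string }

function csvPathMatch(
  deptL0: string,
  itemPath: string[],
  rows: MappingRow[],
): { dept: string; category: string; depth: number } | null {
  let best: { dept: string; category: string; depth: number } | null = null;
  for (const row of rows) {
    if (row.path[0] !== deptL0) continue; // L0 must agree
    let depth = 0;
    while (depth < row.path.length && depth < itemPath.length
        && row.path[depth] === itemPath[depth]) depth++;
    // Only a full-row prefix counts; deeper rows beat shallower ones.
    if (depth === row.path.length && (!best || depth > best.depth)) {
      best = { dept: row.dept, category: row.category, depth };
    }
  }
  return best; // null → item falls through to phases 10.6+
}
```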
5

Rank refresh

· denormalize

Scanning rankings for every item list query would be slow. Instead, one pass writes summary columns onto each items row:

  • current_rank / previous_rank / rank_change / rank_direction
  • rank_history_json — last 10 ranks as a JSONB array (drives the sparklines in the item list)

Fires after every scrape. Uses window functions to compute "latest rank per item per Walmart category" in one query.
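Per item, the denormalized columns reduce to a small amount of math over the rank history. A sketch of that per-item math (the production version computes it in one SQL pass with window functions, and field names here are placeholders):

```typescript
// Sketch: derive the summary columns from a rank history, oldest first.
// Lower rank is better, so previous - current > 0 means the item moved up.
function rankSummary(history: number[]) {
  const current = history[history.length - 1] ?? null;
  const previous = history.length > 1 ? history[history.length - 2] : null;
  const change = current !== null && previous !== null ? previous - current : null;
  const direction =
    change === null ? "new" : change > 0 ? "up" : change < 0 ? "down" : "flat";
  return { current, previous, change, direction, sparkline: history.slice(-10) };
}
```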

6

Leaderboard

· per-Category ranking

Each item's rank within its Eko Category (the lowest level) and its Eko Dept, stored in the eko_leaderboard table:

  • rank_in_ept — 1 = best item in this Category right now
  • rank_pct_in_ept — percentile within the Category, 0..1 (e.g. 0.006 = top 0.6%, 0.14 = top 14%)
  • rank_in_eko_cat / rank_pct_in_eko_cat — same but scoped to the Dept
  • best_rank + best_mode — which scrape mode (best-sellers / top-rated / price-low) saw the best rank

Rebuilt after every rank refresh via a CTE that takes each item's best (lowest) rank over a 7-day window and assigns per-partition row numbers with a window function. Caps at the top 5000 per Category.
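An in-memory sketch of that pass, to make the shape concrete (the real version is a SQL CTE with ROW_NUMBER() OVER (PARTITION BY category ORDER BY rank); names here are illustrative):

```typescript
// Sketch: best rank per item over the window, then per-Category row numbers.
interface Obs { itemId: string; category: string; rank: number }

function buildLeaderboard(obs: Obs[]) {
  const best = new Map<string, Obs>();
  for (const o of obs) { // keep best (lowest) rank per item
    const prev = best.get(o.itemId);
    if (!prev || o.rank < prev.rank) best.set(o.itemId, o);
  }
  const byCat = new Map<string, Obs[]>();
  for (const o of best.values()) {
    if (!byCat.has(o.category)) byCat.set(o.category, []);
    byCat.get(o.category)!.push(o);
  }
  const rows: { itemId: string; rankInEpt: number; rankPct: number }[] = [];
  for (const group of byCat.values()) {
    group.sort((a, b) => a.rank - b.rank);
    group.slice(0, 5000).forEach((o, i) => // cap at top 5000 per Category
      rows.push({ itemId: o.itemId, rankInEpt: i + 1, rankPct: (i + 1) / group.length }));
  }
  return rows;
}
```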

7

Signals

· curation-ready metrics

Per-item computed metrics so the UI can show Top · Trending · Breakouts · Stable leaders without recomputing on each request.

| Signal | Formula |
| --- | --- |
| composite_score (0–100) | 40 × (1 − rank_pct_in_ept) + 20 × rating/5 + 15 × log(reviews+1)/log(1001) + 10 × in_scope + 15 × mapping confidence |
| trending_7d | Rank delta between earliest and latest observation in the last 7 days. Positive = rank improved. |
| volatility_14d | σ of rank over 14 days. Low + top = stable leader. High = CAPTCHA-flappy. |
| is_breakout (new) | First seen within 7 days AND already in top 10% of its Category. |
| brand_dom_pct | Share of this Category's top-20 held by this item's brand. |
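The composite_score formula translates directly into code. Field names below are placeholders for the item_signals inputs, and in_scope / mapping confidence are assumed to be normalized to 0..1:

```typescript
// Sketch: composite_score as stated in the table above.
function compositeScore(s: {
  rankPct: number;       // rank_pct_in_ept, 0..1
  rating: number;        // customer_rating, 0..5
  reviews: number;       // num_reviews
  inScope: boolean;
  mapConfidence: number; // 0..1, derived from the mapping method badge (assumption)
}): number {
  return 40 * (1 - s.rankPct)
       + 20 * (s.rating / 5)
       + 15 * (Math.log(s.reviews + 1) / Math.log(1001)) // saturates at 1000 reviews
       + 10 * (s.inScope ? 1 : 0)
       + 15 * s.mapConfidence;
}
```

A #1 item (rankPct ≈ 0) with a 5-star rating, 1000 reviews, in scope, and a deterministic CSV mapping scores the full 100.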

When things run

| Cron | Schedule | What |
| --- | --- | --- |
| best_sellers scrape | Every 6h (00:00, 06:00, 12:00, 18:00) | Best-seller page 1-25 per category. Fast refresh. |
| deep scrape | Nightly at 03:00 | best + price_low + price_high + top_rated + new. Broader coverage. |
| stale-item re-rank | Every 12h (at :30) | 5000 items not seen in >3 days get a targeted Eko API re-query to refresh rank/rating. |
| weekly enrichment | Sundays at 03:00 | 50,000 items >7 days old get fully re-enriched. |
| TTL cleanup | Daily at 04:30 | Drop scrape_task_log rows >30 days + item_category_ranks rows >90 days. |

Manual triggers

Available to operators without any auth. Call from anywhere:

POST /api/scrape?mode=best_sellers      # fast best-seller refresh
POST /api/scrape?mode=deep              # standard nightly scrape
POST /api/scrape?mode=coverage          # best + top_rated + new + 5 price brackets + brand facets
POST /api/enrich                        # enrich all unenriched items
POST /api/eko-import                    # re-map items (incremental)
POST /api/eko-import?reset=true         # re-map ALL items (force)
POST /api/leaderboard/refresh           # rebuild eko_leaderboard
POST /api/signals/refresh               # rebuild item_signals
POST /api/full-sweep                    # one-shot: catchall → enrich → remap → refresh everything
POST /api/custom-scrape                 # search walmart for a query (body: { query, max_items })
POST /api/discover-leaves?depth=2       # BFS-walk walmart taxonomy for new subcategories

GET  /api/self-test                     # end-to-end health check + auto-heal
GET  /api/diagnose?key=X&q=leaderboard-coverage
GET  /api/diagnose?key=X&q=refactor-health
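From a script, firing a trigger is just an HTTP POST. As a sketch, a tiny helper that builds the trigger URL (the base URL below is a placeholder, not the real deployment host):

```typescript
// Sketch: build a manual-trigger URL from the endpoint list above.
function triggerUrl(base: string, path: string, params: Record<string, string> = {}): string {
  const u = new URL(path, base);
  for (const [k, v] of Object.entries(params)) u.searchParams.set(k, v);
  return u.toString();
}

// Usage (hypothetical host):
//   fetch(triggerUrl("https://example.internal", "/api/scrape", { mode: "deep" }),
//         { method: "POST" })
```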

Observability

Every refresh step is wrapped in withJob() which writes a row into the jobs table (kind, status, started_at, finished_at, duration_seconds, result, error). The Scrape Activity → Pipeline timeline tab reads from there so you see refresh_stats, refresh_leaderboard, refresh_signals land after each scrape with their durations inline.

If something failed, the run detail view surfaces the exact error in a red banner so you don't have to tail logs.
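A minimal sketch of what a withJob()-style wrapper does, with an in-memory array standing in for the jobs table; the real signature and row columns may differ from this assumption:

```typescript
// Sketch: record kind, status, duration, and error for each wrapped step.
interface JobRow {
  kind: string;
  status: "ok" | "error";
  durationSeconds: number;
  error?: string;
}
const jobs: JobRow[] = []; // stand-in for the jobs table

async function withJob<T>(kind: string, fn: () => Promise<T>): Promise<T | undefined> {
  const started = Date.now();
  try {
    const result = await fn();
    jobs.push({ kind, status: "ok", durationSeconds: (Date.now() - started) / 1000 });
    return result;
  } catch (e) {
    jobs.push({
      kind, status: "error",
      durationSeconds: (Date.now() - started) / 1000,
      error: String(e), // surfaced by the run detail view's red banner
    });
    return undefined; // a failed step doesn't abort the whole refresh chain
  }
}
```

Swallowing the error (rather than rethrowing) matches the described behavior of later steps still landing in the timeline even when one fails; a stricter wrapper could rethrow instead.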

Walmart Watch Pipeline spec — updated with every deploy.