This is what happens to every item from the moment an agent spots it on walmart.com to the moment it shows up in your list with a relative rank, composite score, and trending badge. Every step is observable in the Scrape Activity → Pipeline timeline.
Eko's hierarchy has two levels, both renamed recently — keep this in mind when reading older docs and commits:

- **Dept** — top-level grouping. 24 Walmart-contract depts plus 16 non-contract. Example: Consumer Electronics & Computing.
- **Category** — leaf grouping under a Dept. Examples: Smart Home & IoT, Headphones & Audio, Televisions & Displays.
The authoritative taxonomy lives in Eko_Taxonomy.md and walmart_to_eko_mapping.csv — the CSV has 51,558 rows mapping every Walmart L0/L1/L2/L3/L4 path to its Dept and Category.
Agent workers spin up a cloud Chromium session, navigate to a Walmart URL, parse __NEXT_DATA__, and extract item IDs + names. Runs one agent per (walmart category, mode) combo with concurrency capped at 15 to stay under bot-detection thresholds.
| Mode | URL shape | Used by |
|---|---|---|
| best_sellers | /browse/best-sellers?cat_id=X | Every 6h cron |
| price_low / price_high | same, with &sort=price_low / &sort=price_high | Nightly deep scrape |
| top_rated | same, with &sort=rating_high | Nightly deep scrape |
| new | /shop?...&sort=new_arrival&affinityOverride=default | Nightly deep scrape |
| price_bracket_0-25 … 250+ | /shop?...&min_price=X&max_price=Y | Coverage mode — 5 brackets |
| brand_facet | /shop?...&facet=brand:X | Coverage mode — top 10 brands per category |
| search / custom_search | /search?q=X | Custom scrape tab |
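The (category, mode) fan-out with its concurrency cap can be sketched like this — `runWithCap` and `Job` are illustrative names, and the `worker` callback stands in for the real cloud-Chromium agent:

```typescript
// Minimal sketch of the concurrency cap: at most `cap` (category, mode)
// agents in flight at once. Results arrive in completion order.
type Job = { categoryId: string; mode: string };

async function runWithCap<T>(
  jobs: Job[],
  worker: (job: Job) => Promise<T>,
  cap = 15, // stays under the bot-detection threshold per the doc
): Promise<T[]> {
  const results: T[] = [];
  let next = 0;
  // Each lane repeatedly claims the next unclaimed job until the queue drains.
  // Claiming (next++) happens synchronously, so no two lanes get the same job.
  const lane = async () => {
    while (next < jobs.length) {
      const job = jobs[next++];
      results.push(await worker(job));
    }
  };
  await Promise.all(Array.from({ length: Math.min(cap, jobs.length) }, lane));
  return results;
}
```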
Blocked pages (CAPTCHA / verify prompts) are detected by page title and handled by rotating through fallback URLs; a warmup visit to walmart.com/ primes anti-bot cookies before hitting /search.
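The title-based block check amounts to something like the following — the exact title patterns here are assumptions (the real list lives in the scraper):

```typescript
// Hypothetical title patterns for Walmart's block/verify interstitials.
const BLOCK_TITLE_PATTERNS = [/robot or human/i, /verify/i, /access denied/i];

// Returns true when a page title looks like a CAPTCHA / verification wall,
// signaling the agent to rotate to a fallback URL.
function looksBlocked(pageTitle: string): boolean {
  return BLOCK_TITLE_PATTERNS.some((re) => re.test(pageTitle));
}
```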
Scraped items are chunked into groups of 100 and written with a single multi-row INSERT ... ON CONFLICT DO UPDATE per table. One 200-item scrape drops from ~600 serial queries to 3–6 batched ones.
Three tables get written per chunk:
- items — upsert on walmart_item_id; merges the discovery_modes array.
- rankings + item_category_ranks — the per-item rank observation for this (Walmart category, mode). item_category_ranks uses scraped_day as part of the PK, so multiple scrapes in one day collapse to one row holding the best (lowest) rank.
- scrape_run_items — per-run discovery log; lets the dashboard show "items discovered in this run" plus CSV export.

Per-item errors inside the chunk get surfaced as a sample into scrape_runs.errors — no more silent "200 items found, 0 stored" mystery states.
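The chunked write can be sketched as a builder for the multi-row upsert. Column names follow the doc; the array-merge expression and the pg-style `$n` placeholders are assumptions about the SQL client:

```typescript
// Builds one parameterized multi-row INSERT ... ON CONFLICT DO UPDATE for a
// chunk of up to 100 scraped items (3 bind values per row).
type ScrapedItem = { walmartItemId: string; name: string; discoveryMode: string };

function buildItemsUpsert(chunk: ScrapedItem[]): { text: string; values: unknown[] } {
  const rows: string[] = [];
  const values: unknown[] = [];
  chunk.forEach((it, i) => {
    const base = i * 3; // placeholder offset for this row
    rows.push(`($${base + 1}, $${base + 2}, ARRAY[$${base + 3}])`);
    values.push(it.walmartItemId, it.name, it.discoveryMode);
  });
  const text =
    `INSERT INTO items (walmart_item_id, name, discovery_modes) ` +
    `VALUES ${rows.join(", ")} ` +
    `ON CONFLICT (walmart_item_id) DO UPDATE SET name = EXCLUDED.name, ` +
    // Merge rather than overwrite the discovery_modes array (deduplicated).
    `discovery_modes = ARRAY(SELECT DISTINCT unnest(items.discovery_modes || EXCLUDED.discovery_modes))`;
  return { text, values };
}
```

One statement per chunk is what turns ~600 serial queries into the 3–6 batched ones the doc cites.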
Batched HTTP calls to srv.eko.com/products/get?ids=…. 50 item IDs per request, 5 concurrent requests per wave. A 200-item targeted enrichment dropped from ~4 minutes (serial) to ~2 seconds.
Response populates these columns on each item row:
- name, brand, upc, msrp, sale_price, image_url
- customer_rating, num_reviews, seller, marketplace
- category_path — Walmart's internal L0/L1/L2/L3/L4 path
- walmart_dept — the L0 value (APPAREL, CONSUMABLES, FOOD, HOME, HARDLINES, ENTERTAINMENT TOYS AND SEASONAL, HEALTH AND WELLNESS, WALMART SERVICES)
- metadata — the full Eko API record as JSONB

After this step, every item has the signals it needs to be mapped deterministically.
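The batch-and-wave shape of the enrichment fan-out can be sketched as follows, with `fetchBatch` standing in for the real GET to srv.eko.com/products/get:

```typescript
// Fan out enrichment requests: `batchSize` IDs per request, `concurrency`
// requests in flight per wave. Defaults mirror the doc (50 IDs, 5 concurrent).
async function enrichAll<T>(
  ids: string[],
  fetchBatch: (ids: string[]) => Promise<T[]>,
  batchSize = 50,
  concurrency = 5,
): Promise<T[]> {
  const batches: string[][] = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  const out: T[] = [];
  // Process batches wave by wave, each wave running up to `concurrency` requests.
  for (let i = 0; i < batches.length; i += concurrency) {
    const wave = batches.slice(i, i + concurrency);
    const settled = await Promise.all(wave.map(fetchBatch));
    for (const r of settled) out.push(...r);
  }
  return out;
}
```

Under these defaults, a 200-item enrichment is a single wave of four requests, which matches the serial-to-parallel speedup described above.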
Items are routed to their Dept + Category through a series of phases. Each phase only touches items the previous phases didn't resolve. The method tag each item ends up with surfaces in the UI as a confidence badge.
| Phase | Rule | Confidence |
|---|---|---|
| 0. Data quality | Block grey-market, test listings, ghost sellers, digital/service items. | filter |
| 10.55. CSV-path match | JOIN walmart_eko_mapping on walmart_dept = L0 + progressive L1-L4 match. Deepest match wins. | csv_path_match |
| 10.6. Rank-category rescue | Items scraped from a specific Walmart seed category (e.g. "TVs", "Headphones") inherit the matching Eko Category deterministically. | rank_category_rescue |
| 10.7. Path-segment rescue | Regex-style match on category_path segments (e.g. /TV & Home Theater/). | path_rescue |
| 11. Keyword match | Match category_path / item name against keyword phrases per Category. 150+ EPTs with name-based exclusions. | ept_keyword_path / name |
| 11.5. Brand rescue | Private labels (e.g. onn., Great Value) routed by brand + keyword. | brand_rescue |
| 12. Catchall | Anything still unmatched lands in the Other / Uncategorized Category of its Dept. | catchall_other |
| 13. Dept fallback | Items with no category path at all get placed by their scrape's dept code. | dept_fallback |
An item with a walmart_dept and a non-empty category_path gets a 100% deterministic match from the 51k-row CSV; the keyword-based phases (11+) only fire when the CSV didn't cover it.
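Phase 10.55's "deepest match wins" rule can be sketched over an in-memory mapping — the Map stands in for the 51k-row CSV, and the `"L0/L1/.../Ln"` key format is an assumption about how the join is keyed:

```typescript
type EkoTarget = { dept: string; category: string };

// Try the full L0..L4 path first, then progressively shorter prefixes,
// so the deepest mapping row that exists wins.
function resolveByCsvPath(
  mapping: Map<string, EkoTarget>,
  walmartDept: string,     // the L0 value, e.g. HARDLINES
  categoryPath: string[],  // L1..L4 segments from category_path
): EkoTarget | null {
  for (let depth = categoryPath.length; depth >= 0; depth--) {
    const key = [walmartDept, ...categoryPath.slice(0, depth)].join("/");
    const hit = mapping.get(key);
    if (hit) return hit;
  }
  return null; // falls through to the rescue / keyword phases
}
```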
Scanning rankings for every item list query would be slow. Instead, one pass writes summary columns onto each items row:
- current_rank / previous_rank / rank_change / rank_direction
- rank_history_json — last 10 ranks as a JSONB array (drives the sparklines in the item list)

This pass fires after every scrape and uses window functions to compute the "latest rank per item per Walmart category" in one query.
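Per item, the summary-column update amounts to the following (an in-memory sketch with illustrative names; the real pass does it set-wide in SQL):

```typescript
type RankSummary = {
  currentRank: number | null;
  previousRank: number | null;
  rankChange: number | null;
  rankDirection: "up" | "down" | "flat" | null;
  rankHistory: number[]; // last 10 ranks, newest last — serialized as rank_history_json
};

// Fold one new rank observation into the summary columns.
function applyRankObservation(prev: RankSummary, newRank: number): RankSummary {
  const previousRank = prev.currentRank;
  // Lower rank numbers are better, so a negative delta means the item moved up.
  const rankChange = previousRank == null ? null : newRank - previousRank;
  return {
    currentRank: newRank,
    previousRank,
    rankChange,
    rankDirection:
      rankChange == null ? null : rankChange < 0 ? "up" : rankChange > 0 ? "down" : "flat",
    rankHistory: [...prev.rankHistory, newRank].slice(-10), // cap at 10 entries
  };
}
```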
Each item's rank within its Eko Category (the lowest level) and its Eko Dept, stored in the eko_leaderboard table:
- rank_in_ept — 1 = best item in this Category right now
- rank_pct_in_ept — 0..1; e.g. top 0.6% vs. top 14%
- rank_in_eko_cat / rank_pct_in_eko_cat — the same, but scoped to the Dept
- best_rank + best_mode — which scrape mode (best-sellers / top-rated / price-low) saw the best rank

Rebuilt after every rank refresh via a CTE that takes the best (lowest) rank per item over a 7-day window and window-functions it into per-partition row numbers. Capped at the top 5000 per Category.
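An in-memory sketch of that rebuild logic — best rank per item, then per-Category row numbers — which the real job expresses as a CTE plus ROW_NUMBER() OVER (PARTITION BY …); names here are illustrative:

```typescript
type Obs = { itemId: string; category: string; rank: number };
type LeaderboardRow = { itemId: string; category: string; rankInEpt: number; rankPctInEpt: number };

function buildLeaderboard(observations: Obs[], cap = 5000): LeaderboardRow[] {
  // Best (lowest) rank per item — the MIN(rank)-over-the-window CTE.
  const best = new Map<string, Obs>();
  for (const o of observations) {
    const prev = best.get(o.itemId);
    if (!prev || o.rank < prev.rank) best.set(o.itemId, o);
  }
  // Partition by Category — PARTITION BY in SQL terms.
  const byCategory = new Map<string, Obs[]>();
  best.forEach((o) => {
    if (!byCategory.has(o.category)) byCategory.set(o.category, []);
    byCategory.get(o.category)!.push(o);
  });
  // Order by rank and number the rows, capping each Category at `cap`.
  const rows: LeaderboardRow[] = [];
  byCategory.forEach((list) => {
    list.sort((a, b) => a.rank - b.rank);
    list.slice(0, cap).forEach((o, i) =>
      rows.push({
        itemId: o.itemId,
        category: o.category,
        rankInEpt: i + 1,                    // 1 = best item in the Category
        rankPctInEpt: (i + 1) / list.length, // 0..1 percentile
      }),
    );
  });
  return rows;
}
```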
Per-item computed metrics so the UI can show Top · Trending · Breakouts · Stable leaders without recomputing on each request.
| Signal | Formula |
|---|---|
| composite_score (0–100) | 40 × (1 − rank_pct_in_ept) + 20 × rating/5 + 15 × log(reviews+1)/log(1001) + 10 × in_scope + 15 × mapping confidence |
| trending_7d | Rank delta between the earliest and latest observation in the last 7 days. Positive = rank improved. |
| volatility_14d | σ of rank over 14 days. Low σ + top rank = stable leader. High σ = CAPTCHA-flappy. |
| is_breakout | First seen within 7 days AND already in the top 10% of its Category. |
| brand_dom_pct | Share of this Category's top-20 held by this item's brand. |
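The composite_score formula transcribes directly; the one assumption here is that mapping confidence arrives pre-normalized to 0..1 (e.g. 1.0 for csv_path_match, lower for catchall):

```typescript
// Direct transcription of the composite_score formula from the table above.
function compositeScore(s: {
  rankPctInEpt: number;      // 0..1, lower is better
  rating: number;            // customer_rating, 0..5
  numReviews: number;        // num_reviews
  inScope: boolean;
  mappingConfidence: number; // assumed normalized to 0..1
}): number {
  return (
    40 * (1 - s.rankPctInEpt) +
    20 * (s.rating / 5) +
    // log(reviews+1)/log(1001) saturates at 1 around 1,000 reviews.
    15 * (Math.log(s.numReviews + 1) / Math.log(1001)) +
    10 * (s.inScope ? 1 : 0) +
    15 * s.mappingConfidence
  );
}
```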
| Cron | Schedule | What |
|---|---|---|
| best_sellers scrape | Every 6h (00:00, 06:00, 12:00, 18:00) | Best-seller pages 1–25 per category. Fast refresh. |
| deep scrape | Nightly at 03:00 | best + price_low + price_high + top_rated + new. Broader coverage. |
| stale-item re-rank | Every 12h (at :30) | 5000 items not seen in >3 days get a targeted Eko API re-query to refresh rank/rating. |
| weekly enrichment | Sundays at 03:00 | 50,000 items >7 days old get fully re-enriched. |
| TTL cleanup | Daily at 04:30 | Drop scrape_task_log rows >30 days old and item_category_ranks rows >90 days old. |
Available to operators without any auth. Call from anywhere:
POST /api/scrape?mode=best_sellers # fast best-seller refresh
POST /api/scrape?mode=deep # standard nightly scrape
POST /api/scrape?mode=coverage # best + top_rated + new + 5 price brackets + brand facets
POST /api/enrich # enrich all unenriched items
POST /api/eko-import # re-map items (incremental)
POST /api/eko-import?reset=true # re-map ALL items (force)
POST /api/leaderboard/refresh # rebuild eko_leaderboard
POST /api/signals/refresh # rebuild item_signals
POST /api/full-sweep # one-shot: catchall → enrich → remap → refresh everything
POST /api/custom-scrape # search walmart for a query (body: { query, max_items })
POST /api/discover-leaves?depth=2 # BFS-walk walmart taxonomy for new subcategories
GET /api/self-test # end-to-end health check + auto-heal
GET /api/diagnose?key=X&q=leaderboard-coverage
GET /api/diagnose?key=X&q=refactor-health
Every refresh step is wrapped in withJob() which writes a row into the jobs table (kind, status, started_at, finished_at, duration_seconds, result, error). The Scrape Activity → Pipeline timeline tab reads from there so you see refresh_stats, refresh_leaderboard, refresh_signals land after each scrape with their durations inline.
If something failed, the run detail view surfaces the exact error in a red banner so you don't have to tail logs.