
Domain skill

walmart

Markdown synced from browser-harness domain skills.

Agent prompt

Use this skill

Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.

Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are.

Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `walmart` domain skill from `agent-workspace/domain-skills/walmart/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/walmart/scraping.md

Use those domain-skill notes to complete my task for `walmart` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.

Skill contents

What the agent will read

Product Search & Data Extraction

scraping.md


Field-tested against walmart.com on 2026-04-18 using http_get (no browser required). All code blocks were run and outputs verified against live responses.


Fastest Approach: http_get with __NEXT_DATA__

Walmart's Next.js SSR embeds the full search or product payload as JSON in a <script id="__NEXT_DATA__"> tag. No browser needed for search or product detail pages. ~2–3 s per page fetch; no CAPTCHA or session cookies required.

Critical UA rule

| User-Agent | Result |
|---|---|
| Mozilla/5.0 (bare) | Full HTML + __NEXT_DATA__ (use this) |
| Mozilla/5.0 ... Chrome/120 ... (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
| Safari/17 full UA | Works (full HTML, ~1.15 MB) |
| curl/7.x | PerimeterX challenge |
| python-requests/2.31 | PerimeterX challenge |

The bare Mozilla/5.0 string bypasses PerimeterX. In the tests above, the full Chrome UA and the client-library UAs (curl, python-requests) triggered the JS challenge page; the full Safari/17 UA was the only fuller string that still returned the page.

Base fetch helper

python
import json, re, gzip, urllib.request

def fetch_walmart(url):
    """
    Fetch any walmart.com page.
    Returns decoded HTML string.
    Raises RuntimeError if PerimeterX bot challenge is returned.
    """
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        data = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            data = gzip.decompress(data)
        html = data.decode()
    if "Robot or human" in html:
        raise RuntimeError(f"PerimeterX challenge triggered: {url}")
    return html

def parse_next_data(html):
    m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
    return json.loads(m.group(1))
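
As a quick offline sanity check of the regex used in parse_next_data, the sketch below runs the same pattern against synthetic pages: one with a bare id attribute, one with an extra attribute, and one with no tag at all. The HTML strings and the extract_next_data name are made up for illustration; they are not real Walmart responses.

```python
import json
import re

# Same pattern as parse_next_data: match on the id attribute and tolerate
# any other attributes before the closing ">".
NEXT_DATA_RE = re.compile(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL)

def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    m = NEXT_DATA_RE.search(html)
    return json.loads(m.group(1)) if m else None

# Synthetic pages (illustrative only, not real Walmart responses).
bare = '<html><script id="__NEXT_DATA__">{"props": {"ok": 1}}</script></html>'
typed = ('<html><script id="__NEXT_DATA__" type="application/json">'
         '{"props": {"ok": 2}}</script></html>')
no_tag = "<html><body>Robot or human?</body></html>"

print(extract_next_data(bare))    # {'props': {'ok': 1}}
print(extract_next_data(typed))   # {'props': {'ok': 2}}
print(extract_next_data(no_tag))  # None
```

Returning None instead of raising is a design choice for this sketch only; parse_next_data above raises ValueError, which is the safer default inside a scraper.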

Search Results

URL patterns

python
# Keyword search
"https://www.walmart.com/search?q=laptop"

# Pagination — append &page=N
"https://www.walmart.com/search?q=laptop&page=2"

# Sort options (confirmed working)
"https://www.walmart.com/search?q=laptop&sort=best_match"    # default
"https://www.walmart.com/search?q=laptop&sort=best_seller"
"https://www.walmart.com/search?q=laptop&sort=price_low"
"https://www.walmart.com/search?q=laptop&sort=customer_rating"

# Price filter
"https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"

# Browse by category (department ID path)
"https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"

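The query parameters above can be assembled programmatically rather than by string concatenation. This is a minimal sketch: build_search_url is a hypothetical helper, and combining sort, page, and price filters in one URL is an assumption, since the patterns above were only listed individually.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.walmart.com/search"

def build_search_url(query, page=None, sort=None, min_price=None, max_price=None):
    """Assemble a search URL; parameters left as None are omitted."""
    params = {"q": query}
    optional = {
        "page": page,
        "sort": sort,
        "min_price": min_price,
        "max_price": max_price,
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    # urlencode handles spaces and special characters in the query.
    return f"{BASE_SEARCH_URL}?{urlencode(params)}"

print(build_search_url("laptop", page=2, sort="price_low"))
# https://www.walmart.com/search?q=laptop&page=2&sort=price_low
```
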
__NEXT_DATA__ path to items

code
data
  .props.pageProps.initialData.searchResult
    .aggregatedCount        — int: total matching products (e.g. 18818)
    .paginationV2.maxPage   — int: last page number
    .itemStacks[]           — array of stacks (usually 2: sponsored + organic)
      .items[]              — array of product objects
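
When probing this path interactively, a missing key anywhere in the chain raises KeyError. One way to walk it defensively is sketched below. The dig helper is hypothetical, and the sample dict only mimics the shape described above with made-up values.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts and lists; return default if any step is missing."""
    for key in keys:
        if isinstance(obj, dict):
            obj = obj.get(key)
        elif isinstance(obj, list) and isinstance(key, int) and 0 <= key < len(obj):
            obj = obj[key]
        else:
            return default
        if obj is None:
            return default
    return obj

# Synthetic payload with the same shape as the path above (values invented).
sample = {"props": {"pageProps": {"initialData": {"searchResult": {
    "aggregatedCount": 3,
    "paginationV2": {"maxPage": 1},
    "itemStacks": [{"items": [{"usItemId": "1"}]}, {"items": []}],
}}}}}

sr = dig(sample, "props", "pageProps", "initialData", "searchResult", default={})
print(dig(sr, "paginationV2", "maxPage"))                # 1
print(dig(sr, "itemStacks", 0, "items", 0, "usItemId"))  # 1
print(dig(sample, "props", "missing", "searchResult"))   # None
```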

Full extractor (field-tested)

python
def extract_search_results(html):
    """
    Returns (items, total_count, max_page).
    items is a list of dicts with confirmed fields.
    """
    data = parse_next_data(html)
    sr = data["props"]["pageProps"]["initialData"]["searchResult"]

    items = []
    for stack in sr.get("itemStacks", []):
        for item in stack.get("items", []):
            pi = item.get("priceInfo") or {}
            img = item.get("imageInfo") or {}
            rating = item.get("rating") or {}
            avail = item.get("availabilityStatusV2") or {}
            items.append({
                "usItemId":        item.get("usItemId"),           # str, Walmart item ID
                "name":            item.get("name"),               # str
                "brand":           item.get("brand"),              # str or None
                "price":           item.get("price"),              # int, current price in USD
                "linePrice":       pi.get("linePrice"),            # str "$429.00"
                "wasPrice":        pi.get("wasPrice") or None,     # str "$699.00" or None
                "savings":         pi.get("savings") or None,      # str "SAVE $270.00" or None
                "averageRating":   rating.get("averageRating"),    # float e.g. 4.3
                "numberOfReviews": rating.get("numberOfReviews"),  # int
                "availability":    avail.get("value"),             # "IN_STOCK" / "OUT_OF_STOCK"
                "isSponsored":     bool(item.get("isSponsoredFlag")),
                "url":             "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
                "thumbnailUrl":    img.get("thumbnailUrl"),
            })

    total = sr.get("aggregatedCount")
    max_page = (sr.get("paginationV2") or {}).get("maxPage")
    return items, total, max_page


# Usage
html = fetch_walmart("https://www.walmart.com/search?q=laptop")
items, total, max_page = extract_search_results(html)
# items: 66 items on page 1, total=18818, max_page=11

# Filter out sponsored
organic = [i for i in items if not i["isSponsored"]]

Field notes (confirmed)

  • usItemId: string, matches the numeric ID at the end of /ip/.../ITEMID URLs. Some non-product rows (ad widgets) have usItemId=None — filter with if item.get("usItemId").
  • price: numeric dollar amount with no currency symbol, not cents (e.g. 429 for "$429.00"). Use priceInfo.linePrice for the formatted string including the dollar sign.
  • wasPrice / savings: only present when the item is on sale. Always None for full-price items.
  • isSponsoredFlag: the first batch of results across both itemStacks is frequently sponsored. On a laptop search, ~56 of 66 SSR items carried isSponsoredFlag: true.
  • rating: present on ~91% of items (60/66 in test). averageRating is a float; numberOfReviews is int.
  • canonicalUrl: always includes ?classType=...&athbdg=... query params — strip with .split("?")[0] to get a clean URL.
  • Two itemStacks: Walmart returns two stacks (itemStacks[0] and itemStacks[1]). Merge them. itemStacks[0] is the primary grid; itemStacks[1] is a secondary sponsored/related block.
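
The filtering steps in these notes can be combined into one post-processing pass. This is a sketch under assumptions: clean_items and split_sponsored are hypothetical helpers, the sample rows are invented, and de-duplicating by usItemId is a precaution in case the same product appears in both stacks, which was not confirmed in testing.

```python
def clean_items(items):
    """Drop rows without a usItemId and keep the first row per usItemId."""
    seen = set()
    cleaned = []
    for item in items:
        item_id = item.get("usItemId")
        if not item_id or item_id in seen:
            continue  # ad widget (no ID) or repeat of an earlier row
        seen.add(item_id)
        cleaned.append(item)
    return cleaned

def split_sponsored(items):
    """Return (organic, sponsored) lists based on the isSponsored flag."""
    organic = [i for i in items if not i.get("isSponsored")]
    sponsored = [i for i in items if i.get("isSponsored")]
    return organic, sponsored

# Invented rows in the shape produced by extract_search_results.
rows = [
    {"usItemId": "111", "name": "A", "isSponsored": True},
    {"usItemId": None, "name": "ad widget", "isSponsored": False},
    {"usItemId": "222", "name": "B", "isSponsored": False},
    {"usItemId": "111", "name": "A (repeat)", "isSponsored": False},
]

cleaned = clean_items(rows)
organic, sponsored = split_sponsored(cleaned)
print([i["usItemId"] for i in cleaned])    # ['111', '222']
print([i["usItemId"] for i in organic])    # ['222']
print([i["usItemId"] for i in sponsored])  # ['111']
```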

Pagination

python
for page in range(1, (max_page or 1) + 1):  # max_page is None if paginationV2 is absent
    html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
    items, _, _ = extract_search_results(html)
    # process items...

Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests. For bulk scraping, add a 1–2 s delay between requests to be safe.
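
For bulk runs, the loop and the suggested delay can be wrapped in a generator. The sketch below is one possible shape, not a tested production paginator: iter_search_pages, fetch_page, page_cap, and the injectable sleep are all hypothetical names, and the demo uses a fake fetcher so it runs without network access.

```python
import time

def iter_search_pages(fetch_page, delay_s=1.5, page_cap=None, sleep=time.sleep):
    """Yield (page_number, items) for each results page.

    fetch_page(page) must return (items, total, max_page), the same tuple
    shape as extract_search_results.
    """
    page = 1
    last_page = 1
    while page <= last_page:
        if page > 1:
            sleep(delay_s)  # pause between requests, not before the first
        items, _total, max_page = fetch_page(page)
        if page == 1:
            last_page = max_page or 1  # max_page may be None
            if page_cap is not None:
                last_page = min(last_page, page_cap)
        if not items:
            break  # stop early on an empty page
        yield page, items
        page += 1

# Fake fetcher standing in for fetch_walmart + extract_search_results.
def fake_fetch(page):
    pages = {1: ["a", "b"], 2: ["c"], 3: ["d"]}
    return pages.get(page, []), 4, 3

for page, items in iter_search_pages(fake_fetch, delay_s=0):
    print(page, len(items))
# 1 2
# 2 1
# 3 1
```

With the helpers defined earlier, fetch_page could be a small function that formats the page number into the search URL and returns extract_search_results(fetch_walmart(url)).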


Product Detail Page

URL pattern

code
https://www.walmart.com/ip/{slug}/{usItemId}

The slug is ignored in routing — only the numeric usItemId matters. These work identically:

code
https://www.walmart.com/ip/anything/19717318352
https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352

__NEXT_DATA__ path on a product page

code
data.props.pageProps.initialData.data
  .product        — core product object
  .idml           — long description, specs, highlights, warranty
  .reviews        — rating breakdown + first 10 customer reviews (SSR)

Full extractor (field-tested)

python
def extract_product_detail(html):
    """
    Returns a dict with all confirmed product fields.
    idml.specifications returns all spec rows as a flat dict.
    reviews returns the SSR-rendered first 10 customer reviews.
    """
    data = parse_next_data(html)
    d = data["props"]["pageProps"]["initialData"]["data"]
    product = d["product"]
    idml    = d.get("idml") or {}
    reviews = d.get("reviews") or {}

    pi = product.get("priceInfo") or {}
    cp = pi.get("currentPrice") or {}
    img = product.get("imageInfo") or {}
    avail = product.get("availabilityStatusV2") or {}

    specs = {
        spec.get("name"): spec.get("value")
        for spec in (idml.get("specifications") or [])
    }

    all_images = [
        img_item.get("url")
        for img_item in (img.get("allImages") or [])
        if img_item.get("url")
    ]

    customer_reviews = [
        {
            "title":    r.get("reviewTitle"),
            "rating":   r.get("rating"),           # int 1-5 (field is "rating", NOT "overallRating")
            "text":     r.get("reviewText"),
            "author":   r.get("userNickname"),
            "date":     r.get("reviewSubmissionTime"),
        }
        for r in (reviews.get("customerReviews") or [])
    ]

    return {
        # identity
        "usItemId":            product.get("usItemId"),
        "name":                product.get("name"),
        "brand":               product.get("brand"),
        "model":               product.get("model"),
        "upc":                 product.get("upc"),
        # price
        "price":               cp.get("price"),            # float, e.g. 599
        "priceString":         cp.get("priceString"),      # "$599.00"
        "wasPrice":            (pi.get("wasPrice") or {}).get("priceString"),
        "savings":             (pi.get("savings") or {}).get("savingsString"),
        # availability
        "availability":        avail.get("value"),         # "IN_STOCK" / "OUT_OF_STOCK"
        "availabilityDisplay": avail.get("display"),       # "In stock"
        # ratings
        "averageRating":       product.get("averageRating"),
        "numberOfReviews":     product.get("numberOfReviews"),
        # text
        "shortDescription":    product.get("shortDescription"),
        "longDescription":     idml.get("longDescription"),  # HTML string
        # media
        "thumbnailUrl":        img.get("thumbnailUrl"),
        "allImages":           all_images,          # up to 10 image URLs
        # specs
        "specifications":      specs,               # {"Brand": "Apple", "Processor": "A18 Pro", ...}
        "highlights":          [                    # top highlighted specs with icons
            {"name": h.get("name"), "value": h.get("value")}
            for h in (idml.get("productHighlights") or [])
        ],
        # URL (tracking params stripped, as for search results)
        "canonicalUrl":        "https://www.walmart.com" + (product.get("canonicalUrl") or "").split("?")[0],
        # fulfillment
        "fulfillmentOptions":  product.get("fulfillmentOptions") or [],
        # reviews (SSR-rendered, first 10)
        "reviewSummary": {
            "averageOverallRating":    reviews.get("averageOverallRating"),
            "totalReviewCount":        reviews.get("totalReviewCount"),
            "reviewsWithTextCount":    reviews.get("reviewsWithTextCount"),
            "recommendedPercentage":   reviews.get("recommendedPercentage"),
        },
        "customerReviews":     customer_reviews,
    }


# Usage
url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
html = fetch_walmart(url)
product = extract_product_detail(html)

# Example output (confirmed live):
# product["name"]         → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
# product["price"]        → 599
# product["priceString"]  → "$599.00"
# product["availability"] → "IN_STOCK"
# product["model"]        → "MHFD4LL/A"
# product["upc"]          → "195950852745"
# len(product["specifications"])  → 29 spec rows
# len(product["allImages"])       → 10
# product["specifications"]["Processor"] → "A18 Pro"

Field notes (confirmed)

  • averageRating / numberOfReviews on the product node: present for items with reviews. New/few-review items may return None for both.
  • reviewSummary.averageOverallRating in the reviews node often differs slightly from product.averageRating — the reviews node is more precise (e.g. 4.75 vs 4.8).
  • customerReviews (SSR): always the first 10 reviews. The per-review rating field is "rating" (int 1–5), not "overallRating" (which is always None).
  • longDescription: raw HTML string including <ul>/<li> tags. Strip tags before display.
  • specifications: flat dict — confirmed 29–31 rows for electronics. Key names use display labels (e.g. "RAM memory", "Screen size", "HD capacity").
  • wasPrice / savings on detail page: same as search — None when item is not discounted.
  • No JSON-LD: Walmart product pages do not include <script type="application/ld+json">. All structured data lives in __NEXT_DATA__.
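
For the longDescription note, tags can be stripped with the standard library alone. This is a minimal sketch: html_to_text and the set of tags treated as line breaks are choices made here for illustration, and the sample fragment is synthetic.

```python
from html.parser import HTMLParser

# Tags that should start a new line in the plain-text output (assumed set).
BLOCK_TAGS = {"p", "br", "li", "ul", "ol", "div"}

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc.
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(fragment):
    """Strip tags from an HTML fragment, one line per block element."""
    if not fragment:
        return ""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)

sample = "<p>Thin &amp; light.</p><ul><li>13-inch display</li><li>A18 Pro chip</li></ul>"
print(html_to_text(sample))
# Thin & light.
# 13-inch display
# A18 Pro chip
```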

Anti-Bot: PerimeterX

Walmart uses PerimeterX (app ID PXu6b0qd2S, confirmed in runtimeConfig.perimeterX).

| Signal | Detail |
|---|---|
| Bot detector | PerimeterX |
| Challenge page | "Robot or human?" (200 OK, 15 KB HTML) |
| Triggered by | Full Chrome UA string; client-library UAs (curl, python-requests) |
| Bypassed by | User-Agent: Mozilla/5.0 (bare prefix only) |
| JS execution | Not needed: the SSR response is complete, so there is no JS challenge to solve |

Detection in code:

python
if "Robot or human" in html:
    raise RuntimeError("PerimeterX challenge — switch to browser harness")

If http_get starts returning the challenge after a run of successful fetches, switch to the browser harness (see below).
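
One way to wire that switch into code is a wrapper that tries the HTTP path first and only calls the browser path when the challenge appears. This is a sketch under assumptions: fetch_with_fallback is hypothetical, http_fetch and browser_fetch are caller-supplied callables (for example fetch_walmart and a function built on the harness), and the demo uses fakes so it runs offline.

```python
CHALLENGE_MARKER = "Robot or human"

def is_challenge(html):
    """True if the HTML looks like the PerimeterX challenge page."""
    return CHALLENGE_MARKER in html

def fetch_with_fallback(url, http_fetch, browser_fetch):
    """Return (html, source), where source is "http" or "browser"."""
    try:
        html = http_fetch(url)
        blocked = is_challenge(html)
    except RuntimeError:
        # fetch_walmart-style helpers raise RuntimeError on the challenge.
        # Note this also catches unrelated RuntimeErrors from the fetcher.
        blocked = True
    if not blocked:
        return html, "http"
    html = browser_fetch(url)
    if is_challenge(html):
        raise RuntimeError(f"Challenge persisted in browser: {url}")
    return html, "browser"

# Fake fetchers for an offline demonstration.
def ok_http(url):
    return "<html>ok</html>"

def blocked_http(url):
    return "<html>Robot or human?</html>"

def raising_http(url):
    raise RuntimeError("PerimeterX challenge triggered")

def fake_browser(url):
    return "<html>from browser</html>"

print(fetch_with_fallback("u", ok_http, fake_browser)[1])       # http
print(fetch_with_fallback("u", blocked_http, fake_browser)[1])  # browser
print(fetch_with_fallback("u", raising_http, fake_browser)[1])  # browser
```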


Browser Harness Fallback

Use the browser harness when:

  • PerimeterX starts blocking http_get on your IP
  • You need to interact with the page (add to cart, filter UI, infinite scroll)
  • You need variant switching (color/size selectors)
python
# Browser-based search extraction
new_tab("https://www.walmart.com/search?q=laptop")
wait_for_load()
wait(2)  # JS renders product cards after readyState=complete

# Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
import json
nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
if not nd:
    raise RuntimeError("__NEXT_DATA__ not found in page")
data = json.loads(nd)
sr = data["props"]["pageProps"]["initialData"]["searchResult"]
items = []
for stack in sr.get("itemStacks", []):
    items.extend(stack.get("items", []))

Browser selectors (confirmed working for DOM-based extraction)

python
# Product cards on search results page
results = js("""
  Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
    itemId:    el.getAttribute('data-item-id'),
    name:      el.querySelector('[itemprop="name"]')?.innerText?.trim(),
    price:     el.querySelector('[itemprop="price"]')?.getAttribute('content'),
    url:       el.querySelector('a[link-identifier]')?.href,
  })).filter(r => r.itemId)
""")

# If [data-item-id] misses items, fall back to the data-testid container (returns raw text only):
results_alt = js("""
  Array.from(document.querySelectorAll('[data-testid="list-view"]'))
    .map(el => el.innerText.trim())
""")

Prefer __NEXT_DATA__ over DOM selectors even in-browser — the JSON is complete and stable. DOM class names at Walmart are obfuscated and change between deployments.

Session gotcha

Always open Walmart with new_tab() on first visit:

python
new_tab("https://www.walmart.com/search?q=laptop")
wait_for_load()
wait(2)

After that, goto_url() works normally within the same session.


Public API

Walmart's affiliate/partner API (developer.api.walmart.com) requires a registered API key and returns HTTP 403 without one. No unauthenticated public product API is available. For read-only data, the __NEXT_DATA__ SSR approach covers the fields documented above without the official API.


Gotchas

  • UA must be Mozilla/5.0 bare: The full Chrome UA and the client-library UAs (curl, python-requests) hit PerimeterX; of the fuller strings tested, only Safari/17 got through. This is counterintuitive: the shorter, less realistic UA is the one to use.

  • Regex must use id= attribute match: The regex r'<script id="__NEXT_DATA__" type="application/json">...' fails because the actual tag is <script id="__NEXT_DATA__"> without type. Use:

    python
    re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    
  • usItemId can be None: ~5/66 items on a page are non-product ad widgets with no usItemId. Always filter: [i for i in items if i.get("usItemId")].

  • Two itemStacks: Walmart returns two stacks. Iterate over all stacks or you'll miss ~10 items from the second stack.

  • canonicalUrl includes tracking params: Always strip with .split("?")[0].

  • Review field is "rating" not "overallRating": Each customerReviews entry has a "rating" int field (1–5). The "overallRating" field is always None. Don't confuse with product.averageRating (the aggregate float).

  • No JSON-LD on product pages: Zero <script type="application/ld+json"> tags were found. All structured data is in __NEXT_DATA__.

  • longDescription is HTML: Strip tags before text use. May contain promotional/financing copy mixed with real product description.

  • Page sizes vary: Page 1 returned 66 items across 2 stacks; page 2 returned 55. Do not assume a fixed items-per-page count.

  • http_get default already sends Mozilla/5.0: helpers.http_get() uses "User-Agent": "Mozilla/5.0" by default — no override needed when calling it directly. Only pass a custom headers= if you need to change something else.

  • developer.api.walmart.com returns HTTP 403 without an API key. Not usable for unauthenticated scraping.