Domain skill
capterra
Markdown synced from browser-harness domain skills.
- Host: capterra
- Files: 1
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `capterra` domain skill from `agent-workspace/domain-skills/capterra/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/capterra/scraping.md

Use those domain-skill notes to complete my task for `capterra` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
Scraping & Data Extraction
scraping.md
Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests.
Do this first
Use User-Agent: ClaudeBot — Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. No browser needed.
Capterra serves a fully structured Markdown representation of every page to AI bots (ClaudeBot, GPTBot, PerplexityBot, Anthropic-AI are all listed as Allow: / in robots.txt). In testing, Anthropic-AI nevertheless returned 403, so send ClaudeBot. The Markdown format is far easier to parse than HTML.
With the default Mozilla/5.0 UA (or any realistic browser UA), Capterra returns HTTP 403 with Cf-Mitigated: challenge — Cloudflare blocks all browser UA requests. There is no bypass via HTTP; those pages require a real browser session.
from helpers import http_get
import re, json

# Works everywhere:
html = http_get(
    "https://www.capterra.com/p/135003/Slack/reviews/",
    headers={"User-Agent": "ClaudeBot"}
)

# Extract overall rating and review count from the Markdown header line "4.7 (24059)"
m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
print(m.group(1), m.group(2))  # 4.7 24059
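The snippet above raises `AttributeError` when the header line is absent (for example on an error page). A minimal sketch of a guarded parser, assuming the header keeps the `4.7 (24059)` shape; the function name `parse_rating_header` and the sample text are hypothetical and not part of the skill:

```python
import re
from typing import Optional, Tuple

# Same pattern as above: "<rating> (<count>)" alone on a line.
HEADER_RE = re.compile(r"^([\d.]+)\s+\(([\d,]+)\)$", re.MULTILINE)

def parse_rating_header(markdown: str) -> Optional[Tuple[float, int]]:
    """Return (rating, review_count), or None if the header line is missing."""
    m = HEADER_RE.search(markdown)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2).replace(",", ""))

# Made-up sample text shaped like the reviews page header.
sample = "Slack Reviews\n\n4.7 (24059)\n\nShowing 1-25 of 24059 Reviews\n"
print(parse_rating_header(sample))
print(parse_rating_header("Not found"))
```

Returning `None` lets callers distinguish "page fetched but header missing" from a real rating without wrapping every call in `try`/`except`.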
Fastest approach: product summary in one call
All key metrics — overall rating, review count, sub-ratings, pagination — come from the /reviews/ endpoint in a single request.
from helpers import http_get
import re, json

def get_product_summary(product_id, slug):
    """
    Returns overall rating, review count, sub-ratings.
    product_id: Capterra numeric ID (e.g. 135003)
    slug: URL slug (e.g. 'Slack')
    """
    url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
    html = http_get(url, headers={"User-Agent": "ClaudeBot"})
    result = {"product_id": product_id, "slug": slug}

    # Overall rating + review count from header line "4.7 (24059)"
    m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
    if m:
        result["overall_rating"] = float(m.group(1))
        result["review_count"] = int(m.group(2).replace(",", ""))

    # Page size and total pages from "Showing 1-25 of 24059 Reviews"
    showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
    if showing:
        result["per_page"] = int(showing.group(2))
        result["total_pages"] = (int(showing.group(3).replace(",", "")) + 24) // 25

    # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4"
    lines = html.split("\n")
    for i, line in enumerate(lines):
        for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]:
            if line.strip() == label:
                # The value sits a few lines below the label; take the first valid float
                for j in range(i + 1, min(i + 5, len(lines))):
                    try:
                        val = float(lines[j].strip())
                        if 0 < val <= 5.0:
                            result[key] = val
                            break
                    except ValueError:
                        pass
    return result

summary = get_product_summary(135003, "Slack")
print(json.dumps(summary, indent=2))
# {
#   "product_id": 135003,
#   "slug": "Slack",
#   "overall_rating": 4.7,
#   "review_count": 24059,
#   "per_page": 25,
#   "total_pages": 963,
#   "ease_of_use": 4.6,
#   "customer_service": 4.4
# }
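The pagination arithmetic can be checked in isolation. The sketch below derives the current page and total page count from the "Showing X-Y of Z Reviews" line, assuming the fixed page size of 25 used above; `pagination_from_showing` is a hypothetical helper name:

```python
import math
import re

SHOWING_RE = re.compile(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews")

def pagination_from_showing(markdown, per_page=25):
    """Return total reviews, current page and total pages, or None if not found."""
    m = SHOWING_RE.search(markdown)
    if m is None:
        return None
    first = int(m.group(1))
    total = int(m.group(3).replace(",", ""))
    return {
        "total": total,
        # Reviews 1-25 are page 1, 26-50 are page 2, and so on.
        "page": (first - 1) // per_page + 1,
        "total_pages": math.ceil(total / per_page),
    }

print(pagination_from_showing("Showing 26-50 of 24059 Reviews"))
```

Deriving the page from the first index rather than the last keeps the result correct on a short final page.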
Common workflows
Get reviews (paginated)
25 reviews per page. Use ?page=N for pagination.
from helpers import http_get
import re

def get_reviews_page(product_id, slug, page=1):
    """
    Returns up to 25 reviews for one page.
    Total pages = ceil(review_count / 25).
    """
    url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
    html = http_get(url, headers={"User-Agent": "ClaudeBot"})

    # Total review count from header
    m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
    total = int(m.group(2).replace(",", "")) if m else 0

    # Showing X-Y of Z
    showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)

    # Split by review title markers "### "Title""
    blocks = re.split(r'\n### "', html)
    reviews = []
    for block in blocks[1:]:
        r = {}
        # Title (up to closing quote)
        t = re.match(r'([^"]+)"', block)
        if t:
            r["title"] = t.group(1).strip()
        # Date
        d = re.search(
            r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}",
            block
        )
        if d:
            r["date"] = d.group(0)
        # Overall rating for this review (first float 1.0–5.0 between blank lines)
        rm = re.search(r"\n\n([\d.]+)\n\n", block)
        if rm:
            val = float(rm.group(1))
            if 1.0 <= val <= 5.0:
                r["rating"] = val
        # Pros
        pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
        if pros:
            r["pros"] = pros.group(1).strip()
        # Cons
        cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
        if cons:
            r["cons"] = cons.group(1).strip()
        if r.get("title"):
            reviews.append(r)

    return {
        "total": total,
        "page": page,
        "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None,
        "reviews": reviews,
    }

# Page 1
result = get_reviews_page(135003, "Slack", page=1)
print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}")
# Total reviews: 24059, this page: 25
print(result["reviews"][0])
# {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0,
#  'pros': '...', 'cons': '...'}
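The `date` field above is a plain string such as `April 14, 2026`. If reviews need to be sorted or filtered by date, a small conversion step helps. This is a sketch, assuming the default English/C locale (the `%B` directive is locale-dependent); `parse_review_date` is a hypothetical name and the two review dicts are made-up samples:

```python
from datetime import date, datetime

def parse_review_date(text: str) -> date:
    """Convert 'April 14, 2026' into a datetime.date."""
    return datetime.strptime(text, "%B %d, %Y").date()

# Made-up review dicts in the shape returned by get_reviews_page.
reviews = [
    {"title": "A", "date": "April 14, 2026"},
    {"title": "B", "date": "March 2, 2026"},
]
reviews.sort(key=lambda r: parse_review_date(r["date"]))
print([r["title"] for r in reviews])
```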
Scrape all reviews in bulk (parallel)
10 pages in ~2s with 5 workers. No rate limiting observed during testing.
from helpers import http_get
import re
from concurrent.futures import ThreadPoolExecutor

UA = {"User-Agent": "ClaudeBot"}

def _fetch_page(args):
    product_id, slug, page = args
    url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
    html = http_get(url, headers=UA)
    blocks = re.split(r'\n### "', html)
    reviews = []
    for block in blocks[1:]:
        r = {}
        t = re.match(r'([^"]+)"', block)
        if t: r["title"] = t.group(1).strip()
        d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block)
        if d: r["date"] = d.group(0)
        rm = re.search(r"\n\n([\d.]+)\n\n", block)
        if rm:
            val = float(rm.group(1))
            if 1.0 <= val <= 5.0: r["rating"] = val
        pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
        if pros: r["pros"] = pros.group(1).strip()
        cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
        if cons: r["cons"] = cons.group(1).strip()
        if r.get("title"): reviews.append(r)
    return reviews

def get_all_reviews(product_id, slug, max_pages=None, workers=5):
    """Fetch all reviews in parallel. max_pages=None fetches everything."""
    # First: get total pages
    summary_html = http_get(
        f"https://www.capterra.com/p/{product_id}/{slug}/reviews/",
        headers=UA
    )
    m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE)
    total = int(m.group(2).replace(",", "")) if m else 0
    total_pages = (total + 24) // 25

    # Never request pages past the last one, even if max_pages is larger
    last_page = min(max_pages, total_pages) if max_pages else total_pages
    tasks = [(product_id, slug, p) for p in range(1, last_page + 1)]

    all_reviews = []
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # ex.map yields results in page order
        for batch in ex.map(_fetch_page, tasks):
            all_reviews.extend(batch)
    return all_reviews

# Fetch first 50 reviews (2 pages) in parallel
reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2)
print(f"Fetched {len(reviews)} reviews")
# Fetched 50 reviews
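With `ex.map`, the first page that raises aborts the result loop, so everything after it is lost. For large crawls it can be useful to collect failures instead. The sketch below takes the per-page fetcher as a parameter so it can be exercised without network access; `fetch_pages` and `fake_fetch` are hypothetical names, and the simulated failure is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_pages(fetch_page, pages, workers=5):
    """Run fetch_page(page) in parallel; return (reviews in page order, failed pages)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = {ex.submit(fetch_page, p): p for p in pages}
        for fut in as_completed(futures):
            page = futures[fut]
            try:
                results[page] = fut.result()
            except Exception:
                # Record the page so it can be retried later.
                failed.append(page)
    ordered = [review for p in sorted(results) for review in results[p]]
    return ordered, sorted(failed)

def fake_fetch(page):
    # Stand-in for a real page fetcher; page 3 fails on purpose.
    if page == 3:
        raise RuntimeError("simulated failure")
    return [f"review-{page}-{i}" for i in range(2)]

reviews, failed = fetch_pages(fake_fetch, range(1, 5))
print(len(reviews), failed)
```

The failed page numbers can be passed back into `fetch_pages` for a second attempt.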
Get a product's full overview (rating breakdown, sentiment, pricing)
from helpers import http_get
import re, json

def get_product_overview(product_id, slug):
    """Rating breakdown, sentiment, starting price from the product page."""
    url = f"https://www.capterra.com/p/{product_id}/{slug}/"
    html = http_get(url, headers={"User-Agent": "ClaudeBot"})
    result = {}

    # Overall rating and review count from the reviews section
    # Appears as "\n4.7\n\nBased on 24,059 reviews\n"
    m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html)
    if m:
        result["overall_rating"] = float(m.group(1))
        result["review_count"] = int(m.group(2).replace(",", ""))

    # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"
    breakdown = re.findall(r'\b([1-5])\((\d+)\)', html)
    if breakdown:
        result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown if 1 <= int(s) <= 5}

    # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%"
    for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]:
        sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html)
        if sm:
            result[key] = int(sm.group(1))

    # Starting price ("Starting price\n\n$8.75\n\nPer User")
    pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html)
    if pm:
        result["starting_price_usd"] = float(pm.group(1))

    # Categories ("What is X used for?" links)
    cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000])
    if cats:
        result["categories"] = [name for name, _ in cats]

    # Sub-ratings from product page
    for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]:
        sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html)
        if sub:
            try:
                val = float(sub.group(1))
                if 0 < val <= 5.0:
                    result[key] = val
            except ValueError:
                pass
    return result

overview = get_product_overview(135003, "Slack")
print(json.dumps(overview, indent=2))
# {
#   "overall_rating": 4.7,
#   "review_count": 24059,
#   "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48},
#   "sentiment_positive": 96,
#   "sentiment_neutral": 4,
#   "sentiment_negative": 1,
#   "starting_price_usd": 8.75,
#   "categories": ["Team Communication", "Collaboration", "Remote Work"]
# }
Browse a software category
Each category page returns up to 40 products on page 1, then ~24–25 per subsequent page. Pagination works via ?page=N.
from helpers import http_get
import re

def get_category_products(category_slug, page=1):
    """
    List products in a Capterra category.
    category_slug examples: 'project-management-software', 'crm-software', 'accounting-software'
    Full list: https://www.capterra.com/categories/
    """
    url = f"https://www.capterra.com/{category_slug}/"
    if page > 1:
        url = f"https://www.capterra.com/{category_slug}/?page={page}"
    html = http_get(url, headers={"User-Agent": "ClaudeBot"})

    # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)
    raw = re.findall(
        r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)',
        html
    )
    # Product names from "Learn more about X" links
    names = {pid: name for name, pid in re.findall(
        r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html
    )}

    # De-duplicate by product ID; fall back to the slug when no name link exists
    items, seen = [], set()
    for rating, review_count, pid, slug in raw:
        if pid not in seen:
            seen.add(pid)
            items.append({
                "product_id": int(pid),
                "name": names.get(pid, slug),
                "slug": slug,
                "overall_rating": float(rating),
                "review_count": int(review_count.replace(",", "")),
                "product_url": f"https://www.capterra.com/p/{pid}/{slug}/",
                "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
            })
    return items

products = get_category_products("project-management-software", page=1)
for p in products[:3]:
    print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)")
# monday.com: 4.6 (5732 reviews)
# Jira: 4.4 (15325 reviews)
# Celoxis: 4.4 (327 reviews)
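To walk an entire category, the per-page function can be wrapped in a loop. The notes above do not say what an out-of-range page returns, so this sketch assumes a page that contributes no new product IDs marks the end; `iter_category` is a hypothetical helper, and the fetcher is injected so the loop can be tested offline with made-up data:

```python
def iter_category(fetch_page, max_pages=50):
    """Yield unique products across pages until a page adds nothing new."""
    seen = set()
    for page in range(1, max_pages + 1):
        added = 0
        for item in fetch_page(page):
            if item["product_id"] in seen:
                continue  # already listed on an earlier page
            seen.add(item["product_id"])
            added += 1
            yield item
        if added == 0:
            break  # assumption: no new IDs means we are past the last page

# Made-up pages standing in for get_category_products(slug, page).
fake_pages = {
    1: [{"product_id": 1}, {"product_id": 2}],
    2: [{"product_id": 2}, {"product_id": 3}],
}
ids = [p["product_id"] for p in iter_category(lambda n: fake_pages.get(n, []))]
print(ids)
```

With the real fetcher this would be called as `iter_category(lambda n: get_category_products("crm-software", n))`; the `max_pages` cap guards against a site that keeps returning the same listings.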
Get all 1000+ software categories
from helpers import http_get
import re

def get_all_categories():
    """Returns list of {name, slug} for all ~1003 Capterra software categories."""
    html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"})
    cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html)
    return [{"name": name, "slug": slug} for name, slug in cats]

categories = get_all_categories()
print(f"{len(categories)} categories")  # 1003
print(categories[:3])
# [{'name': 'AB Testing', 'slug': 'ab-testing-software'},
#  {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...]
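Because the site's search endpoint is not usable with ClaudeBot, one workable substitute is a local keyword filter over the category list. A minimal sketch; `find_categories` is a hypothetical helper, and the sample list is a hand-written subset in the shape returned by `get_all_categories()`:

```python
def find_categories(categories, keyword):
    """Case-insensitive substring match on category name or slug."""
    needle = keyword.lower()
    return [
        c for c in categories
        if needle in c["name"].lower() or needle in c["slug"].lower()
    ]

# Hand-written subset for illustration only.
sample_categories = [
    {"name": "AB Testing", "slug": "ab-testing-software"},
    {"name": "Absence Management", "slug": "absence-management-software"},
    {"name": "Project Management", "slug": "project-management-software"},
]
print([c["slug"] for c in find_categories(sample_categories, "management")])
```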
URL patterns
| Page type | URL pattern |
|---|---|
| Product overview | https://www.capterra.com/p/{id}/{Slug}/ |
| Product reviews | https://www.capterra.com/p/{id}/{Slug}/reviews/ |
| Reviews page N | https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N} |
| Reviews (alt) | https://www.capterra.com/reviews/{id}/{Slug}/ |
| Category listing | https://www.capterra.com/{category}-software/ |
| Category page N | https://www.capterra.com/{category}-software/?page={N} |
| All categories | https://www.capterra.com/categories/ |
| Product pricing | https://www.capterra.com/p/{id}/{Slug}/pricing/ |
| Product alternatives | https://www.capterra.com/p/{id}/{Slug}/alternatives/ |
| Compare A vs B | https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b} |
Finding a product's ID: Look in the URL of any product listing in a category page. The pattern https://www.capterra.com/p/{id}/{Slug}/reviews/ appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. Slack, not slack).
Product IDs are stable numeric identifiers. Note that the same software vendor may have multiple product IDs under different names/versions. Always find the ID from a category search rather than guessing.
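The URL patterns above can be turned into a small pair of helpers for extracting an ID and slug from a link and for building review URLs. This is a sketch under the patterns listed in the table; `parse_product_url` and `reviews_url` are hypothetical names:

```python
import re
from urllib.parse import urlparse

# Matches /p/{id}/{Slug}/ optionally followed by reviews/, pricing/, etc.
PRODUCT_PATH_RE = re.compile(r"^/p/(\d+)/([^/]+)(?:/|$)")

def parse_product_url(url):
    """Return (product_id, slug) from a Capterra product URL, or None."""
    m = PRODUCT_PATH_RE.match(urlparse(url).path)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)  # slug kept verbatim: it is case-sensitive

def reviews_url(product_id, slug, page=1):
    """Build the reviews URL for a given page."""
    base = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
    return base if page == 1 else f"{base}?page={page}"

print(parse_product_url("https://www.capterra.com/p/147657/monday-com/reviews/"))
print(reviews_url(135003, "Slack", page=2))
```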
Anti-bot measures
- Cloudflare is active on all routes (`Server: cloudflare`, `CF-RAY` present in all response headers).
- Browser UAs (Chrome, Firefox, Safari) return HTTP 403 with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass.
- `ClaudeBot` UA bypasses Cloudflare and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature.
- Other AI bot UAs that also work: `GPTBot`, `PerplexityBot` (also in the `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403; use `ClaudeBot` instead.
- The search endpoint (`/search/?q=...`) returns empty results via ClaudeBot because the query parameter is not passed through. Use category browsing or direct product URLs instead.
- No CAPTCHA observed during testing with ClaudeBot.
- No rate limiting observed: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly.
- The Markdown response has no JSON-LD and no `__NEXT_DATA__`; these are HTML-only structures. The Markdown format is simpler to parse.
- Disallowed paths (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot.
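The status and header signals listed above can be folded into one classification step. Whether `http_get` exposes the status code and response headers is not stated here, so this sketch takes them as plain arguments; `classify_response` and its labels are hypothetical:

```python
def classify_response(status, headers):
    """Map a status code and headers to a coarse outcome label."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    if status == 200:
        return "ok"
    if status == 403 and lowered.get("cf-mitigated", "").lower() == "challenge":
        return "cloudflare_challenge"  # browser UA or otherwise challenged request
    if status == 403:
        return "forbidden"  # e.g. a path disallowed in robots.txt
    if status == 404:
        return "not_found"  # stale product ID or wrong slug case
    return "other"

print(classify_response(403, {"Cf-Mitigated": "challenge", "Server": "cloudflare"}))
```

A `cloudflare_challenge` result usually means the `User-Agent` header was not set to `ClaudeBot`, which is worth checking before retrying.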
Gotchas
- Old Capterra product IDs may be invalid. The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot; this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them.
- Slug is case-sensitive. `Slack` works; `slack` returns 404. The slug is always in the category listing data.
- Response is Markdown, not HTML. `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly.
- `http_get` default UA is `Mozilla/5.0`, which returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly.
- Reviews page vs product page: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) has the same number buried deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable.
- Category page 1 is larger than subsequent pages: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings.
- Reviewer name is present in the text but not cleanly delimited: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting.
- Sub-rating labels in reviews page: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S'); match them exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money".
- `rating_breakdown` pattern caveat: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`.
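One way to act on the `rating_breakdown` caveat is to match the five counts only when they appear as a consecutive 5-to-1 run, which stray feature ratings will not form. A sketch, assuming the block keeps the whitespace-separated shape shown above; `extract_rating_breakdown` is a hypothetical name and the sample text is made up:

```python
import re

# Five consecutive "N(count)" tokens in descending star order.
BREAKDOWN_RE = re.compile(
    r"(?<!\d)5\((\d+)\)\s+4\((\d+)\)\s+3\((\d+)\)\s+2\((\d+)\)\s+1\((\d+)\)"
)

def extract_rating_breakdown(markdown):
    """Return {stars: count} from the 5..1 run, or None if it is absent."""
    m = BREAKDOWN_RE.search(markdown)
    if m is None:
        return None
    return {star: int(count) for star, count in zip((5, 4, 3, 2, 1), m.groups())}

# Made-up text: a stray "4(12)" feature rating precedes the real block.
sample = (
    "Features\n\n4(12)\n\nFilter by rating\n\n"
    "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)\n"
)
breakdown = extract_rating_breakdown(sample)
print(breakdown, sum(breakdown.values()))
```

Comparing the sum of the counts with the overall review count gives a cheap sanity check that the right block was matched.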
When to use the browser instead
The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if:
- You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard).
- You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login/`).
- You need to simulate a logged-in user session with Capterra credentials.
For read-only scraping of product data, reviews, and category listings, http_get with ClaudeBot UA is both faster and more reliable than a browser.
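The robots.txt condition can be checked up front from the URL alone. The sketch below uses the disallowed paths listed under anti-bot measures and matches on whole path segments, so a category slug that merely starts with "search" is not misclassified; `needs_browser` is a hypothetical helper, and the `search-engine-software` URL in the usage lines is an invented example:

```python
from urllib.parse import urlparse

# Disallowed prefixes from robots.txt, written without trailing slashes.
DISALLOWED = (
    "/search", "/ppc/clicks", "/sem-b", "/sem-compare-b",
    "/workspace", "/auth/login",
)

def needs_browser(url):
    """True if the URL path falls under a robots.txt-disallowed prefix."""
    path = urlparse(url).path.rstrip("/")
    # Match the prefix itself or anything nested beneath it.
    return any(path == p or path.startswith(p + "/") for p in DISALLOWED)

print(needs_browser("https://www.capterra.com/workspace/"))
print(needs_browser("https://www.capterra.com/p/135003/Slack/reviews/"))
print(needs_browser("https://www.capterra.com/search-engine-software/"))
```

This covers only the path-based condition; interaction and logged-in sessions still have to be decided by the task itself.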