
Domain skill

news-aggregation

Markdown synced from browser-harness domain skills.

Host
news-aggregation
Files
1

Agent prompt

Use this skill

Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.

Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are.

Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `news-aggregation` domain skill from `agent-workspace/domain-skills/news-aggregation/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/news-aggregation/multi-source.md

Use those domain-skill notes to complete my task for `news-aggregation` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.

Skill contents

What the agent will read

Multi-Source

multi-source.md


Field-tested against TechCrunch, The Verge, Ars Technica, BBC, Guardian, Wired, NPR, HN, Reuters, CNN, NYT (2026-04-18).

Lead with RSS — fastest and most reliable

For every site that has a feed, http_get + XML parsing is faster and more reliable than a browser. Use ThreadPoolExecutor for parallel fetches.

Confirmed working RSS feeds (tested):

| Source | Feed URL | Format | Items | Fetch time |
|---|---|---|---|---|
| TechCrunch | https://techcrunch.com/feed/ | RSS 2.0 | 20 | ~0.08s |
| Ars Technica | https://feeds.arstechnica.com/arstechnica/index | RSS 2.0 | 20 | ~0.10s |
| BBC News | http://feeds.bbci.co.uk/news/rss.xml | RSS 2.0 | 37 | ~0.23s |
| The Guardian (World) | https://www.theguardian.com/world/rss | RSS 2.0 | 45 | ~0.11s |
| The Guardian (Tech) | https://www.theguardian.com/technology/rss | RSS 2.0 | 32 | ~0.25s |
| Wired | https://www.wired.com/feed/rss | RSS 2.0 | 50 | ~0.10s |
| NPR Top Stories | https://feeds.npr.org/1001/rss.xml | RSS 2.0 | 10 | ~0.14s |
| Hacker News | https://news.ycombinator.com/rss | RSS 2.0 | 30 | ~0.16s |
| CNN Top Stories | http://rss.cnn.com/rss/cnn_topstories.rss | RSS 2.0 | 69 | ~0.25s |
| NYT Homepage | https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml | RSS 2.0 | 23 | ~0.12s |
| The Verge | https://www.theverge.com/rss/index.xml | Atom | 10 | ~0.15s |

Parallel fetch pattern (4.3x speedup measured)

Sequential fetch of 7 feeds: 0.70s. Parallel fetch of same 7 feeds: 0.16s (4.3x speedup).

python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

RSS_FEEDS = [
    ("TechCrunch",     "https://techcrunch.com/feed/"),
    ("Ars Technica",   "https://feeds.arstechnica.com/arstechnica/index"),
    ("BBC News",       "http://feeds.bbci.co.uk/news/rss.xml"),
    ("Guardian World", "https://www.theguardian.com/world/rss"),
    ("Wired",          "https://www.wired.com/feed/rss"),
    ("NPR",            "https://feeds.npr.org/1001/rss.xml"),
    ("Hacker News",    "https://news.ycombinator.com/rss"),
]

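def fetch_rss(name_url):
    """Fetch one feed and return (name, list of <item> elements)."""
    name, url = name_url
    xml_data = http_get(url)          # browser-harness helper from helpers.py
    root = ET.fromstring(xml_data)
    items = root.findall('.//item')   # RSS 2.0 only; an Atom feed yields []
    return name, items

# The work is I/O-bound, so one thread per feed overlaps the network waits.
with ThreadPoolExecutor(max_workers=len(RSS_FEEDS)) as ex:
    results = list(ex.map(fetch_rss, RSS_FEEDS))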

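for name, items in results:
    for item in items[:5]:
        # findtext() returns the default instead of raising when a tag is missing
        title = item.findtext('title', default='')
        link  = item.findtext('link', default='')
        print(f"[{name}] {title} {link}")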
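
One caveat with `ex.map`: an exception raised in any worker surfaces when the results are consumed, so a single unreachable feed can abort the whole batch. The following is a minimal sketch of per-feed error isolation. `fetch_all` and its `fetch` parameter are hypothetical names introduced here; the assumption is that you would pass the harness's `http_get` as `fetch`.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def fetch_all(feeds, fetch, max_workers=8):
    """Fetch (name, url) pairs in parallel and return (results, errors).

    `fetch` is any callable url -> XML text (assumed: the harness's http_get).
    """
    def worker(name_url):
        name, url = name_url
        try:
            root = ET.fromstring(fetch(url))
            return name, root.findall('.//item'), None
        except Exception as exc:  # network error, HTTP error, malformed XML
            return name, [], exc

    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        for name, items, exc in ex.map(worker, feeds):
            if exc is None:
                results[name] = items
            else:
                errors[name] = exc
    return results, errors
```

Calling `fetch_all(RSS_FEEDS, http_get)` would then give the parsed feeds in `results` and the failures, keyed by feed name, in `errors`.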

The Verge requires Atom namespace parsing

The Verge's feed is Atom, not RSS 2.0, so the naive .//item selector returns 0 items. The title element carries a type="html" attribute, but its .text still contains the plain string.

python
import xml.etree.ElementTree as ET

NS = {'atom': 'http://www.w3.org/2005/Atom'}

xml_data = http_get("https://www.theverge.com/rss/index.xml")
root = ET.fromstring(xml_data)
entries = root.findall('.//atom:entry', NS)   # 10 entries

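for e in entries:
    # Guard against missing elements rather than calling .text / .get on None
    title = e.findtext('atom:title', default='', namespaces=NS)
    link_el = e.find('atom:link', NS)
    link = link_el.get('href') if link_el is not None else None
    print(title, link)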

Avoid the inline form root.findall('.//{http://www.w3.org/2005/Atom}entry'): it works, but the explicit NS dict approach above is cleaner and easier to reuse. Do NOT call .text on a find() result without checking for None first (the naive RSS path hit this on The Verge).
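
To avoid choosing a selector per source, both formats can be handled in one function. This is a sketch only: `parse_feed` is a hypothetical helper, not part of browser-harness. It tries the RSS 2.0 path first and falls back to the Atom namespace, with the None checks described above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}


def parse_feed(xml_data):
    """Return a list of (title, link) tuples from RSS 2.0 or Atom XML."""
    root = ET.fromstring(xml_data)
    out = []
    # RSS 2.0: <item><title>..</title><link>..</link></item>
    for item in root.findall('.//item'):
        title = (item.findtext('title') or '').strip()
        link = (item.findtext('link') or '').strip()
        out.append((title, link))
    if out:
        return out
    # Atom: <entry><title>..</title><link href=".."/></entry>
    for entry in root.findall('.//atom:entry', ATOM_NS):
        title = (entry.findtext('atom:title', default='', namespaces=ATOM_NS) or '').strip()
        link_el = entry.find('atom:link', ATOM_NS)
        link = link_el.get('href', '') if link_el is not None else ''
        out.append((title, link))
    return out
```

Missing titles or links come back as empty strings, so callers can filter them without catching AttributeError.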

Sites that block http_get entirely

Reuters returns HTTP 403/Forbidden for all http_get calls, even with a real browser User-Agent header. Use browser fallback (see below).

code
Reuters: ERROR HTTP Error 403: HTTP Forbidden   # with AND without User-Agent

Reuters's old RSS feeds (feeds.reuters.com/reuters/topNews) resolve to DNS NXDOMAIN — they have been shut down.
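
Both failure modes can be told apart in code so the caller knows whether a browser fallback is worth trying. The sketch below assumes the fetch helper raises `urllib.error.HTTPError` / `urllib.error.URLError` on failure, as urllib-based helpers typically do; that is an assumption about `http_get`, not something these notes confirm. `fetch_or_flag` is a hypothetical name.

```python
import urllib.error


def fetch_or_flag(url, fetch):
    """Return ('ok', body), ('browser', reason) or ('dead', reason).

    `fetch` is any callable url -> text (assumed: the harness's http_get,
    raising urllib errors on failure).
    """
    try:
        return 'ok', fetch(url)
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403):
            # Server is up but refuses plain HTTP clients: try the browser.
            return 'browser', f'HTTP {exc.code}'
        raise
    except urllib.error.URLError as exc:
        # Host unreachable, e.g. DNS failure (NXDOMAIN): a browser will not help.
        return 'dead', f'unreachable: {exc.reason}'
```

`HTTPError` is caught first because it is a subclass of `URLError`.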

Sites that work fine with http_get + User-Agent

NYT, Guardian, HN, CNN all return full HTML via http_get without issues. The User-Agent header (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36) is not required for these but doesn't hurt.

python
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
html = http_get("https://www.nytimes.com", headers=headers)  # 1.1MB, works
html = http_get("https://news.ycombinator.com")               # 34KB, works without UA

HN parsing via regex (no HTML parser needed):

python
import re
html = http_get("https://news.ycombinator.com")
stories = re.findall(r'class="titleline"><a href="([^"]+)"[^>]*>([^<]+)<', html)
# Returns list of (url, title) tuples — 30 stories on the front page
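
The raw tuples from that regex may still contain HTML entities, and self posts (Ask HN and similar) are assumed here to use relative hrefs such as `item?id=...`. A small sketch of normalizing both follows; `parse_hn` is a hypothetical helper name.

```python
import re
from html import unescape
from urllib.parse import urljoin

HN_BASE = "https://news.ycombinator.com/"
TITLELINE = re.compile(r'class="titleline"><a href="([^"]+)"[^>]*>([^<]+)<')


def parse_hn(html_text):
    """Return [{'title', 'url'}] with entities decoded and URLs made absolute."""
    stories = []
    for href, title in TITLELINE.findall(html_text):
        stories.append({
            "title": unescape(title).strip(),
            # urljoin leaves absolute URLs untouched and resolves relative ones
            "url": urljoin(HN_BASE, unescape(href)),
        })
    return stories
```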

Browser extraction — use when RSS is unavailable

BBC (bbc.com/news)

No consent banner appeared in the headless browser (a US-region page was served; the GDPR banner only appears for EU IPs). Headlines match the article h2 selector.

python
goto_url("https://www.bbc.com/news")
wait_for_load()
wait(2)

headlines = js("""
  Array.from(document.querySelectorAll('article h2'))
    .map(h => ({
      title: h.innerText.trim(),
      url: h.closest('a')?.href || h.closest('[href]')?.href || 
           h.parentElement.querySelector('a')?.href
    }))
    .filter(h => h.title.length > 10)
""")
# Returns 50+ articles. First one is typically LIVE/breaking.

If running from an EU IP and a consent banner appears:

python
accept = js("""
  var btns = Array.from(document.querySelectorAll('button'));
  var btn = btns.find(b => /accept all|agree|continue/i.test(b.innerText));
  if (btn) { btn.click(); return 'clicked: ' + btn.innerText; }
  return 'no banner';
""")

Confirmed: h3 elements on BBC are site-chrome labels ("The BBC is in multiple languages"), NOT article headlines. Use article h2 only.
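
The URL fallback chain in the BBC extraction can come back empty for some headings, and the same story may appear in more than one slot on the page. A minimal post-processing sketch follows, assuming `js()` returns a list of dicts with `title` and `url` keys; `clean_headlines` is a hypothetical helper.

```python
def clean_headlines(rows):
    """Drop rows without a title or URL and de-duplicate by URL, keeping order."""
    seen = set()
    cleaned = []
    for row in rows or []:
        url = (row.get("url") or "").strip()
        title = (row.get("title") or "").strip()
        if not url or not title or url in seen:
            continue
        seen.add(url)
        cleaned.append({"title": title, "url": url})
    return cleaned
```

The same helper applies to the other browser extractions in these notes that return `title`/`url` pairs.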

TechCrunch (techcrunch.com)

article and .post-block selectors return 0 results — TechCrunch changed their layout. Articles are in h3 elements.

python
goto_url("https://techcrunch.com")
wait_for_load()
wait(2)

articles = js("""
  Array.from(document.querySelectorAll('h3'))
    .map(h => ({
      title: h.innerText?.trim(),
      url: h.closest('a')?.href || h.querySelector('a')?.href || 
           h.parentElement?.querySelector('a')?.href
    }))
    .filter(a => a.title && a.title.length > 20)
""")
# Returns ~10 articles. RSS is preferred (20 items, no JS required).

RSS is almost always faster for TechCrunch: 0.08s vs 3-5s browser load. Only fall back to browser if you need paywall/subscriber content.

Reuters (reuters.com)

http_get returns 403. Browser loads but the homepage is heavily JS-rendered with delayed hydration. h3 selectors only return nav elements after standard wait_for_load(). Use wait(3) plus scroll:

python
goto_url("https://www.reuters.com")
wait_for_load()
wait(3)
js("window.scrollTo(0, 500)")
wait(1)
# Category links work for topic nav:
links = js("""
  Array.from(document.querySelectorAll('a[href*="/world/"], a[href*="/technology/"]'))
    .filter(a => a.innerText.trim().length > 20)
    .map(a => ({text: a.innerText.trim(), href: a.href}))
""")

Reuters headlines are best obtained from the Guardian or AP: Reuters no longer has a public RSS feed and its JS hydration is slow.
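
If you do need the Reuters page itself, polling for results is one alternative to a fixed wait on a slowly hydrating page. This is a generic sketch, not a browser-harness function: `poll_until` is a hypothetical helper, and `check` would be something like `lambda: js("...")` using the link query above.

```python
import time


def poll_until(check, timeout=8.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the timeout elapses.

    Returns the first truthy result, or None on timeout.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real delays.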

Decision tree: which approach to use

code
Does the site have an RSS/Atom feed?
  YES → http_get + XML parse (fastest, ~0.1s per feed)
         - RSS 2.0: root.findall('.//item')
         - Atom:    root.findall('.//atom:entry', {'atom': 'http://www.w3.org/2005/Atom'})
  NO  → Does http_get return valid HTML (not 403/401/JS shell)?
          YES → http_get + regex/BeautifulSoup (fast, ~0.2-0.3s)
          NO  → goto_url + wait_for_load + wait(2) + js() extraction (slow, 3-8s)
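
The tree above can be expressed as a small dispatcher. This is a sketch under stated assumptions: `detect_feed_format` and `choose_strategy` are hypothetical helpers, and the strategy labels are illustrative strings rather than harness constants.

```python
import xml.etree.ElementTree as ET

ATOM_FEED_TAG = '{http://www.w3.org/2005/Atom}feed'


def detect_feed_format(xml_data):
    """Return 'rss', 'atom', or None if the text is not a recognizable feed."""
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError:
        return None  # not well-formed XML (typical for HTML pages)
    if root.tag == 'rss':
        return 'rss'
    if root.tag == ATOM_FEED_TAG:
        return 'atom'
    return None


def choose_strategy(feed_xml=None, html_ok=False):
    """Pick an extraction path following the decision tree."""
    fmt = detect_feed_format(feed_xml) if feed_xml else None
    if fmt:
        return f'xml:{fmt}'        # http_get + XML parse
    if html_ok:
        return 'http+parse'        # http_get + regex/BeautifulSoup
    return 'browser'               # goto_url + wait_for_load + js()
```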

What to skip

  • Reuters RSS — DNS dead (feeds.reuters.com is NXDOMAIN)
  • Reuters http_get — returns 403 regardless of User-Agent
  • TechCrunch article/.post-block selectors — layout changed, use h3 instead
  • BBC h3 for headlines — those are site-chrome labels; use article h2
  • The Verge .//item selector — feed is Atom, not RSS; use Atom namespace