AsyncScraper

The main entry point for IntelliScraper.

Async web scraper with rate limiting and concurrent page support.

The AsyncScraper class orchestrates browser backends, rate limiting, and page management to scrape one or many URLs concurrently.

Two browser modes are supported:

Local browser (use_local_browser=True): Connects to your running Chrome instance via CDP. All existing cookies and logins are available immediately.
Managed browser (default): Launches a fresh Chromium instance with fingerprint spoofing, proxy, and session data.

Example:

import asyncio
from intelliscraper import AsyncScraper, ScrapStatus

async def main():
    async with AsyncScraper(max_concurrent_pages=4) as scraper:
        response = await scraper.scrape("https://example.com")
        if response.status == ScrapStatus.SUCCESS:
            print(response.scrap_html_content)

asyncio.run(main())

class intelliscraper.scraper.AsyncScraper(headless=True, browser_launch_options={'args': ['--disable-blink-features=AutomationControlled', '--disable-dev-shm-usage', '--no-sandbox', '--disable-setuid-sandbox', '--disable-accelerated-2d-canvas', '--no-first-run', '--no-zygote', '--disable-gpu', '--disable-web-security'], 'chromium_sandbox': False, 'headless': False}, proxy=None, session_data=None, browsing_mode=None, max_concurrent_pages=1, use_local_browser=False, max_requests_per_minute=None)

Bases: object

Async web scraper with rate limiting and concurrent page support.

Orchestrates browser backends, page pools, rate limiting, and human-like browsing behaviour to scrape URLs concurrently.

Parameters:

headless (bool) – Run the browser without a UI. Applies only in managed browser mode; ignored when use_local_browser=True. Defaults to True.
browser_launch_options (dict) – Custom Chromium launch options. Only applies in managed browser mode. Defaults to BROWSER_LAUNCH_OPTIONS.
proxy (Proxy | ProxyProvider | None) – Proxy configuration or ProxyProvider instance. Only applies in managed browser mode. Defaults to None.
session_data (Session | None) – Pre-authenticated session with cookies, localStorage, sessionStorage, and browser fingerprint. Only applies in managed browser mode. Defaults to None.
browsing_mode (BrowsingMode | None) – Behaviour mode: FAST (no human simulation) or HUMAN_LIKE (scrolling, delays). Auto-determined if None. Defaults to None.
max_concurrent_pages (int) – Number of pages to use for concurrent scraping. Defaults to 1.
use_local_browser (bool) – If True, connect to an existing Chrome instance via CDP instead of launching a new browser. Defaults to False.
max_requests_per_minute (int | None) – Rate limit shared across all pages. None or 0 disables rate limiting. Defaults to None.

Note

__init__ only sets configuration. Call initialize() or use the async context manager to start the browser:

async with AsyncScraper() as scraper:
    result = await scraper.scrape(url)
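
When `async with` is inconvenient, calling initialize() directly and pairing it with close() in a try/finally block is a common pattern. The sketch below uses a hypothetical stand-in class, FakeScraper, so that it runs without a browser. It assumes, without confirmation from the source, that the context manager simply calls initialize() on entry and close() on exit.

```python
import asyncio


class FakeScraper:
    """Hypothetical stand-in with the same lifecycle methods as documented above."""

    def __init__(self):
        self.events = []

    async def initialize(self):
        self.events.append("initialize")

    async def close(self):
        self.events.append("close")

    async def scrape(self, url):
        self.events.append(f"scrape {url}")
        return url

    # Assumed behaviour: enter -> initialize(), exit -> close().
    async def __aenter__(self):
        await self.initialize()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()


async def manual_lifecycle(scraper, url):
    """Explicit equivalent of `async with scraper: await scraper.scrape(url)`."""
    await scraper.initialize()
    try:
        return await scraper.scrape(url)
    finally:
        # Runs even if scrape() raises, so resources are always released.
        await scraper.close()


async def context_lifecycle(scraper, url):
    async with scraper as s:
        return await s.scrape(url)


if __name__ == "__main__":
    manual = FakeScraper()
    asyncio.run(manual_lifecycle(manual, "https://example.com"))
    print(manual.events)
```

Under that assumption, both forms produce the same sequence of calls; the try/finally keeps close() from being skipped when scraping fails.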

async initialize()

Initialise browser and create the page pool.

Dispatches to the configured backend to start the browser, then creates max_concurrent_pages pages and a semaphore.

Return type:

None

async close()

Close browser and release all resources.

Behaviour depends on the backend:

Local browser: Only closes pages opened by the scraper. The Chrome process and context are left running.
Managed browser: Closes pages, context, and browser.

Return type:

None

async scrape(url, timeout=datetime.timedelta(seconds=30))

Scrape content from a single URL.

The semaphore ensures only max_concurrent_pages requests run simultaneously. Pages are selected from the pool using round-robin. Rate limiting is applied before navigation.
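
The combination of a semaphore and round-robin page selection can be sketched with the standard library alone. This is a minimal illustration and not IntelliScraper's implementation: PagePool, its attributes, and the string "pages" are hypothetical names introduced here.

```python
import asyncio
import itertools


class PagePool:
    """Hypothetical pool: a semaphore caps concurrency, a cycle picks pages in turn."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._cycle = itertools.cycle(self._pages)  # round-robin iterator
        self._semaphore = asyncio.Semaphore(len(self._pages))
        self.active = 0  # jobs currently running
        self.peak = 0    # highest concurrency observed

    async def run(self, job):
        # At most len(pages) jobs get past this point at the same time.
        async with self._semaphore:
            page = next(self._cycle)
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job(page)
            finally:
                self.active -= 1


async def demo():
    pool = PagePool(["page-0", "page-1"])

    async def job(page):
        await asyncio.sleep(0.01)  # stands in for navigation
        return page

    used = await asyncio.gather(*(pool.run(job) for _ in range(6)))
    return used, pool.peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that this simplified sketch does not track which page is idle, so a real pool would need extra bookkeeping if jobs can finish out of order.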

Parameters:

url (str) – Target URL to scrape.
timeout (timedelta) – Maximum time to wait for page load. Defaults to 30 seconds.

Returns:

A ScrapeResponse with status, HTML content, HTTP status code, timing, and metadata.

Return type:

ScrapeResponse

Example:

import asyncio
from intelliscraper import AsyncScraper

async def main():
    async with AsyncScraper(max_concurrent_pages=4) as scraper:
        tasks = [
            scraper.scrape("https://example1.com"),
            scraper.scrape("https://example2.com"),
        ]
        results = await asyncio.gather(*tasks)

asyncio.run(main())

async batch_scrape(urls, timeout=datetime.timedelta(seconds=30))

Scrape multiple URLs with rate limiting and concurrency control.

This is the recommended API for scraping large numbers of URLs. Rate limiting (via max_requests_per_minute) is applied before each request, and the page-pool semaphore controls concurrency.

Parameters:

urls (list[str]) – List of target URLs to scrape.
timeout (timedelta) – Maximum time per page load. Defaults to 30 seconds.

Returns:

List of ScrapeResponse objects, one per URL, in the same order as the input URLs.

Return type:

list[ScrapeResponse]

Example:

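import asyncio
from intelliscraper import AsyncScraper

async def main():
    async with AsyncScraper(
        max_concurrent_pages=4,
        max_requests_per_minute=900,  # 15/sec
    ) as scraper:
        results = await scraper.batch_scrape(
            urls=[f"https://example.com/page/{i}" for i in range(100)]
        )
        # Results arrive in the same order as the input URLs.
        for result in results:
            print(result.scrape_request.url, result.status)

asyncio.run(main())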
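
The ordering guarantee described for batch_scrape (one response per URL, in input order) can be illustrated with asyncio.gather, which returns results in the order its awaitables were passed, regardless of completion order. Whether batch_scrape uses gather internally is not stated in the source; fake_scrape and the delays below are hypothetical stand-ins.

```python
import asyncio


async def fake_scrape(url, delay):
    """Hypothetical stand-in for a scrape call; sleeps instead of loading a page."""
    await asyncio.sleep(delay)
    return url


async def demo():
    urls = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ]
    delays = [0.03, 0.01, 0.02]  # the first URL is the slowest
    finished = []  # records completion order

    async def tracked(url, delay):
        result = await fake_scrape(url, delay)
        finished.append(result)
        return result

    # gather() keeps input order even though tasks finish at different times.
    results = await asyncio.gather(
        *(tracked(u, d) for u, d in zip(urls, delays))
    )
    return urls, results, finished


if __name__ == "__main__":
    urls, results, finished = asyncio.run(demo())
    print(results == urls, finished)
```

This is why results can be paired with their inputs via zip(urls, results) without any extra bookkeeping.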