AsyncScraper
Async web scraper with rate limiting and concurrent page support.
The AsyncScraper class is the main entry point for IntelliScraper.
It orchestrates browser backends, rate limiting, and page management
to scrape one or many URLs concurrently.
Two browser modes are supported:
- Local browser (use_local_browser=True): Connects to your running Chrome instance via CDP. All existing cookies and logins are available immediately.
- Managed browser (default): Launches a fresh Chromium instance with fingerprint spoofing, proxy, and session data.
Example:
    import asyncio

    from intelliscraper import AsyncScraper, ScrapStatus

    async def main():
        async with AsyncScraper(max_concurrent_pages=4) as scraper:
            response = await scraper.scrape("https://example.com")
            if response.status == ScrapStatus.SUCCESS:
                print(response.scrap_html_content)

    asyncio.run(main())
- class intelliscraper.scraper.AsyncScraper(headless=True, browser_launch_options={'args': ['--disable-blink-features=AutomationControlled', '--disable-dev-shm-usage', '--no-sandbox', '--disable-setuid-sandbox', '--disable-accelerated-2d-canvas', '--no-first-run', '--no-zygote', '--disable-gpu', '--disable-web-security'], 'chromium_sandbox': False, 'headless': False}, proxy=None, session_data=None, browsing_mode=None, max_concurrent_pages=1, use_local_browser=False, max_requests_per_minute=None)
Bases: object
Async web scraper with rate limiting and concurrent page support.
Orchestrates browser backends, page pools, rate limiting, and human-like browsing behaviour to scrape URLs concurrently.
- Parameters:
  - headless (bool) – Run browser without UI. Only applies in managed browser mode. Ignored when use_local_browser=True. Defaults to True.
  - browser_launch_options (dict) – Custom Chromium launch options. Only applies in managed browser mode. Defaults to BROWSER_LAUNCH_OPTIONS.
  - proxy (Proxy | ProxyProvider | None) – Proxy configuration or ProxyProvider instance. Only applies in managed browser mode. Defaults to None.
  - session_data (Session | None) – Pre-authenticated session with cookies, localStorage, sessionStorage, and browser fingerprint. Only applies in managed browser mode. Defaults to None.
  - browsing_mode (BrowsingMode | None) – Behaviour mode: FAST (no human simulation) or HUMAN_LIKE (scrolling, delays). Auto-determined if None. Defaults to None.
  - max_concurrent_pages (int) – Number of pages to use for concurrent scraping. Defaults to 1.
  - use_local_browser (bool) – If True, connect to an existing Chrome instance via CDP instead of launching a new browser. Defaults to False.
  - max_requests_per_minute (int | None) – Rate limit shared across all pages. Set to None or 0 to disable rate limiting. Defaults to None (disabled).
Note
__init__ only sets configuration. Call initialize() or use the async context manager to start the browser:

    async with AsyncScraper() as scraper:
        result = await scraper.scrape(url)
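The split between construction and start-up described in the note can be illustrated with a small stand-in class. This is a sketch of the general async context manager pattern only, not IntelliScraper's implementation: `FakeScraper` and its `started` attribute are hypothetical names, and the real `initialize()` and `close()` do far more than flip a flag.

```python
import asyncio


class FakeScraper:
    """Hypothetical stand-in that mirrors the lifecycle described in the note."""

    def __init__(self, max_concurrent_pages: int = 1) -> None:
        # Configuration only: nothing is started here.
        self.max_concurrent_pages = max_concurrent_pages
        self.started = False

    async def initialize(self) -> None:
        # A real scraper would start the browser and build the page pool here.
        self.started = True

    async def close(self) -> None:
        # A real scraper would release pages and browser resources here.
        self.started = False

    async def __aenter__(self) -> "FakeScraper":
        await self.initialize()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs even when the body of the `async with` block raises.
        await self.close()


async def lifecycle_demo() -> list[bool]:
    states = []
    scraper = FakeScraper()
    states.append(scraper.started)      # constructed, not yet started
    async with scraper:
        states.append(scraper.started)  # started inside the block
    states.append(scraper.started)      # closed again on exit
    return states


lifecycle_states = asyncio.run(lifecycle_demo())
```

The context manager form is usually the safer choice because the cleanup step runs even if scraping fails partway through; calling `initialize()` directly leaves the matching `close()` call to the caller.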
- async initialize()
Initialise browser and create the page pool.
Dispatches to the configured backend to start the browser, then creates max_concurrent_pages pages and a semaphore.
- Return type:
None
- async close()
Close browser and release all resources.
Behaviour depends on the backend:
- Local browser: Only closes pages opened by the scraper. The Chrome process and context are left running.
- Managed browser: Closes pages, context, and browser.
- Return type:
None
- async scrape(url, timeout=datetime.timedelta(seconds=30))
Scrape content from a single URL.
The semaphore ensures that at most max_concurrent_pages requests run simultaneously. Pages are selected from the pool using round-robin. Rate limiting is applied before navigation.
- Parameters:
  - url – URL to scrape.
  - timeout (datetime.timedelta) – Time limit for the scrape. Defaults to timedelta(seconds=30).
- Returns:
A ScrapeResponse with status, HTML content, HTTP status code, timing, and metadata.
- Return type:
ScrapeResponse
Example:
    async with AsyncScraper(max_concurrent_pages=4) as scraper:
        tasks = [
            scraper.scrape("https://example1.com"),
            scraper.scrape("https://example2.com"),
        ]
        results = await asyncio.gather(*tasks)
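The combination of a semaphore and round-robin page selection described for scrape() can be sketched with the standard library alone. Everything here is an assumption made for illustration: `PagePool`, its `run` method, and the string "pages" are hypothetical stand-ins, and the library's internals may be organised differently.

```python
import asyncio
import itertools


class PagePool:
    """Hypothetical pool: a semaphore caps concurrency, a cycle picks pages."""

    def __init__(self, size: int) -> None:
        self.pages = [f"page-{i}" for i in range(size)]  # stand-ins for browser pages
        self._next_page = itertools.cycle(self.pages)    # round-robin iterator
        self._semaphore = asyncio.Semaphore(size)        # at most `size` tasks inside
        self.active = 0
        self.peak_active = 0

    async def run(self, url: str) -> tuple[str, str]:
        async with self._semaphore:
            page = next(self._next_page)
            self.active += 1
            self.peak_active = max(self.peak_active, self.active)
            try:
                await asyncio.sleep(0.01)  # placeholder for navigation
                return page, url
            finally:
                self.active -= 1


async def pool_demo() -> tuple[list[tuple[str, str]], int]:
    pool = PagePool(size=2)
    urls = [f"https://example.com/{i}" for i in range(6)]
    # Six tasks are submitted at once, but only two hold the semaphore at a time.
    results = await asyncio.gather(*(pool.run(u) for u in urls))
    return results, pool.peak_active


pool_results, pool_peak = asyncio.run(pool_demo())
```

In this sketch the pool size and the semaphore limit are the same number, which mirrors the role max_concurrent_pages plays in the description above: extra calls wait for a free slot rather than failing.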
- async batch_scrape(urls, timeout=datetime.timedelta(seconds=30))
Scrape multiple URLs with rate limiting and concurrency control.
This is the recommended API for scraping large numbers of URLs. Rate limiting (via max_requests_per_minute) is applied before each request, and the page-pool semaphore controls concurrency.
- Parameters:
  - urls – URLs to scrape.
  - timeout (datetime.timedelta) – Time limit for each scrape. Defaults to timedelta(seconds=30).
- Returns:
List of ScrapeResponse objects, one per URL, in the same order as the input URLs.
- Return type:
list[ScrapeResponse]
Example:
    async with AsyncScraper(
        max_concurrent_pages=4,
        max_requests_per_minute=900,  # 15/sec
    ) as scraper:
        results = await scraper.batch_scrape(
            urls=[f"https://example.com/page/{i}" for i in range(100)]
        )
        for result in results:
            print(result.scrape_request.url, result.status)
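The arithmetic behind the `# 15/sec` comment, and the guarantee that results come back in input order, can be shown with a minimal sketch. The limiter below spaces request starts evenly at 60 / max_requests_per_minute seconds; this is one possible strategy chosen for illustration, and the library may instead use a sliding window, a token bucket, or something else. `IntervalRateLimiter` and `fetch` are hypothetical names.

```python
import asyncio
import time


class IntervalRateLimiter:
    """Hypothetical limiter that spaces request starts evenly."""

    def __init__(self, max_requests_per_minute: int) -> None:
        self.interval = 60.0 / max_requests_per_minute  # seconds between starts
        self._next_slot = 0.0

    async def wait(self) -> None:
        # No await happens between reading and updating _next_slot, so tasks
        # on a single event loop cannot race here and no lock is needed.
        now = time.monotonic()
        slot = max(now, self._next_slot)        # earliest free start time
        self._next_slot = slot + self.interval  # reserve the following slot
        await asyncio.sleep(slot - now)         # zero when the slot is already free


async def fetch(limiter: IntervalRateLimiter, index: int) -> tuple[int, float]:
    await limiter.wait()            # rate limiting happens before the "request"
    return index, time.monotonic()  # placeholder for the real scrape


async def limiter_demo() -> list[tuple[int, float]]:
    limiter = IntervalRateLimiter(max_requests_per_minute=900)
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(fetch(limiter, i) for i in range(5)))


limiter_interval = IntervalRateLimiter(900).interval  # 60 / 900 seconds
limiter_results = asyncio.run(limiter_demo())
```

With 900 requests per minute the spacing is 1/15 of a second, so five requests take roughly four intervals from first start to last. Because the limit is described as shared across all pages, raising max_concurrent_pages does not raise the request rate; it only lets slow pages overlap.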