Getting Started

Installation

# Install the package
pip install intelliscraper-core

# Install Playwright browser (Chromium)
playwright install chromium

Note

Playwright does not ship with browser binaries; they are installed separately. The command above installs Chromium, which IntelliScraper requires.

Basic Scraping

import asyncio
from intelliscraper import AsyncScraper, ScrapStatus

async def main():
    async with AsyncScraper() as scraper:
        response = await scraper.scrape("https://example.com")

        if response.status == ScrapStatus.SUCCESS:
            print(f"HTTP {response.http_status_code}")
            print(f"Time: {response.elapsed_time:.2f}s")
            print(response.scrap_html_content[:500])

asyncio.run(main())

Session-Based Scraping

For sites that require authentication:

1. Capture a Session

intelliscraper-session \
    --url "https://example.com" \
    --site "example" \
    --output "./example_session.json"

This opens a browser. Log in, then press Enter. The session data (cookies, localStorage, fingerprint) is saved to the JSON file given by --output.
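
Before using a captured session, it can help to confirm that the file exists and parses. The following is a minimal sketch, not part of IntelliScraper: session_keys is a hypothetical helper, and it assumes only that the file holds a JSON object (which Session(**json.load(f)) in the next step also relies on). It makes no assumption about which keys the file contains.

```python
import json
from pathlib import Path
from typing import List


def session_keys(path: str) -> List[str]:
    """Return the sorted top-level keys of a captured session file.

    Hypothetical helper: it does not validate the session, it only
    confirms that the file exists and holds a JSON object.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(
            f"No session file at {file}; run the capture step first"
        )
    data = json.loads(file.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Expected the session file to hold a JSON object")
    return sorted(data)
```

Calling session_keys("./example_session.json") after the capture step lists whatever top-level fields were saved.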

2. Use the Session

import asyncio
import json
from intelliscraper import AsyncScraper, Session, ScrapStatus

async def main():
    with open("example_session.json") as f:
        session = Session(**json.load(f))

    async with AsyncScraper(session_data=session) as scraper:
        response = await scraper.scrape("https://example.com/dashboard")

        if response.status == ScrapStatus.SUCCESS:
            print(f"Session: {response.session_id}")
            print(response.scrap_html_content[:500])

asyncio.run(main())

Important

Sessions maintain internal time-series statistics (timestamps, statuses) that help you analyse rate limits and performance. Excessive concurrency may cause failures, so scale up gradually.
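
One way to scale gradually is to cap how many scrapes run at once and raise the cap only while results stay healthy. The sketch below uses only the standard library. run_bounded and fake_scrape are hypothetical names, and fake_scrape is a stand-in for a coroutine that would call scraper.scrape(url); nothing here is IntelliScraper's own API.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    worker: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    max_concurrency: int = 2,
) -> List[R]:
    """Run worker(item) for every item with at most max_concurrency in flight."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_scrape(url: str) -> str:
    # Stand-in for `await scraper.scrape(url)`; sleeps instead of using the network.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    print(asyncio.run(run_bounded(fake_scrape, urls, max_concurrency=2)))
```

Starting with a small max_concurrency and increasing it between runs keeps any failures easy to attribute to the change in load.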

Response Model

Every scrape() and batch_scrape() call returns a ScrapeResponse:

Field               Type            Description
------------------  --------------  --------------------------------------------------
scrape_request      ScrapeRequest   Original request parameters
status              ScrapStatus     Outcome: SUCCESS, PARTIAL_SUCCESS, FAILED, RATE_LIMITED, BLOCKED, TIMEOUT
http_status_code    int | None      HTTP status from the server (200, 403, 429, etc.)
elapsed_time        float | None    Scrape duration in seconds
scrap_html_content  str | None      Raw HTML from the page
error_msg           str | None      Error message on failure
session_id          str | None      Session site identifier
browser_mode        str | None      "local_browser" or "managed_browser"
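
Because most fields are optional and status has six outcomes, calling code usually branches on status before touching the HTML. The sketch below shows one possible policy. Status and Result are stand-ins that mirror the names in the table so the example runs without the library, and next_action with its three labels is a hypothetical illustration, not behaviour IntelliScraper prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Stand-in mirroring the ScrapStatus names above; not the library's enum.
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILED = "failed"
    RATE_LIMITED = "rate_limited"
    BLOCKED = "blocked"
    TIMEOUT = "timeout"


@dataclass
class Result:
    # Stand-in carrying a subset of the ScrapeResponse fields listed above.
    status: Status
    http_status_code: Optional[int] = None
    scrap_html_content: Optional[str] = None
    error_msg: Optional[str] = None


def next_action(result: Result) -> str:
    """Return "use", "retry_later", or "give_up" (hypothetical policy labels)."""
    if result.status in (Status.SUCCESS, Status.PARTIAL_SUCCESS):
        # The HTML field is optional, so check it before using the content.
        return "use" if result.scrap_html_content else "give_up"
    if result.status in (Status.RATE_LIMITED, Status.TIMEOUT):
        # Assumed to be transient: back off and try again later.
        return "retry_later"
    # FAILED and BLOCKED are treated as final in this sketch.
    return "give_up"
```

With the real library, the same branching would read response.status and response.scrap_html_content directly; which outcomes count as retryable is a choice for the calling code.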