Content Parsers
IntelliScraper provides an extensible parser hierarchy for extracting structured data from scraped HTML.
Parser Hierarchy
BaseParser (ABC)
└── HTMLParser (general purpose)
    └── YourCustomParser (extend for your site)
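To make the shape of this hierarchy concrete, here is a minimal sketch of an abstract base with one concrete subclass. The class names `SketchBaseParser` and `SketchHTMLParser`, and the choice of `text` as the abstract member, are hypothetical stand-ins for illustration; they are not IntelliScraper's actual classes or interface, and the text extraction uses the standard library rather than whatever the library uses internally.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser as StdlibHTMLParser


class SketchBaseParser(ABC):
    """Hypothetical abstract base: stores the inputs, defers extraction."""

    def __init__(self, url: str, html: str) -> None:
        self.url = url
        self.html = html

    @property
    @abstractmethod
    def text(self) -> str:
        """Plain text of the page; subclasses decide how to produce it."""


class _TextCollector(StdlibHTMLParser):
    """Collects non-empty text nodes while the HTML is fed through."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


class SketchHTMLParser(SketchBaseParser):
    """Hypothetical general-purpose subclass implementing the abstract member."""

    @property
    def text(self) -> str:
        collector = _TextCollector()
        collector.feed(self.html)
        collector.close()
        return "\n".join(collector.chunks)


page = SketchHTMLParser(url="https://example.com", html="<h1>Hi</h1><p>There</p>")
print(page.text)
```

Because the base class declares an abstract member, instantiating it directly raises `TypeError`; only subclasses that implement every abstract member can be created.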
HTMLParser
The default parser for extracting text, links, and Markdown from any HTML page.
from intelliscraper.parsers import HTMLParser

# html_content is the raw HTML string of the page (for example, from a scrape response)
parser = HTMLParser(url="https://example.com", html=html_content)

# Plain text
print(parser.text)

# All absolute URLs from <a> tags
print(parser.links)

# Classified internal/external links with metadata
print(parser.navigable_links)

# Full Markdown
print(parser.markdown)

# Cleaned Markdown optimised for LLM input
print(parser.markdown_for_llm)
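To illustrate what "absolute URLs" and "internal/external" classification can mean in practice, the following is a rough standard-library sketch. The helper names `absolute_links` and `classify_link` are hypothetical, and the rule used here (same host means internal) is an assumption; IntelliScraper's own logic may differ.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url: str, hrefs: list) -> list:
    """Resolve hrefs against base_url, keep HTTP/HTTPS only, drop duplicates."""
    seen = {}  # dict preserves insertion order, so first occurrence wins
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            seen.setdefault(absolute, None)
    return list(seen)


def classify_link(base_url: str, link: str) -> str:
    """Label a link 'internal' when its host matches the page's host."""
    same_host = urlparse(link).netloc == urlparse(base_url).netloc
    return "internal" if same_host else "external"


base = "https://example.com/"
hrefs = ["/about", "https://example.com/about", "mailto:hi@example.com", "https://other.org/"]
links = absolute_links(base, hrefs)
labels = {link: classify_link(base, link) for link in links}
print(labels)
```

The relative `/about` and its absolute form collapse into one entry, and the `mailto:` link is dropped because its scheme is not HTTP or HTTPS.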
Properties
| Property | Type | Description |
|---|---|---|
| `text` | | Plain text with newline separators |
| `links` | | Deduplicated absolute HTTP/HTTPS URLs |
| `navigable_links` | | Classified links with href, text, title, link_type, rel |
| `markdown` | | Full-page Markdown (standard preprocessing) |
| `markdown_for_llm` | | Cleaned Markdown (nav, ads, forms stripped) |
All properties are lazily computed and cached on first access.
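The lazy-and-cached behaviour can be pictured with `functools.cached_property`, the same decorator the custom parser example on this page uses. This is only a sketch of the pattern: `LazyDemo` and its counter are hypothetical, and whether `HTMLParser` uses this exact mechanism internally is an assumption.

```python
from functools import cached_property


class LazyDemo:
    """Hypothetical class showing compute-on-first-access caching."""

    def __init__(self, html: str) -> None:
        self.html = html
        self.compute_count = 0  # counts how often the expensive work runs

    @cached_property
    def text(self) -> str:
        self.compute_count += 1
        return self.html.upper()  # stand-in for real parsing work


demo = LazyDemo("<p>hi</p>")
count_before = demo.compute_count  # nothing computed at construction time
first = demo.text                  # computed here, on first access
second = demo.text                 # served from the cache
print(count_before, demo.compute_count)
```

The practical consequence is that constructing a parser is cheap, and properties you never read cost nothing.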
Creating Custom Parsers
Extend HTMLParser to add site-specific extraction logic:
from functools import cached_property

from intelliscraper.parsers import HTMLParser


class ProductPageParser(HTMLParser):
    """Parser for an e-commerce product page."""

    @cached_property
    def product_title(self) -> str | None:
        tag = self.soup.select_one("h1.product-title")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def price(self) -> str | None:
        tag = self.soup.select_one("span.price")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def description(self) -> str | None:
        tag = self.soup.select_one("div.product-description")
        return tag.get_text(separator="\n", strip=True) if tag else None
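Since `price` comes back as a raw string, a custom parser will often want a numeric value as well. The helper below is a hypothetical sketch, not part of IntelliScraper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point); other locales would need different handling.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Extract the first number from a price string such as '$1,299.99'."""
    if raw is None:
        return None
    # One digit, then digits or commas, then an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


amount = parse_price("$1,299.99")
missing = parse_price("Call for price")
print(amount, missing)
```

A function like this could back an additional `cached_property` on the custom parser, keeping the raw string available alongside the parsed value.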
Usage
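import asyncio

# The top-level import path is an assumption; adjust it to match your installed version.
from intelliscraper import AsyncScraper, ScrapStatus


async def main() -> None:
    async with AsyncScraper() as scraper:
        response = await scraper.scrape("https://shop.example.com/product/123")
        # Only parse the page when the scrape succeeded.
        if response.status == ScrapStatus.SUCCESS:
            parser = ProductPageParser(
                url=response.scrape_request.url,
                html=response.scrap_html_content,
            )
            print(f"Title: {parser.product_title}")
            print(f"Price: {parser.price}")


asyncio.run(main())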