Custom Content Extraction Plugins

Magellan can extract content from nearly any site it spiders. Often, however, only a small part of a site is relevant to searchers. Indexing only the useful content from each page keeps Whoosh’s index smaller, makes searches faster, and produces fewer irrelevant results.

This is where content extractors come in. Subclass the following class, overriding methods as needed.

class magellan.extractor.BaseExtractor(content)

Extracts titles, content, and URLs from pages crawled in this profile.

static can_handle_url(url, opener)

Determines whether this extractor can handle a given URL. The decision can be based on the URL itself, or the URL opener can be used to examine headers.
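As a rough illustration of the URL-only case, the sketch below decides from the hostname and path whether an extractor applies. The class name, the host docs.example.com, and the /articles/ prefix are hypothetical. The class is deliberately standalone so that it runs without Magellan installed; a real extractor would subclass magellan.extractor.BaseExtractor. The opener argument is accepted but ignored here, since its interface is not described on this page.

```python
from urllib.parse import urlsplit


class ArticleExtractorSketch:
    """Hypothetical stand-in; a real extractor would subclass BaseExtractor."""

    @staticmethod
    def can_handle_url(url, opener):
        # Decide from the URL alone; `opener` is ignored in this sketch.
        parts = urlsplit(url)
        return (
            parts.hostname == "docs.example.com"
            and parts.path.startswith("/articles/")
        )


# Usage: pass None for the opener, because this sketch never touches it.
print(ArticleExtractorSketch.can_handle_url("http://docs.example.com/articles/1", None))
print(ArticleExtractorSketch.can_handle_url("http://docs.example.com/about", None))
```

An extractor that needs to inspect headers would use the opener instead of, or in addition to, the URL check.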

classmethod clean_urls(urls)

content_type = None

classmethod fix_url(url)

Cleans up URLs that contain /../ or /./ segments, along with other minor tweaks. Each .. is removed together with the path component above it, and each . is removed entirely.
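The dot-segment handling described here can be approximated with the standard library. This is a minimal sketch, not Magellan's implementation: the function name normalize_dot_segments is hypothetical, and the "other minor tweaks" mentioned above are not reproduced.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_dot_segments(url):
    """Hypothetical helper: collapse /./ and /../ in a URL's path."""
    parts = urlsplit(url)
    path = parts.path
    if path:
        # normpath drops "." and pops the component above each "..".
        cleaned = posixpath.normpath(path)
        # normpath strips a trailing slash; restore it if the original had one.
        if path.endswith("/") and not cleaned.endswith("/"):
            cleaned += "/"
        path = cleaned
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(normalize_dot_segments("http://example.com/a/b/../c/./d.html?q=1"))
```

Note that posixpath.normpath also collapses repeated slashes inside the path, which may or may not match what fix_url does.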

get_content()

Returns the content of the document in a format suitable for indexing. By default, strips HTML tags and extra whitespace. Override to strip out more superfluous content, such as sidebars, headers, and footers.
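To make the default behaviour concrete, here is a sketch of tag and whitespace stripping using only the standard library. It is an approximation under stated assumptions, not Magellan's code: the names _TextCollector and html_to_text are hypothetical, and skipping the bodies of script and style tags is a choice made for this example.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of script and style tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Join text nodes with spaces, then collapse whitespace runs to one space.
    return re.sub(r"\s+", " ", " ".join(collector.chunks)).strip()


page = "<body><h1>Title</h1>\n<p>Some   <b>bold</b> text.</p><script>var x=1;</script></body>"
print(html_to_text(page))
```

Joining text nodes with a space keeps words in adjacent elements apart, at the cost of splitting a word that is only partly wrapped in an inline tag.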

get_headings()

Headings are indexed a second time, in addition to the normal content, because they are likely important clues to the document’s content. Override if headings are not just h1, h2 or h3 tags.
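The following sketch shows one way to pull out h1, h2 and h3 text with the standard library. The names _HeadingCollector and extract_headings are hypothetical, and returning a list of strings is an assumption; this page does not say what format get_headings returns.

```python
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Collects the text of h1, h2 and h3 elements."""

    HEADING_TAGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._current = None  # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS and self._current is None:
            self._current = []

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current is not None:
            # Join the pieces and normalise internal whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.headings.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def extract_headings(html):
    """Hypothetical helper: return heading texts in document order."""
    collector = _HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


print(extract_headings("<h1>Guide</h1><p>intro</p><h2>Set <em>up</em></h2><h4>skip</h4>"))
```

A site that marks headings differently, for example with a CSS class on div elements, would need a different match condition, which is the case the override is meant for.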

get_title()

Returns the title from the document’s content. Override to trim or otherwise modify the title. Used by the indexer when adding documents to the search index.

classmethod get_urls(content)

soup = None

strip_by_classes(classes)

A helper method for trimming content. Removes elements from the soup that match any class in the provided list.

strip_by_ids(ids)

A helper method for trimming content. Removes elements from the HTML content that match any id in the provided list.

strip_doctype_and_comments()

Removes doctype and HTML comments from HTML content.

strip_script()

Removes all script tags from HTML content.

strip_style()

Removes all style tags from HTML content.

strip_whitespace(content)

Returns content with duplicate whitespace converted to single spaces.
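The whitespace collapsing described here amounts to a single regular-expression substitution. The function name collapse_whitespace is hypothetical, and whether Magellan also trims leading and trailing whitespace is not stated on this page; this sketch does not.

```python
import re


def collapse_whitespace(content):
    """Hypothetical helper: turn each whitespace run into one space."""
    # \s+ matches any run of spaces, tabs, or newlines.
    return re.sub(r"\s+", " ", content)


print(repr(collapse_whitespace("a  b\n\n\tc")))
```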
