Extractor API
Extraction
wpextract.WPExtractor
WPExtractor(
json_root: Path,
scrape_root: Optional[Path] = None,
json_prefix: Optional[str] = None,
translation_pickers: Optional[PickerListType] = None,
)
Manages the extraction of data from a WordPress site.
| PARAMETER | DESCRIPTION |
|---|---|
| json_root | Path to directory of JSON files. TYPE: Path |
| scrape_root | Path to scrape directory |
| json_prefix | Prefix of files in |
| translation_pickers | Supply a custom list of translation pickers. TYPE: Optional[PickerListType] |
link_registry
instance-attribute
link_registry: LinkRegistry = LinkRegistry()
Registry of known URLs and their corresponding data items.
Extraction Data
wpextract.extractors.data.links.Link
dataclass
A link to a URL.
wpextract.extractors.data.links.LinkRegistry
A collection of all known links on the site.
wpextract.extractors.data.links.LinkRegistry.add_linkable
Add a single linkable item to the registry.
The URL is later compared against the list of links that need to be resolved; when it matches, the data type and IDX are returned.
Data types should be unique. IDXes should be unique within one or more data types.
| PARAMETER | DESCRIPTION |
|---|---|
| url | The URL of the destination |
| data_type | A unique identifier for this type of item. |
| idx | A unique identifier within the data type. |
| _refresh_cache | Whether the link cache should be updated. Should be left as True unless multiple links are being added together. |
wpextract.extractors.data.links.LinkRegistry.add_linkables
Add multiple linkable items at once.
| PARAMETER | DESCRIPTION |
|---|---|
| data_type | The data type for all items. |
| links | A list of links. Must be the same length as idxes. |
| idxes | A list of IDs. Must be the same length as links. |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the links and idxes lists are not the same length. |
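The registry behaviour described above can be pictured with a small stand-in. This is a minimal sketch and not wpextract's implementation: `SketchRegistry`, its `resolve` method, the string type used for `idx`, and the example URLs are all hypothetical. Only the method names, parameter names, and the `ValueError` condition are taken from the reference above.

```python
from typing import Optional


class SketchRegistry:
    """Hypothetical stand-in that mimics the documented LinkRegistry methods."""

    def __init__(self) -> None:
        self._items: list[tuple[str, str, str]] = []  # (url, data_type, idx)
        self._cache: dict[str, tuple[str, str]] = {}

    def _rebuild_cache(self) -> None:
        # Assumption: the cache is a plain URL -> (data_type, idx) lookup.
        self._cache = {url: (data_type, idx) for url, data_type, idx in self._items}

    def add_linkable(
        self, url: str, data_type: str, idx: str, _refresh_cache: bool = True
    ) -> None:
        self._items.append((url, data_type, idx))
        if _refresh_cache:
            self._rebuild_cache()

    def add_linkables(self, data_type: str, links: list[str], idxes: list[str]) -> None:
        if len(links) != len(idxes):
            raise ValueError("links and idxes must be the same length")
        for url, idx in zip(links, idxes):
            # Skip the per-item refresh and rebuild once after the batch.
            self.add_linkable(url, data_type, idx, _refresh_cache=False)
        self._rebuild_cache()

    def resolve(self, url: str) -> Optional[tuple[str, str]]:
        # Hypothetical lookup; not part of the documented API.
        return self._cache.get(url)


registry = SketchRegistry()
registry.add_linkable("https://example.org/about/", "page", "12")
registry.add_linkables(
    "post",
    ["https://example.org/a/", "https://example.org/b/"],
    ["1", "2"],
)
resolved = registry.resolve("https://example.org/b/")
```

The batch method shows why `_refresh_cache` exists: rebuilding a lookup after every single item is wasted work when many items arrive together, so the refresh is deferred until the whole batch has been added.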
wpextract.extractors.data.links.Linkable
dataclass
An item which can be linked to.
Multilingual Extraction
wpextract.parse.translations.LangPicker
LangPicker(page_doc: BeautifulSoup)
Bases: ABC
Abstract class of a language picker style.
Support for a new language picker can be added by creating a new class inheriting from this one.
See Also
- Creating a new picker guide
| PARAMETER | DESCRIPTION |
|---|---|
| page_doc | The document to extract a language picker from. TYPE: BeautifulSoup |
current_language
instance-attribute
The current language of the page, populated by calling LangPicker.set_current_lang within LangPicker.extract.
page_doc
instance-attribute
page_doc: BeautifulSoup = page_doc
The document to extract the language picker from.
root_el
instance-attribute
root_el: Tag
The root element of the language picker, populated if LangPicker.matches is successful.
translations
instance-attribute
translations: list[TranslationLink] = []
A list of translation links, populated by calling LangPicker.add_translation within LangPicker.extract.
add_translation
extract
abstractmethod
Extract the current language and translations from the doc.
Instead of selecting directly on root_el, consider using the helper methods _root_select and _root_select_one to extract elements. These are equivalent to calling select or select_one directly, but raise a formatted error if the element is not found. Don't use these methods when finding no results is an expected outcome, e.g. a post may have no translations.
If using other selectors, you can construct the exception using the helper _build_extraction_fail_err.
| RAISES | DESCRIPTION |
|---|---|
| ExtractionFailedError | If the picker is unable to find an element it expects to be present. |
get_root
abstractmethod
Retrieve the root element of the translation picker.
Using the LangPicker.page_doc attribute (a bs4.BeautifulSoup object representing the whole page), the root element of the picker should be found and returned.
| RETURNS | DESCRIPTION |
|---|---|
| Optional[Tag] | The root element, or None if this picker is not found on the page. |
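The division of work between get_root and extract can be sketched without wpextract or BeautifulSoup installed. Everything below is an illustrative stand-in under stated assumptions: `SketchPicker` and `ListSwitcherPicker` are hypothetical classes, the standard library's `xml.etree.ElementTree` is swapped in for BeautifulSoup, the markup is invented, the `(href, lang)` arguments to `add_translation` are assumed because the reference does not show that signature, and the built-in `LookupError` stands in for ExtractionFailedError.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Optional

# Invented, well-formed markup so ElementTree can parse it.
PAGE = """
<html><body>
  <ul class="lang-switcher">
    <li class="current" lang="en-GB">English</li>
    <li><a href="https://example.org/fr/about/" lang="fr-FR">Français</a></li>
  </ul>
</body></html>
"""


class SketchPicker(ABC):
    """Hypothetical stand-in for the documented LangPicker interface."""

    def __init__(self, page_doc: ET.Element) -> None:
        self.page_doc = page_doc
        self.root_el: Optional[ET.Element] = None
        self.current_language: Optional[str] = None
        self.translations: list[tuple[str, str]] = []  # (href, lang)

    def matches(self) -> bool:
        # Assumption: a picker matches when get_root finds its root element.
        self.root_el = self.get_root()
        return self.root_el is not None

    def set_current_lang(self, lang: str) -> None:
        self.current_language = lang

    def add_translation(self, href: str, lang: str) -> None:
        # Assumed arguments; the reference does not list this signature.
        self.translations.append((href, lang))

    @abstractmethod
    def get_root(self) -> Optional[ET.Element]: ...

    @abstractmethod
    def extract(self) -> None: ...


class ListSwitcherPicker(SketchPicker):
    def get_root(self) -> Optional[ET.Element]:
        # Returns None when this picker style is absent from the page.
        return self.page_doc.find(".//ul[@class='lang-switcher']")

    def extract(self) -> None:
        current = self.root_el.find("li[@class='current']")
        if current is None:
            # Stand-in for ExtractionFailedError: an expected element is missing.
            raise LookupError("expected li.current inside the picker root")
        self.set_current_lang(current.get("lang"))
        # Zero translations is a valid outcome, so no error is raised here.
        for link in self.root_el.findall("li/a"):
            self.add_translation(link.get("href"), link.get("lang"))


picker = ListSwitcherPicker(ET.fromstring(PAGE))
if picker.matches():
    picker.extract()
```

In this sketch get_root only answers whether the picker is present, while extract assumes the root exists and treats a missing required element as an error. A real picker would subclass LangPicker and select with CSS selectors on a BeautifulSoup document instead.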
set_current_lang
set_current_lang(lang: str) -> None
Set the language of this doc.
| PARAMETER | DESCRIPTION |
|---|---|
| lang | The locale string. TYPE: str |
_root_select_one
Helper to extract an element from the root element.
| PARAMETER | DESCRIPTION |
|---|---|
| selector | a CSS selector to be passed to |
| RAISES | DESCRIPTION |
|---|---|
| ExtractionFailedError | If the element was not found. This indicates this picker was activated when it should not have been. |
| RETURNS | DESCRIPTION |
|---|---|
| Tag | The element found by the selector. |
_root_select
Helper to extract elements from the root element.
| PARAMETER | DESCRIPTION |
|---|---|
| selector | a CSS selector to be passed to |
| RAISES | DESCRIPTION |
|---|---|
| ExtractionFailedError | If no matching elements were found. This indicates this picker was activated when it should not have been. |
| RETURNS | DESCRIPTION |
|---|---|
| Sequence[Tag] | The elements found by the selector. |
_build_extraction_fail_err
_build_extraction_fail_err(
selector: str,
) -> ExtractionFailedError
Create an error for when an expected element is missing.
| PARAMETER | DESCRIPTION |
|---|---|
| selector | A string describing the attempted selection criteria. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| ExtractionFailedError | An instance of the exception to be raised. |
wpextract.parse.translations.PickerListType
module-attribute
PickerListType = list[type[LangPicker]]
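Note that the alias describes a list of picker classes, not picker instances. The sketch below illustrates why that shape is convenient; it is a hypothetical stand-in, not wpextract code. `BasePicker`, the two subclasses, the string-based `matches` check, and `first_matching` are invented for illustration, and the "try each class in order, first match wins" behaviour is an assumption rather than something the reference states.

```python
from typing import Optional


class BasePicker:
    """Hypothetical stand-in for LangPicker; takes a plain string as the page."""

    marker = ""

    def __init__(self, page: str) -> None:
        self.page = page

    def matches(self) -> bool:
        # Simplified check: the picker is present if its marker appears.
        return self.marker in self.page


class ListStylePicker(BasePicker):
    marker = 'class="lang-item"'


class DropdownStylePicker(BasePicker):
    marker = '<select name="lang"'


# Classes, not instances: a fresh picker object is built for every page.
pickers: list[type[BasePicker]] = [ListStylePicker, DropdownStylePicker]


def first_matching(
    page: str, candidates: list[type[BasePicker]]
) -> Optional[BasePicker]:
    # Assumed strategy: instantiate each class in order, return the first match.
    for picker_cls in candidates:
        picker = picker_cls(page)
        if picker.matches():
            return picker
    return None


found = first_matching('<select name="lang"><option>fr</option></select>', pickers)
```

Because each picker holds per-page state (the document, the current language, the translations), passing classes lets the caller create a clean instance for every page rather than reusing one object.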
wpextract.parse.translations.TranslationLink
dataclass
TranslationLink(
text: Optional[str],
href: Optional[str],
destination: Optional[Linkable],
lang: Union[str, Language],
)
Bases: ResolvableLink
A link to an alternative version of this article in a different language.