autoextract-poet documentation¶
autoextract-poet contains the common item definitions. Such items can be extracted automatically using the Zyte AutoExtract API (you can use scrapy-poet and scrapy-autoextract for this).
The AutoExtract API converts pages into data automatically. It supports multiple types of pages, such as articles, products, real estate, comments, job postings and reviews. The full list of supported page types is available in the AutoExtract API documentation.
See also web-poet for an introduction to the Page Objects paradigm, and the scrapy-poet tutorial for an introduction to using Page Objects with Scrapy spiders.
License is BSD 3-clause.
Introduction¶
Installing autoextract-poet¶
autoextract-poet is a regular PyPI package that can be installed using pip: pip install autoextract-poet. It is also a dependency of scrapy-autoextract, and is installed automatically if you use scrapy-autoextract.
Basic usage¶
You can use items defined by autoextract-poet just as regular Python objects, to standardize item definitions. They are implemented as attr.s classes, and can be used as Scrapy items directly, or converted to dictionaries (e.g. for serialization) via itemadapter. The full list of items can be seen in autoextract_poet.items.
scrapy-autoextract provides an automatic way to extract the items defined here from any website, using Scrapy and the AutoExtract API. See the scrapy-autoextract documentation for more details.
Compatibility with new fields added to the API¶
New fields may be added to the AutoExtract API over time. When you create autoextract-poet items from AutoExtract responses, the library ignores unknown fields by default, until you upgrade to a version of the library that contains the new field.
However, you might want to keep the unknown (new) fields even if you don’t update the autoextract-poet library.
If you’re using Scrapy (or itemadapter), you can expose these unknown attributes in the output by registering AutoExtractAdapter in itemadapter’s ADAPTER_CLASSES:
from autoextract_poet import AutoExtractAdapter
from itemadapter import ItemAdapter
ItemAdapter.ADAPTER_CLASSES.appendleft(AutoExtractAdapter)
For example, you can put this code in the settings.py file of your Scrapy project.
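To make the idea of unknown fields more concrete, here is a minimal, self-contained sketch of the general pattern: declared fields are populated from a response dictionary, and anything else is set aside so that it can optionally be included in the output. It uses only the standard library and is not the autoextract-poet implementation. The ArticleSketch class, its unknown_fields attribute, the include_unknown flag and the readingTime field are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for an item class (not part of autoextract-poet)."""

    headline: Optional[str] = None
    url: Optional[str] = None
    # Keys that are not declared above are stored here instead of being lost.
    unknown_fields: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def declared_names(cls) -> List[str]:
        # Names of the fields this version of the class knows about.
        return [f.name for f in fields(cls) if f.name != "unknown_fields"]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ArticleSketch":
        known = cls.declared_names()
        item = cls(**{k: v for k, v in data.items() if k in known})
        item.unknown_fields = {k: v for k, v in data.items() if k not in known}
        return item

    def asdict(self, include_unknown: bool = False) -> Dict[str, Any]:
        # By default only declared fields are exported; unknown ones are opt-in.
        out = {name: getattr(self, name) for name in self.declared_names()}
        if include_unknown:
            out.update(self.unknown_fields)
        return out


# "readingTime" plays the role of a field added to the API after this
# (hypothetical) class was written.
response = {"headline": "Hello", "url": "https://example.com", "readingTime": 3}
item = ArticleSketch.from_dict(response)

default_output = item.asdict()
extended_output = item.asdict(include_unknown=True)
```

In this sketch, default_output contains only headline and url, while extended_output also carries readingTime. Registering AutoExtractAdapter plays a comparable opt-in role for real autoextract-poet items when they are exported through itemadapter.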
API Reference¶
Contributing¶
autoextract-poet is an open-source project. Your contribution is very welcome!
Issue Tracker¶
If you have a bug report or a new feature proposal, or would simply like to ask a question, please check our issue tracker on GitHub: https://github.com/scrapinghub/autoextract-poet/issues
Source code¶
Our source code is hosted on GitHub: https://github.com/scrapinghub/autoextract-poet
Before opening a pull request, it is worth checking current and previous issues. Some code changes may require discussion before being accepted, so consider opening a new issue before implementing large or breaking changes.
Changelog¶
TBR¶
Support for all API page types at the moment
Introduction of _unknown_fields_dict and AutoExtractAdapter. They make it possible to extend items with custom attributes, and to include in the output any returned attributes not yet supported by the existing definitions.
Initial documentation
0.2.2 (2021-05-29)¶
Page classes for Article, Product and ProductList introduced
0.2.1 (2021-01-27)¶
AdditionalProperty value as optional to match unified-schema
0.2.0 (2020-12-30)¶
AutoExtractProductListData page input and ProductList item
from_dict of items no longer fails on unknown attributes; they’re ignored now
List attributes now default to [] instead of None
CI is switched to GitHub Actions
Python 3.9 is added to CI
0.1.0 (2020-11-19)¶
AutoExtractHtml page input
AutoExtractWebPage and AutoExtractItemWebPage base page objects
0.0.1 (2020-08-18)¶
Initial release.
Article and Product page inputs
Article and Product items (and their dependencies)
License¶
Copyright (c) Zyte Group Ltd All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Zyte nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.