autoextract-poet documentation

autoextract-poet contains common item definitions. These items can be extracted automatically using the Zyte AutoExtract API (for example, with scrapy-poet and scrapy-autoextract).

The AutoExtract API converts pages into data automatically. It supports multiple types of pages, such as articles, products, real estate, comments, job postings, reviews, etc. See the full list of supported page types here.

See also web-poet for an introduction to the Page Objects paradigm and the scrapy-poet tutorial for an introduction to using Page Objects with Scrapy spiders.

License is BSD 3-clause.

Introduction

Installing autoextract-poet

autoextract-poet is a regular PyPI package that can be installed using pip: pip install autoextract-poet. It is also a dependency of scrapy-autoextract, and installed automatically if you use scrapy-autoextract.

Basic usage

You can use items defined by autoextract-poet just like regular Python objects, to standardize item definitions. They are implemented as attrs classes, and can be used as Scrapy items directly, or converted to dictionaries (e.g. for serialization) via itemadapter. The full list of items is available in autoextract_poet.items.
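The general pattern of treating an item as a plain object and turning it into a dictionary for serialization can be sketched as follows. This is only an illustration under stated assumptions: ProductSketch is a hypothetical stand-in built with the standard library's dataclasses so that the example runs without extra dependencies. It is not one of the classes from autoextract_poet.items, and dataclasses.asdict stands in for the conversion that itemadapter performs on real items.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ProductSketch:
    """Hypothetical stand-in for an item class (not the library's own)."""

    name: Optional[str] = None
    price: Optional[str] = None
    # List attributes default to an empty list rather than None.
    images: List[str] = field(default_factory=list)


# Items behave like ordinary Python objects.
item = ProductSketch(name="Kettle", price="19.99")
item.images.append("https://example.com/kettle.jpg")

# Convert to a plain dictionary, then serialize it.
as_dict = asdict(item)
serialized = json.dumps(as_dict, sort_keys=True)
```

With real autoextract-poet items, the same two steps would go through itemadapter instead of dataclasses.asdict.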

scrapy-autoextract provides an automatic way to extract the items defined here from any website, using Scrapy and the AutoExtract API. See the scrapy-autoextract documentation for more.

Compatibility with new fields added to the API

New fields may be added to the AutoExtract API over time. When you create autoextract-poet items from AutoExtract responses, the library ignores unknown fields by default until you upgrade to a version that defines them. However, you might want to keep the unknown (new) fields even if you don’t update the autoextract-poet library.

If you’re using Scrapy (or itemadapter), you can expose these unknown attributes in the output by registering AutoExtractAdapter in itemadapter’s ADAPTER_CLASSES:

from autoextract_poet import AutoExtractAdapter
from itemadapter import ItemAdapter
ItemAdapter.ADAPTER_CLASSES.appendleft(AutoExtractAdapter)

For example, you can put this code in the settings.py file of your Scrapy project.
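The behaviour described in this section (unknown fields ignored by default, optionally exposed in the output) can be illustrated with a small standalone sketch. Everything in it is an assumption made for illustration: ArticleSketch, its unknown_fields attribute and the to_output helper are hypothetical names built on standard-library dataclasses. They are not the library's classes and do not reflect how AutoExtractAdapter is actually implemented.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for an item class (not the library's own)."""

    headline: Optional[str] = None
    url: Optional[str] = None
    # Side storage for response keys that this class does not declare.
    unknown_fields: Dict[str, Any] = field(default_factory=dict, repr=False)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ArticleSketch":
        known = {f.name for f in fields(cls)} - {"unknown_fields"}
        # Declared fields become attributes; everything else is set aside.
        item = cls(**{k: v for k, v in data.items() if k in known})
        item.unknown_fields = {k: v for k, v in data.items() if k not in known}
        return item


def to_output(item: ArticleSketch, include_unknown: bool = False) -> Dict[str, Any]:
    """Build the output dict, optionally merging in the unknown fields."""
    out = {
        f.name: getattr(item, f.name)
        for f in fields(item)
        if f.name != "unknown_fields"
    }
    if include_unknown:
        out.update(item.unknown_fields)
    return out


# "sentiment" plays the role of a field newly added to the API.
response = {"headline": "Hello", "url": "https://example.com/a", "sentiment": "positive"}
article = ArticleSketch.from_dict(response)

default_output = to_output(article)                         # unknown field dropped
extended_output = to_output(article, include_unknown=True)  # unknown field kept
```

In this sketch, include_unknown=True plays the role that registering AutoExtractAdapter plays for real items: the new field reaches the output without any change to the item definition.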

API Reference

autoextract_poet

Contributing

autoextract-poet is an open-source project. Your contribution is very welcome!

Issue Tracker

If you have a bug report, a new feature proposal, or simply would like to ask a question, please check our issue tracker on GitHub: https://github.com/scrapinghub/autoextract-poet/issues

Source code

Our source code is hosted on GitHub: https://github.com/scrapinghub/autoextract-poet

Before opening a pull request, it is worth checking current and previous issues. Some code changes may require discussion before being accepted, so consider opening a new issue before implementing large or breaking changes.

Testing

We use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Changelog

TBR

  • Support for all API page types at the moment

  • Introduction of _unknown_fields_dict and AutoExtractAdapter. These allow extending items with custom attributes and including in the output any returned attributes not yet supported by the existing definitions.

  • Initial documentation

0.2.2 (2021-05-29)

  • Page classes for Article, Product and ProductList introduced

0.2.1 (2021-01-27)

  • AdditionalProperty value as optional to match unified-schema

0.2.0 (2020-12-30)

  • AutoExtractProductListData page input and ProductList item

  • from_dict of items no longer fails on unknown attributes; they’re ignored now

  • List attributes now default to [] instead of None

  • CI is switched to GitHub Actions

  • Python 3.9 is added to CI

0.1.0 (2020-11-19)

  • AutoExtractHtml page input

  • AutoExtractWebPage and AutoExtractItemWebPage base page objects

0.0.1 (2020-08-18)

Initial release.

  • Article and Product page inputs

  • Article and Product items (and their dependencies)

License

Copyright (c) Zyte Group Ltd All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of Zyte nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.