Webarchive Austria

Since March 2009, the Webarchive Austria (https://webarchiv.onb.ac.at) has archived the “Austrian webspace”. The data is collected through regular domain, thematic and event-based crawls. These crawls cover all .at, .ac.at and .gv.at domains, all .wien and .tirol domains, and further websites with Austrian content. The data set includes metadata for selected domains within topical collections. The webarchive API supports URL searches and partial full-text searches.

The Webarchive Austria includes more than 2 million websites. The metadata is licensed under the Creative Commons Zero (CC0) license.

Tools & Experiments

Jupyter Notebooks

Examples for using the webarchive API in Python

Data

Currently the following data is accessible:

Dataset	Description	Link
Selective Crawls	Basis for the webarchive collection “Laufende Crawls”	/data/selective.json
Event Crawls	Basis for the webarchive collection “Event Crawls”	/data/events.json
Other Web Archives	Links to other Wayback Machine instances that accept queries in the same format as the Webarchive Austria	/data/otherwebarchives.json
Object Count	Number of objects currently in the webarchive	/data/objectcount.json
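
The data files listed above are plain JSON, so they can be loaded with the Python standard library alone. The following is a minimal sketch, not an official client: `base_url` is an assumption (the host that serves this page is not stated here), the names `DATASETS`, `dataset_url` and `load_dataset` are hypothetical, and no particular structure of the returned JSON is assumed.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Relative paths exactly as listed in the table above.
# The short keys are hypothetical labels chosen for this sketch.
DATASETS = {
    "selective": "/data/selective.json",
    "events": "/data/events.json",
    "otherwebarchives": "/data/otherwebarchives.json",
    "objectcount": "/data/objectcount.json",
}


def dataset_url(base_url, name):
    """Build the absolute URL of a data file.

    base_url is an assumption: it must be the site that hosts this page.
    """
    return urljoin(base_url, DATASETS[name])


def load_dataset(base_url, name, opener=urlopen):
    """Download one data file and parse it as JSON.

    opener can be replaced (for example in tests) by any callable that
    takes a URL and returns a file-like context manager.
    """
    with opener(dataset_url(base_url, name)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `load_dataset(base_url, "objectcount")` would then return the parsed content of /data/objectcount.json. Inspect the result before relying on specific keys, since the layout of each file is not documented in this section.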

Code

APIs and modules

Resource	Description	Link
API description	API description using Swagger - swagger.json	Swagger Tool
Python binding	Python module for using the webarchive API	webarchiv.py

Tutorials

Instructive Jupyter Notebooks

Notebook	Description	Link
Notebook Selective	Extract all seeds from a selective crawl	sample1.ipynb
Notebook Wayback Search	Search for all captures of a URL and process the results	sample2.ipynb
Notebook Text Search	Search within the webarchive’s text and process the results	sample3.ipynb
Notebook Combined Search	Run a Wayback search for all URLs of a selective crawl	sample4.ipynb