In this block

  • Overview Webarchive
  • Overview Content
  • Overview API
  • Example: Interacting with the API
  • Example: Wayback search via API
  • Example: Full text search via API
  • Example: Download preview SVG thumb of saved page

Overview Webarchive

https://webarchiv.onb.ac.at

ÖNB Webarchive Terminal

What is the Webarchive Austria?

  • Attempt to conserve online data for future generations
  • Webarchive Austria has been crawling officially since March 2009
  • All domains within .at, .ac.at, .gv.at, .wien, .tirol
  • Selected other domains with 'Austrian content'
  • About 2 million websites saved

Who is the Webarchive Austria?

  • Andreas Predikaka
  • webarchiv@onb.ac.at

What can I use?

  • Websites: no public access
    • Access on premises at the ÖNB
    • Exception: onb.ac.at
  • Metadata: public access
  • Full text search: public access
    • Viewing the results of the full text search: no public access

What does that mean?

  • Searching from outside the ÖNB gives you URLs, but not the page content

What if I really really need to see the content?

  • You can come to the ÖNB in person and use one of two offline computers...
  • ...to PRINT OUT THE INTERNET!

Office folders labeled 'Internet'

How is a search where I don't see detailed results useful to me?

  • Sometimes the content is still online
  • Sometimes the Internet Archive has a copy
  • You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
  • ???
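Checking whether the Internet Archive has a copy can be scripted. The following is a minimal sketch, not part of the webarchiv library: it assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and its JSON response with an "archived_snapshots" / "closest" entry. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL for one page, optionally near a given date."""
    params = {"url": page_url}
    if timestamp:
        # Format YYYYMMDD[hhmmss]; the service returns the closest capture.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the URL of the closest available snapshot, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the live service (needs network access)."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup() with a URL returned by the full text search, plus the date of the capture you are interested in, gives either a replay URL at the Internet Archive or None.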

What's inside?

  • High crawl frequency (daily or weekly)
    • Media sites (ORF)
    • Political parties
  • Low crawl frequency (a few times per year)
    • Topic: Gender
    • Austrian domains (via nic.at)
  • Event crawls (daily or weekly within a certain timespan)
    • Elections
    • Olympic Games
    • Refugee crisis 2015
    • Song Contest 2015

Can I have a list?

[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
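The excerpt above is cut off, but its shape (collections containing groups, groups containing seeds) is enough to sketch how such a list could be flattened in Python. The sample below is a shortened, well-formed stand-in built only from the entries shown; iter_seeds is a hypothetical helper, not part of the webarchiv library.

```python
import json

# Shortened, well-formed sample in the same shape as the excerpt.
SAMPLE = """
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {"seeds": ["http://abtreibung.at/"], "group_id": 1,
       "name": "Abtreibung.at"},
      {"seeds": ["http://aep.at"], "group_id": 2,
       "name": "Arbeitskreis Emanzipation Partnerschaft"}
    ]
  }
]
"""


def iter_seeds(collections):
    """Yield (collection name, group name, seed URL) for every seed."""
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                yield collection["name"], group["name"], seed


rows = list(iter_seeds(json.loads(SAMPLE)))
for row in rows:
    print(" | ".join(row))
```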

How big is the Austrian Webarchive?

  • About 500GiB indexed text
  • About 100 million HTML documents
  • Raw data: 115.28TiB uncompressed

Where's the catch?

  • Social media is currently too hard to crawl
  • Limited disk space necessitates a size limit per page
    • Example: domain crawl limit 10MB -> 100MB -> 7GB
  • Limitations of public access
    • Applies to practically every web archive except the Internet Archive

How can I access the Austrian Webarchive?

Why is access via API useful?

  • Individual searches may take up to 1 minute
  • Sift through loads of metadata
  • API-only goodies
    • Easily nominate pages with Austrian content to be saved
    • Download SVG thumbnails of rendered websites
  • It's way more fun
  • Make Andreas happy :)

Tracking

We'd love to properly count the unique visitors in the Webarchive backend, so we kindly ask you to opt in to tracking by instantiating webarchiv.WebarchivSession with the parameter allow_tracking=True.

This sends your SHA256-hashed MAC address as a fingerprint to the server on authentication. It is only ever used to count unique users.

If you leave allow_tracking at its default value False, an empty string is sent as the fingerprint.
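To give an idea of what such a fingerprint looks like, here is a minimal sketch using only the Python standard library. It is an illustration, not the webarchiv library's actual code: the function name mac_fingerprint and the exact byte format that gets hashed (colon-separated lowercase hex) are assumptions.

```python
import hashlib
import uuid


def mac_fingerprint(allow_tracking=False):
    """Return a SHA256 hex digest of this machine's MAC address.

    Returns an empty string when tracking is not allowed.
    """
    if not allow_tracking:
        return ""
    # uuid.getnode() returns the hardware address as a 48-bit integer
    # (or a random number if no MAC address can be determined).
    mac = uuid.getnode()
    # Assumed text form: six colon-separated hex bytes, e.g. 00:1a:2b:3c:4d:5e
    mac_text = ":".join(
        f"{(mac >> shift) & 0xff:02x}" for shift in range(40, -8, -8)
    )
    return hashlib.sha256(mac_text.encode("ascii")).hexdigest()


print(repr(mac_fingerprint()))             # empty string, tracking off
print(len(mac_fingerprint(allow_tracking=True)))  # length of the hex digest
```

Because only the hash is computed, the MAC address itself never appears in the result, and the same machine always produces the same fingerprint.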