Commit 5e828a4c authored by Stefan Karner
Add link to webarchive-api to webarchive overview

parent 6416392b
+2 −0
%% Cell type:markdown id: tags:

# 4 - Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)

[https://labs.onb.ac.at/gitlab/labs-team/webarchive-api](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api)

%% Cell type:markdown id: tags:

### In this block

* Overview Webarchive
* Overview Content
* Overview API

%% Cell type:markdown id: tags:

* Example: Interacting with the API
* Example: Wayback search via API
* Example: Full text search via API
* Example: Download preview SVG thumb of saved page

%% Cell type:markdown id: tags:

## Overview Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

![ÖNB Webarchive Terminal](https://webarchiv.onb.ac.at/web/20170925041718/https://webarchiv.onb.ac.at/img/webarchiv_terminal1.jpg)

%% Cell type:markdown id: tags:

### What is the Webarchive Austria?

%% Cell type:markdown id: tags:

* Attempt to conserve online data for future generations
* Webarchive Austria has been crawling officially since March 2009
* All domains within `.at`, `.ac.at`, `.gv.at`, `.wien`, `.tirol`
* Selected other domains with 'Austrian content'
* About 2 million websites saved

%% Cell type:markdown id: tags:

### Who is the Webarchive Austria?

%% Cell type:markdown id: tags:

* Andreas Predikaka
* webarchiv@onb.ac.at

%% Cell type:markdown id: tags:

### What can I use?

%% Cell type:markdown id: tags:

* Websites: no public access
  * Access on premises at the ÖNB
  * Exception: onb.ac.at
* Metadata: public access
* Full text search: public access
  * Viewing the results of the full text search: no public access

%% Cell type:markdown id: tags:

### What does that mean?

%% Cell type:markdown id: tags:

* Searching from outside the ÖNB gives you URLs, but not the page content

%% Cell type:markdown id: tags:

### What if I really really need to see the content?

%% Cell type:markdown id: tags:

* You can come to the ÖNB in person and use one of two offline computers...

%% Cell type:markdown id: tags:

* ...to `PRINT OUT THE INTERNET!`

%% Cell type:markdown id: tags:

![Office folders labeled 'Internet'](./media/internet-folders.jpg)

%% Cell type:markdown id: tags:

### How is a search where I don't see detailed results useful to me?

%% Cell type:markdown id: tags:

* Sometimes the content is still online
* Sometimes the Internet Archive has a copy
* You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
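
To make the "emergence of terms" idea concrete: given the capture timestamps of the hits for a search term, you can count hits per year. This is only a sketch. It assumes timestamps in the 14-digit `YYYYMMDDhhmmss` form that appears in Webarchive URLs (e.g. `20170925041718`), and the sample list is made up for illustration; it is not real search output.

```python
from collections import Counter


def hits_per_year(timestamps):
    """Count captures per year from 14-digit YYYYMMDDhhmmss timestamps."""
    counts = Counter(ts[:4] for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical timestamps, NOT real search results.
sample = ["20140301120000", "20150915080000", "20150920093000", "20160110070000"]
print(hits_per_year(sample))
```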

%% Cell type:markdown id: tags:

* ???

%% Cell type:markdown id: tags:

## Overview Content

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### What's inside?

%% Cell type:markdown id: tags:

* High crawl frequency (daily or weekly)
  * Media sites (ORF)
  * Political parties
* Low crawl frequency (a few times per year)
  * Topic: Gender
  * Austrian domains (via nic.at)
* Event crawls (daily or weekly within a certain timespan)
  * Elections
  * Olympic Games
  * Refugee crisis 2015
  * Eurovision Song Contest 2015

%% Cell type:markdown id: tags:

### Can I have a list?

%% Cell type:markdown id: tags:

* Sure, there you go:
  * Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)
  * Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)
  * All domains: [https://webarchiv.onb.ac.at/data/domainnames.json](https://webarchiv.onb.ac.at/data/domainnames.json)

```json
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
```
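
A minimal sketch of how such a list could be flattened into one row per seed URL. It assumes the full file follows the same structure as the excerpt above (a list of collections, each with `groups` that carry `seeds`); the other files may be laid out differently. The excerpt is closed off by hand here so that it parses, and `iter_seeds` is a name chosen for this example.

```python
import json

# Excerpt of the structure shown above, closed off so that it is valid JSON.
excerpt = """
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {"seeds": ["http://abtreibung.at/"], "group_id": 1,
       "name": "Abtreibung.at"},
      {"seeds": ["http://aep.at"], "group_id": 2,
       "name": "Arbeitskreis Emanzipation Partnerschaft"}
    ]
  }
]
"""


def iter_seeds(collections):
    """Yield (collection name, group name, seed URL) for every seed."""
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                yield collection["name"], group["name"], seed


seeds = list(iter_seeds(json.loads(excerpt)))
for row in seeds:
    print(*row, sep=" | ")
```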

%% Cell type:markdown id: tags:

### How big is the Austrian Webarchive?

%% Cell type:markdown id: tags:

* About 500 GiB of indexed text
* About 100 million HTML documents
* Raw data: 115.28 TiB uncompressed

%% Cell type:markdown id: tags:

### Where's the catch?

%% Cell type:markdown id: tags:

* Social media is currently too hard to crawl
* Limited disk space necessitates a size limit per page
  * E.g. domain crawl: 10 MB -> 100 MB -> 7 GB
* Limitations of public access
  * Applies to practically every web archive except the Internet Archive

%% Cell type:markdown id: tags:

## Overview API

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### How can I access the Austrian Webarchive?

%% Cell type:markdown id: tags:

* On site at the ÖNB
* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)
  * Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)
* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))
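
To get a quick overview of what the REST API offers, one option is to list the operations in the Swagger definition. The sketch below assumes the definition is a standard Swagger 2.0 document with a top-level `paths` object. The sample definition and its paths are placeholders invented for this example, not the real Webarchive endpoints.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def list_operations(spec):
    """Return sorted (METHOD, path) pairs from a Swagger 2.0 document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations)


# Hypothetical mini-definition; these paths are placeholders, not the real API.
sample_spec = {
    "swagger": "2.0",
    "paths": {
        "/example/search": {"get": {}, "parameters": []},
        "/example/login": {"post": {}},
    },
}
operations = list_operations(sample_spec)
for method, path in operations:
    print(method, path)
```

To try this on the real definition, load the Swagger URL above with `json.load(urllib.request.urlopen(...))` and pass the result to `list_operations`.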

%% Cell type:markdown id: tags:

### Why is access via API useful?

%% Cell type:markdown id: tags:

* Individual searches may take up to 1 minute
* Sift through loads of metadata
* API-only goodies
  * Easily nominate pages with Austrian content to be saved
  * Download SVG thumbnails of rendered websites
* It's way more fun

%% Cell type:markdown id: tags:

* Make Andreas happy :)

%% Cell type:markdown id: tags:

## Tracking

%% Cell type:markdown id: tags:

<img src="https://purepng.com/public/uploads/large/purepng.com-cute-dog-whelpdogdoggycutehoundwhelpbrownbegging-451520332429trejj.png" style="max-height:200px;" />

We'd love to properly count the unique visitors in the Webarchive backend, so we kindly ask you to **opt in to tracking by instantiating `webarchiv.WebarchivSession` with the parameter `allow_tracking=True`**.

This sends your SHA256-hashed MAC address as a fingerprint to the server on authentication. It is only ever used to count unique users.

If you leave `allow_tracking` at the default value `False`, an empty string is sent as the fingerprint.
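
To illustrate what such a fingerprint could look like, here is a minimal sketch. It is not the code from `webarchiv.py`: the function name `fingerprint`, the `mac` parameter and the hex encoding of the MAC address before hashing are assumptions made for this example.

```python
import hashlib
import uuid


def fingerprint(allow_tracking=False, mac=None):
    """Return a SHA-256 hex digest of the MAC address, or "" if not opted in.

    Assumption: the MAC is hashed as a 12-digit lowercase hex string; the
    real module may encode it differently.
    """
    if not allow_tracking:
        return ""
    if mac is None:
        # uuid.getnode() returns the hardware address as a 48-bit integer.
        mac = uuid.getnode()
    mac_hex = format(mac, "012x")
    return hashlib.sha256(mac_hex.encode("ascii")).hexdigest()


print(repr(fingerprint()))                    # default: no tracking
print(len(fingerprint(allow_tracking=True)))  # length of the hex digest
```

Because only the hash is sent, the server can tell two fingerprints apart without learning the MAC address itself.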
+1 −0
@@ -13118,6 +13118,7 @@ div#notebook {
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="4---Webarchive">4 - Webarchive<a class="anchor-link" href="#4---Webarchive">&#182;</a></h1><p><a href="https://webarchiv.onb.ac.at">https://webarchiv.onb.ac.at</a></p>
<p><a href="https://labs.onb.ac.at/dataset/webarchive/">https://labs.onb.ac.at/dataset/webarchive/</a></p>
<p><a href="https://labs.onb.ac.at/gitlab/labs-team/webarchive-api">https://labs.onb.ac.at/gitlab/labs-team/webarchive-api</a></p>

</div>
</div>
+13719 −0

File added.
