Commit 34995fc1 authored by Stefan Karner

Complete slides for section4 - Overview Webarchive

parent 56191571
+119 −15
%% Cell type:markdown id: tags:

# 4 - Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)

%% Cell type:markdown id: tags:

### In this block

* Overview Webarchive
* Overview Content
* Overview API

%% Cell type:markdown id: tags:

* Example Wayback search via API
* Example full text search via API
* Example download preview SVG thumb of saved page

%% Cell type:markdown id: tags:

## Overview Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### What is the Webarchive Austria?

%% Cell type:markdown id: tags:

* Attempt to conserve online data for future generations
* Webarchive Austria has been crawling officially since March 2009
* All domains within `.at`, `.ac.at`, `.gv.at`, `.wien`, `.tirol`
* Selected other domains with 'Austrian content'
* About 2 million websites saved
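
To get a feel for the domain list above, here is a minimal sketch that checks whether a URL's host ends with one of the listed suffixes. It only illustrates the list on this slide; the crawler's real scoping rules are not described here, and the names `IN_SCOPE_SUFFIXES` and `is_in_domain_crawl_scope` are made up for this example.

```python
from urllib.parse import urlsplit

# Suffixes copied from the slide. ".ac.at" and ".gv.at" are already covered
# by ".at" and are listed only to mirror the slide.
IN_SCOPE_SUFFIXES = (".at", ".ac.at", ".gv.at", ".wien", ".tirol")


def is_in_domain_crawl_scope(url: str) -> bool:
    """Return True if the URL's host ends with one of the listed suffixes.

    The URL must include a scheme (e.g. "https://"); otherwise urlsplit
    cannot identify the host and the function returns False.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host.endswith(suffix) for suffix in IN_SCOPE_SUFFIXES)


if __name__ == "__main__":
    for candidate in ("https://webarchiv.onb.ac.at", "https://example.com"):
        print(candidate, is_in_domain_crawl_scope(candidate))
```

Domains outside these suffixes can still be in the archive as "selected other domains", so a `False` here does not mean a site was never saved.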

%% Cell type:markdown id: tags:

### Who is the Webarchive Austria?

%% Cell type:markdown id: tags:

* Andreas Predikaka
* webarchiv@onb.ac.at

%% Cell type:markdown id: tags:

### What can I use?

%% Cell type:markdown id: tags:

* Websites: no public access
  * Access on premises at the ÖNB
  * Exception: onb.ac.at
* Metadata: public access
* Full text search: public access
  * Viewing the results of the full text search: no public access

%% Cell type:markdown id: tags:

### What does that mean?

%% Cell type:markdown id: tags:

* Searching from outside the ÖNB gives you URLs, but not the page content

%% Cell type:markdown id: tags:

### What if I really really need to see the content?

%% Cell type:markdown id: tags:

* You can come to the ÖNB in person and use one of two offline computers...

%% Cell type:markdown id: tags:

* ...to `PRINT OUT THE INTERNET!`

%% Cell type:markdown id: tags:

![Office folders labeled 'Internet'](./media/internet-folders.jpg)

%% Cell type:markdown id: tags:

### How is a search where I don't see detailed results useful to me?

%% Cell type:markdown id: tags:

* Sometimes the content is still online
* Sometimes the Internet Archive has a copy
* You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
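
For the second point, one way to check whether the Internet Archive holds a copy of a URL found in a search is its public Wayback availability endpoint. The sketch below is not part of the ÖNB API. The helper names are invented for this example, and it assumes the response layout `archived_snapshots` → `closest` → `available` / `url`.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Internet Archive (not the ÖNB API).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the snapshot URL from a parsed response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: Optional[str] = None,
           timeout: float = 30.0) -> Optional[str]:
    """Ask the Internet Archive for the closest snapshot (needs network access)."""
    with urlopen(build_availability_url(page_url, timestamp), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("http://aep.at")` returns a snapshot URL or `None`. The two helper functions are kept separate so the parsing can be tried offline on a saved response.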

%% Cell type:markdown id: tags:

* ???

%% Cell type:markdown id: tags:

## Overview Content

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### What's inside?

%% Cell type:markdown id: tags:

* High crawl frequency (daily or weekly)
  * Media sites (ORF)
  * Political parties
* Low crawl frequency (a few times per year)
  * Topic: Gender
  * Austrian domains (via nic.at)
* Event crawls (daily or weekly within a certain timespan)
  * Elections
  * Olympic Games
  * Refugee crisis 2015
  * Song Contest 2015

%% Cell type:markdown id: tags:

### Can I have a list?

%% Cell type:markdown id: tags:

* Sure, there you go:
  * Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)
  * Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)
  * All domains: **TODO: ADD LINK**

```json
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
```
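
The nesting in the excerpt above (collection → groups → seeds) can be flattened with a few lines of Python. This is a minimal sketch that relies only on the keys visible in the excerpt (`name`, `begin`, `groups`, `seeds`) and assumes `begin` is always written as `DD.MM.YYYY`. The full file may contain further fields, and `parse_begin` and `iter_seeds` are names invented for this example.

```python
from datetime import date, datetime

# The excerpt from the slide, closed off so that it parses.
SAMPLE = [
    {
        "id": 37,
        "name": "Frau/Gender",
        "begin": "29.11.2016",
        "groups": [
            {"seeds": ["http://abtreibung.at/"], "group_id": 1,
             "name": "Abtreibung.at"},
            {"seeds": ["http://aep.at"], "group_id": 2,
             "name": "Arbeitskreis Emanzipation Partnerschaft"},
        ],
    }
]


def parse_begin(value: str) -> date:
    """Parse a 'begin' value, assuming the DD.MM.YYYY format seen in the excerpt."""
    return datetime.strptime(value, "%d.%m.%Y").date()


def iter_seeds(collections):
    """Yield (collection name, group name, seed URL) for every seed."""
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                yield collection["name"], group["name"], seed


if __name__ == "__main__":
    for collection_name, group_name, seed in iter_seeds(SAMPLE):
        print(collection_name, "|", group_name, "|", seed)
    print(parse_begin(SAMPLE[0]["begin"]))
```

To run this against the real list, load the downloaded `selective.json` with `json.load` and pass the result to `iter_seeds` instead of `SAMPLE`.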

%% Cell type:markdown id: tags:

### How big is the Austrian Webarchive?

%% Cell type:markdown id: tags:

* About 500 GiB indexed text
* About 100 million HTML documents
* Raw data: 115.28 TiB uncompressed
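
A quick back-of-the-envelope calculation puts these figures in relation to each other. It takes the rounded numbers above at face value, so the results are only rough orders of magnitude: about 5 KiB of indexed text per HTML document, and indexed text making up well under 1 % of the raw data.

```python
GIB = 2**30  # bytes per gibibyte
TIB = 2**40  # bytes per tebibyte

# Rounded figures from the slide.
indexed_text_bytes = 500 * GIB
html_documents = 100_000_000
raw_bytes = 115.28 * TIB

# Average amount of indexed text per HTML document, in KiB.
avg_text_per_doc_kib = indexed_text_bytes / html_documents / 1024

# Share of the raw data that ends up as indexed text, in percent.
indexed_share_percent = indexed_text_bytes / raw_bytes * 100

print(f"Indexed text per document: {avg_text_per_doc_kib:.1f} KiB")
print(f"Indexed text as share of raw data: {indexed_share_percent:.2f} %")
```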

%% Cell type:markdown id: tags:

### Where's the catch?

%% Cell type:markdown id: tags:

* Social media is currently too hard to crawl
* Limited disk space necessitates a size limit per page
  * Example: domain crawl limit 10 MB -> 100 MB -> 7 GB
* Limitations of public access
  * Applies to practically every web archive except the Internet Archive

%% Cell type:markdown id: tags:

## Overview API

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### How can I access the Austrian Webarchive?

%% Cell type:markdown id: tags:

* On site at the ÖNB
* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)
  * Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)
* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))
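
Before reaching for the Python module, it can help to see which operations the REST API offers. The sketch below downloads the Swagger definition linked above and lists its operations. It assumes the file follows the usual Swagger/OpenAPI layout with a top-level `paths` object. It does not use `webarchiv.py`, and `fetch_spec` and `list_operations` are names invented for this example.

```python
import json
from urllib.request import urlopen

SWAGGER_URL = "https://webarchiv.onb.ac.at/api/swagger.json"

# Keys inside a path item that denote HTTP operations in Swagger/OpenAPI.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def fetch_spec(url: str = SWAGGER_URL, timeout: float = 60.0) -> dict:
    """Download and parse the Swagger definition (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def list_operations(spec: dict) -> list:
    """Return (METHOD, path, summary) tuples, sorted by path."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, operation in item.items():
            # Skip non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path, operation.get("summary", "")))
    return operations


if __name__ == "__main__":
    for method, path, summary in list_operations(fetch_spec()):
        print(f"{method:6} {path}  {summary}")
```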

%% Cell type:markdown id: tags:

### Why is access via API useful?

%% Cell type:markdown id: tags:

* Individual searches may take up to 1 minute
* Sift through loads of metadata
* API-only goodies
  * Easily nominate pages with Austrian content to be saved
  * Download SVG thumbnails of rendered websites
* It's way more fun

%% Cell type:markdown id: tags:

* Make Andreas happy :)