diff --git a/4 - Webarchive.ipynb b/4 - Webarchive.ipynb
index ecfa976d442e4cffd4271756d7bf176a2d3a17b3..08be5fb8196a623f7ca04d4fa67d2cfdb557760a 100644
--- a/4 - Webarchive.ipynb
+++ b/4 - Webarchive.ipynb
@@ -12,7 +12,9 @@
  "\n",
  "[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)\n",
  "\n",
- "[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)"
+ "[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)\n",
+ "\n",
+ "[https://labs.onb.ac.at/gitlab/labs-team/webarchive-api](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api)"
  ]
  },
  {
diff --git a/html-versions/4 - Webarchive.html b/html-versions/4 - Webarchive.html
index 82b4b5a9f24a1de80b65cf6fe6ff6d3359c778e6..a11bcaf0b2577fc05d314f7b774b35630c87bdb9 100644
--- a/html-versions/4 - Webarchive.html
+++ b/html-versions/4 - Webarchive.html
@@ -13118,6 +13118,7 @@ div#notebook {

4 - Webarchive

https://webarchiv.onb.ac.at

https://labs.onb.ac.at/dataset/webarchive/

+

https://labs.onb.ac.at/gitlab/labs-team/webarchive-api

diff --git a/html-versions/4 - Webarchive.slides.html b/html-versions/4 - Webarchive.slides.html
new file mode 100644
index 0000000000000000000000000000000000000000..c0c489b666f44731e57e4b1e7cd354a1bea1806c
--- /dev/null
+++ b/html-versions/4 - Webarchive.slides.html
@@ -0,0 +1,13719 @@
+4 - Webarchive slides
+
+
+
+
+
+
+

In this block

  • Overview Webarchive
  • Overview Content
  • Overview API
+ +
+
+
+
+
+
+
  • Example: Interacting with the API
  • Example: Wayback search via API
  • Example: Full text search via API
  • Example: Download preview SVG thumb of saved page
+ +
+
+
+
+
+
+

Overview Webarchive

https://webarchiv.onb.ac.at

+ +
+
+
+
+
+
+

ÖNB Webarchive Terminal

+ +
+
+
+
+
+
+

What is the Webarchive Austria?

+
+
+
+
+
+
+
  • Attempt to conserve online data for future generations
  • Webarchive Austria has been crawling officially since March 2009
  • All domains within .at, .ac.at, .gv.at, .wien, .tirol
  • Selected other domains with 'Austrian content'
  • About 2 million websites saved
+ +
+
+
+
+
+
+

Who is the Webarchive Austria?

+
+
+
+
+
+
+
  • Andreas Predikaka
  • webarchiv@onb.ac.at
+ +
+
+
+
+
+
+

What can I use?

+
+
+
+
+
+
+
  • Websites: no public access
    • Access on premises at the ÖNB
    • Exception: onb.ac.at
  • Metadata: public access
  • Full text search: public access
    • Viewing the results of the full text search: no public access
+ +
+
+
+
+
+
+

What does that mean?

+
+
+
+
+
+
+
  • Searching from outside the ÖNB gives you URLs, but not the page content
+ +
+
+
+
+
+
+

What if I really really need to see the content?

+
+
+
+
+
+
+
  • You can come to the ÖNB in person and use one of two offline computers...
+ +
+
+
+
+
+
+
  • ...to PRINT OUT THE INTERNET!
+ +
+
+
+
+
+
+

Office folders labeled 'Internet'

+ +
+
+
+
+
+
+

How is a search where I don't see detailed results useful to me?

+
+
+
+
+
+
+
  • Sometimes the content is still online
  • Sometimes the Internet Archive has a copy
  • You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
+ +
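For the second point above, a URL returned by a search can be checked against the Internet Archive's public Wayback availability endpoint. The following is a minimal sketch and not part of the webarchive-api repository: the helper names `availability_url`, `closest_snapshot` and `lookup` are made up for illustration, and the example timestamp is an arbitrary choice.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Internet Archive (not an ÖNB service).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response):
    """Return the URL of the closest available snapshot, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return a snapshot URL or None."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("http://aep.at", "20161129")` performs a live request, so the result depends on what the Internet Archive holds at that moment.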
+
+
+
+
+
+
  • ???
+ +
+
+
+
+
+
+

Overview Content

https://webarchiv.onb.ac.at

+ +
+
+
+
+
+
+

What's inside?

+
+
+
+
+
+
+
  • High crawl frequency (daily or weekly)
    • Media sites (ORF)
    • Political parties
  • Low crawl frequency (a few times per year)
    • Topic: Gender
    • Austrian domains (via nic.at)
  • Event crawls (daily or weekly within a certain timespan)
    • Elections
    • Olympic Games
    • Refugee crisis 2015
    • Song Contest 2015
+ +
+
+
+
+
+
+

Can I have a list?

+
+
+
+
+
+
+ +
[
+  {
+    "id": 37,
+    "name": "Frau/Gender",
+    "begin": "29.11.2016",
+    "groups": [
+      {
+        "seeds": [
+          "http://abtreibung.at/"
+        ],
+        "group_id": 1,
+        "name": "Abtreibung.at"
+      },
+      {
+        "seeds": [
+          "http://aep.at"
+        ],
+        "group_id": 2,
+        "name": "Arbeitskreis Emanzipation Partnerschaft"
+      },
+
+ +
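The excerpt above is cut off, but its shape (collections containing groups, each group containing seed URLs) is enough to sketch how such a list could be flattened. This is a minimal sketch under assumptions: the sample closes the excerpt's brackets by hand, `list_seeds` is a hypothetical helper, and where the full list is obtained from is not shown here.

```python
import json
from datetime import datetime

# The excerpt from the slide, closed by hand so that it parses.
SAMPLE = """
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {"seeds": ["http://abtreibung.at/"], "group_id": 1,
       "name": "Abtreibung.at"},
      {"seeds": ["http://aep.at"], "group_id": 2,
       "name": "Arbeitskreis Emanzipation Partnerschaft"}
    ]
  }
]
"""


def list_seeds(collections):
    """Flatten collections into (collection name, group name, seed URL) rows."""
    rows = []
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                rows.append((collection["name"], group["name"], seed))
    return rows


collections = json.loads(SAMPLE)
seed_rows = list_seeds(collections)

# "begin" uses the day.month.year format seen in the excerpt.
begin_date = datetime.strptime(collections[0]["begin"], "%d.%m.%Y").date()

for collection_name, group_name, seed in seed_rows:
    print(f"{collection_name} | {group_name} | {seed}")
```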
+
+
+
+
+
+

How big is the Austrian Webarchive?

+
+
+
+
+
+
+
  • About 500GiB indexed text
  • About 100 million HTML documents
  • Raw data: 115.28TiB uncompressed
+ +
+
+
+
+
+
+

Where's the catch?

+
+
+
+
+
+
+
  • Social media is currently too hard to crawl
  • Limited disk space necessitates a size limit per page
    • Example: domain crawl 10MB -> 100MB -> 7GB
  • Limitations of public access
    • This applies to practically every web archive except the Internet Archive
+ +
+
+
+
+
+ +
+
+
+
+
+

How can I access the Austrian Webarchive?

+
+
+
+
+
+
+

Why is access via API useful?

+
+
+
+
+
+
+
  • Individual searches may take up to 1 minute
  • Sift through loads of metadata
  • API-only goodies
    • Easily nominate pages with Austrian content to be saved
    • Download SVG thumbnails of rendered websites
  • It's way more fun
+ +
+
+
+
+
+
+
  • Make Andreas happy :)
+ +
+
+
+
+
+
+

Tracking

+
+
+
+
+
+
+

+

We'd love to properly count the unique visitors in the Webarchive backend, so we kindly ask you to opt in to tracking by instantiating webarchiv.WebarchivSession with the parameter allow_tracking=True.

+

This sends your SHA256-hashed MAC address as a fingerprint to the server on authentication. It is only ever used to count unique users.

+

If you leave allow_tracking at the default value False, an empty string is sent as fingerprint.
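
To make the opt-in concrete, here is a minimal sketch of how such a fingerprint could be derived. It is not the code of the webarchiv module: `make_fingerprint` is a hypothetical helper, and the way the MAC address is encoded before hashing (12 lowercase hex digits) is an assumption.

```python
import hashlib
import uuid


def make_fingerprint(allow_tracking=False):
    """Return a SHA-256 hex digest of the MAC address, or '' without opt-in."""
    if not allow_tracking:
        # Default behaviour described above: an empty string is sent.
        return ""
    # uuid.getnode() returns the hardware address as a 48-bit integer.
    # If no MAC address can be read, Python substitutes a random number.
    mac_hex = f"{uuid.getnode():012x}"  # assumed encoding: 12 hex digits
    return hashlib.sha256(mac_hex.encode("ascii")).hexdigest()


opted_out = make_fingerprint()
opted_in = make_fingerprint(allow_tracking=True)
print(repr(opted_out))
print(opted_in)
```

Because only the digest is sent, the server can count distinct values without receiving the MAC address itself.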

+ +
+
+
+
+
+ + + + + + +