diff --git a/4 - Webarchive.ipynb b/4 - Webarchive.ipynb index 9e3cbb83294644300c2c1fdacc183fe17ffeb3b5..722322dbfadc07a6f9dd0e95bd6eeef75e50ed5c 100644 --- a/4 - Webarchive.ipynb +++ b/4 - Webarchive.ipynb @@ -10,7 +10,9 @@ "source": [ "# 4 - Webarchive\n", "\n", - "[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)" + "[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)\n", + "\n", + "[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)" ] }, { @@ -21,11 +23,11 @@ } }, "source": [ - "#### In this block\n", + "### In this block\n", "\n", "* Overview Webarchive\n", - "* Overview API\n", - "* Overview Content" + "* Overview Content\n", + "* Overview API" ] }, { @@ -37,7 +39,8 @@ }, "source": [ "* Example Wayback search via API\n", - "* Example full text search via API" + "* Example full text search via API\n", + "* Example download preview SVG thumb of saved page" ] }, { @@ -61,7 +64,7 @@ } }, "source": [ - "#### What is the Webarchive Austria?" + "### What is the Webarchive Austria?" ] }, { @@ -87,7 +90,30 @@ } }, "source": [ - "#### What can I use?" + "### Who is the Webarchive Austria?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Andreas Predikaka\n", + "* webarchiv@onb.ac.at" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### What can I use?" ] }, { @@ -114,7 +140,7 @@ } }, "source": [ - "#### What does that mean?" + "### What does that mean?" ] }, { @@ -136,7 +162,7 @@ } }, "source": [ - "#### What if I really really need to see the content?" + "### What if I really really need to see the content?" 
] }, { @@ -158,7 +184,17 @@ } }, "source": [ - "* ...to `PRINT OUT THE INTERNET!`\n", + "* ...to `PRINT OUT THE INTERNET!`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ "![Office folders labeled 'Internet'](./media/internet-folders.jpg)" ] }, @@ -170,7 +206,7 @@ } }, "source": [ - "#### How is a search where I don't see detailed results useful to me?" + "### How is a search where I don't see detailed results useful to me?" ] }, { @@ -205,22 +241,139 @@ } }, "source": [ - "## Overview API\n", + "## Overview Content\n", "\n", "[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)" ] }, { "cell_type": "markdown", - "metadata": {}, - "source": [] + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### What's inside?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* High crawl frequency (daily or weekly)\n", + " * Media sites (ORF)\n", + " * Political parties\n", + "* Low crawl frequency (a few times per year)\n", + " * Topic: Gender\n", + " * Austrian domains (via nic.at)\n", + "* Event crawls (daily or weekly within a certain timespan)\n", + " * Elections\n", + " * Olympic Games\n", + " * Refugee crisis 2015\n", + " * Eurovision Song Contest 2015" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Can I have a list?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Sure, there you go:\n", + " * Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)\n", + " * Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)\n", + " * All domains: **TODO: ADD LINK**\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": 37,\n", + " \"name\": \"Frau/Gender\",\n", + " \"begin\": \"29.11.2016\",\n", + " \"groups\": [\n", + " {\n", + " \"seeds\": [\n", + " \"http://abtreibung.at/\"\n", + " ],\n", + " \"group_id\": 1,\n", + " \"name\": \"Abtreibung.at\"\n", + " },\n", + " {\n", + " \"seeds\": [\n", + " \"http://aep.at\"\n", + " ],\n", + " \"group_id\": 2,\n", + " \"name\": \"Arbeitskreis Emanzipation Partnerschaft\"\n", + " },```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### How big is the Austrian Webarchive?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* About 500GiB indexed text\n", + "* About 100 million HTML documents\n", + "* Raw data: 115.28TiB uncompressed" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Where's the catch?" 
+ ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Social media is currently too hard to crawl\n", + "* Limited disk space necessitates a size limit per page\n", + " * Example: domain crawl 10 MB -> 100 MB -> 7 GB\n", + "* Public access is restricted\n", + " * True of practically every web archive except the Internet Archive" + ] }, { "cell_type": "markdown", @@ -230,17 +383,74 @@ } }, "source": [ - "## Overview Content\n", + "## Overview API\n", "\n", "[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### How can I access the Austrian Webarchive?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* On site at the ONB (Austrian National Library)\n", + "* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)\n", + "* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)\n", + " * Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)\n", + "* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Why is access via API useful?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Individual searches may take up to 1 minute\n", + "* Sift through loads of metadata\n", + "* API-only goodies\n", + " * Easily nominate pages with Austrian content to be saved\n", + " * Download SVG thumbnails of rendered websites\n", + "* It's way more fun" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Make Andreas happy :)" + ] } ], "metadata": {