4 - Webarchive.ipynb +119 −15

%% Cell type:markdown id: tags:

# 4 - Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)

%% Cell type:markdown id: tags:

### In this block

* Overview Webarchive
* Overview Content
* Overview API

%% Cell type:markdown id: tags:

* Example: Wayback search via API
* Example: full text search via API
* Example: downloading an SVG preview thumbnail of a saved page

%% Cell type:markdown id: tags:

## Overview Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### What is the Webarchive Austria?

%% Cell type:markdown id: tags:

* An attempt to preserve online data for future generations
* Webarchive Austria has been crawling officially since March 2009
* All domains within `.at`, `.ac.at`, `.gv.at`, `.wien`, `.tirol`
* Selected other domains with 'Austrian content'
* About 2 million websites saved

%% Cell type:markdown id: tags:

### Who is the Webarchive Austria?

%% Cell type:markdown id: tags:

* Andreas Predikaka
* webarchiv@onb.ac.at

%% Cell type:markdown id: tags:

### What can I use?

%% Cell type:markdown id: tags:

* Websites: no public access
  * Access on premises at the ÖNB
  * Exception: onb.ac.at
* Metadata: public access
* Full text search: public access
* Viewing the results of the full text search: no public access

%% Cell type:markdown id: tags:

### What does that mean?

%% Cell type:markdown id: tags:

* Searching from outside the ÖNB gives you URLs, but not the page content

%% Cell type:markdown id: tags:

### What if I really really need to see the content?

%% Cell type:markdown id: tags:

* You can come to the ÖNB in person and use one of two offline computers...
%% Cell type:markdown id: tags:

* ...to `PRINT OUT THE INTERNET!`

%% Cell type:markdown id: tags:

### How is a search where I don't see detailed results useful to me?

%% Cell type:markdown id: tags:

* Sometimes the content is still online
* Sometimes the Internet Archive has a copy
* You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')

%% Cell type:markdown id: tags:

* ???

%% Cell type:markdown id: tags:

## Overview Content

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### What's inside?

%% Cell type:markdown id: tags:

* High crawl frequency (daily or weekly)
  * Media sites (ORF)
  * Political parties
* Low crawl frequency (a few times per year)
  * Topic: Gender
  * Austrian domains (via nic.at)
* Event crawls (daily or weekly within a certain timespan)
  * Elections
  * Olympics
  * Refugee crisis 2015
  * Song Contest 2015

%% Cell type:markdown id: tags:

### Can I have a list?

%% Cell type:markdown id: tags:

* Sure, there you go:
  * Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)
  * Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)
  * All domains: **TODO: ADD LINK**

```json
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
```

%% Cell type:markdown id: tags:

### How big is the Austrian Webarchive?

%% Cell type:markdown id: tags:

* About 500 GiB indexed text
* About 100 million HTML documents
* Raw data: 115.28 TiB uncompressed

%% Cell type:markdown id: tags:

### Where's the catch?
%% Cell type:markdown id: tags:

* Social media is currently too hard to crawl
* Limited disk space necessitates a size limit per page
  * Ex.: domain crawl 10 MB -> 100 MB -> 7 GB
* Limitations of public access
  * Applies to practically every web archive except the Internet Archive

%% Cell type:markdown id: tags:

## Overview API

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

%% Cell type:markdown id: tags:

### How can I access the Austrian Webarchive?

%% Cell type:markdown id: tags:

* On site at the ÖNB
* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)
* Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)
* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))

%% Cell type:markdown id: tags:

### Why is access via API useful?

%% Cell type:markdown id: tags:

* Individual searches may take up to 1 minute
* Sift through loads of metadata
* API-only goodies
  * Easily nominate pages with Austrian content to be saved
  * Download SVG thumbnails of rendered websites
* It's way more fun

%% Cell type:markdown id: tags:

* Make Andreas happy :)
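%% Cell type:markdown id: tags:

As a small taste of "sifting through loads of metadata": the seed lists linked under "Can I have a list?" can be flattened with a few lines of Python. This is only a minimal sketch. It assumes the full file has the same structure as the JSON excerpt shown there; `SAMPLE` and `iter_seeds` are illustrative names and not part of `webarchiv.py`.

%% Cell type:code id: tags:

```python
import json

# Sample shaped like the excerpt of selective.json shown earlier.
# Assumption: the real file follows the same structure throughout.
SAMPLE = """
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {"seeds": ["http://abtreibung.at/"], "group_id": 1,
       "name": "Abtreibung.at"},
      {"seeds": ["http://aep.at"], "group_id": 2,
       "name": "Arbeitskreis Emanzipation Partnerschaft"}
    ]
  }
]
"""


def iter_seeds(collections):
    """Yield (collection name, group name, seed URL) for every seed URL."""
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                yield collection["name"], group["name"], seed


# Flatten the nested collections -> groups -> seeds structure into rows.
rows = list(iter_seeds(json.loads(SAMPLE)))
for collection_name, group_name, seed in rows:
    print(f"{collection_name} | {group_name} | {seed}")
```

To try this on the live list, the sample could be replaced by the parsed content of [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json), for example via `urllib.request.urlopen` and `json.load`, provided the file really matches the excerpt.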