# 4 - Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)

[https://labs.onb.ac.at/gitlab/labs-team/webarchive-api](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api)

### In this block

* Overview Webarchive
* Overview Content
* Overview API
* Example: Interacting with the API
* Example: Wayback search via API
* Example: Full text search via API
* Example: Download preview SVG thumb of saved page

## Overview Webarchive

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

### What is the Webarchive Austria?

* An attempt to conserve online data for future generations
* Webarchive Austria has been crawling officially since March 2009
  * All domains within `.at`, `.ac.at`, `.gv.at`, `.wien`, `.tirol`
  * Selected other domains with 'Austrian content'
* About 2 million websites saved

### Who is the Webarchive Austria?

* Andreas Predikaka
* webarchiv@onb.ac.at

### What can I use?

* Websites: no public access
  * Access on premises at the ÖNB
  * Exception: onb.ac.at
* Metadata: public access
* Full text search: public access
  * Viewing the results of the full text search: no public access

### What does that mean?

* Searching from outside the ÖNB gives you URLs, but not the page content

### What if I really really need to see the content?

* You can come to the ÖNB in person and use one of two offline computers...
* ...to `PRINT OUT THE INTERNET!`

### How is a search where I don't see detailed results useful to me?

* Sometimes the content is still online
* Sometimes the Internet Archive has a copy
* You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
* ???

## Overview Content

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

### What's inside?

* High crawl frequency (daily or weekly)
  * Media sites (ORF)
  * Political parties
* Low crawl frequency (a few times per year)
  * Topic: Gender
  * Austrian domains (via nic.at)
* Event crawls (daily or weekly within a certain timespan)
  * Elections
  * Olympic Games
  * Refugee crisis 2015
  * Song Contest 2015

### Can I have a list?

* Sure, there you go:
  * Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)
  * Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)
  * All domains: [https://webarchiv.onb.ac.at/data/domainnames.json](https://webarchiv.onb.ac.at/data/domainnames.json)

```json
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
```

### How big is the Austrian Webarchive?

* About 500 GiB of indexed text
* About 100 million HTML documents
* Raw data: 115.28 TiB uncompressed

### Where's the catch?
* Social media is currently too hard to crawl
* Limited disk space necessitates a size limit per page
  * Example: domain crawl 10 MB -> 100 MB -> 7 GB
* Limitations of public access
  * Practically every webarchive except the Internet Archive

## Overview API

[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)

### How can I access the Austrian Webarchive?

* On site at the ÖNB
* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)
  * Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)
* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))

### Why is access via API useful?

* Individual searches may take up to 1 minute
* Sift through loads of metadata
* API-only goodies
  * Easily nominate pages with Austrian content to be saved
  * Download SVG thumbnails of rendered websites
* It's way more fun
* Make Andreas happy :)

## Tracking

<img src="https://purepng.com/public/uploads/large/purepng.com-cute-dog-whelpdogdoggycutehoundwhelpbrownbegging-451520332429trejj.png" style="max-height:200px;" />

We'd love to properly count the unique visitors in the Webarchive backend, so we kindly ask you to **opt in to tracking by instantiating `webarchiv.WebarchivSession` with the parameter `allow_tracking=True`**. This sends your SHA256-hashed MAC address as a fingerprint to the server on authentication.
It is only ever used to count unique users. If you leave `allow_tracking` at the default value `False`, an empty string is sent as the fingerprint.
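
To make the fingerprint idea concrete, here is a minimal sketch of how a SHA256-hashed MAC address could be computed with the Python standard library. It is not taken from `webarchiv.py`: the helper name `mac_fingerprint` is hypothetical, and the choice to hash the MAC as a 12-digit lowercase hex string is an assumption, so the digest produced here may differ from what the module actually sends.

```python
import hashlib
import uuid


def mac_fingerprint(allow_tracking: bool = False) -> str:
    """Hypothetical helper: return a SHA-256 hex digest of the MAC address.

    Returns an empty string when tracking is not allowed, mirroring the
    behaviour described for the default `allow_tracking=False`.
    """
    if not allow_tracking:
        return ""
    # uuid.getnode() returns the hardware address as a 48-bit integer.
    mac = uuid.getnode()
    # Assumption: the MAC is serialised as 12 lowercase hex digits before hashing.
    mac_hex = f"{mac:012x}"
    return hashlib.sha256(mac_hex.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(repr(mac_fingerprint()))                   # empty string, tracking off
    print(len(mac_fingerprint(allow_tracking=True)))  # length of the hex digest
```

Only the digest would leave the machine in such a scheme, not the MAC address itself. Note that `uuid.getnode()` falls back to a random 48-bit number if no hardware address can be determined, in which case the fingerprint is only stable within one Python process.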