Commit 01be7cea authored by Stefan Karner

Add Webarchive 4.2; add user tracking for webarchive

parent 4c0d4123
%% Cell type:markdown id: tags:
# 4 - Webarchive
[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
[https://labs.onb.ac.at/dataset/webarchive/](https://labs.onb.ac.at/dataset/webarchive/)
%% Cell type:markdown id: tags:
### In this block
* Overview Webarchive
* Overview Content
* Overview API
%% Cell type:markdown id: tags:
* Example: Interacting with the API
* Example: Wayback search via API
* Example: Full text search via API
* Example: Download preview SVG thumb of saved page
%% Cell type:markdown id: tags:
## Overview Webarchive
[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
%% Cell type:markdown id: tags:
![ÖNB Webarchive Terminal](https://webarchiv.onb.ac.at/web/20170925041718/https://webarchiv.onb.ac.at/img/webarchiv_terminal1.jpg)
%% Cell type:markdown id: tags:
### What is the Webarchive Austria?
%% Cell type:markdown id: tags:
* Attempt to preserve online data for future generations
* Webarchive Austria has been crawling officially since March 2009
* All domains within `.at`, `.ac.at`, `.gv.at`, `.wien`, `.tirol`
* Selected other domains with 'Austrian content'
* About 2 million websites saved
%% Cell type:markdown id: tags:
### Who is the Webarchive Austria?
%% Cell type:markdown id: tags:
* Andreas Predikaka
* webarchiv@onb.ac.at
%% Cell type:markdown id: tags:
### What can I use?
%% Cell type:markdown id: tags:
* Websites: no public access
  * Access on premises at the ÖNB
  * Exception: onb.ac.at
* Metadata: public access
* Full text search: public access
  * Viewing the results of the full text search: no public access
%% Cell type:markdown id: tags:
### What does that mean?
%% Cell type:markdown id: tags:
* Searching from outside the ÖNB gives you URLs, but not the page content
%% Cell type:markdown id: tags:
### What if I really really need to see the content?
%% Cell type:markdown id: tags:
* You can come to the ÖNB in person and use one of two offline computers...
%% Cell type:markdown id: tags:
* ...to `PRINT OUT THE INTERNET!`
%% Cell type:markdown id: tags:
![Office folders labeled 'Internet'](./media/internet-folders.jpg)
%% Cell type:markdown id: tags:
### How is a search where I don't see detailed results useful to me?
%% Cell type:markdown id: tags:
* Sometimes the content is still online
* Sometimes the Internet Archive has a copy
* You can observe the emergence of certain terms ('Westbalkanroute', 'Soldatna')
%% Cell type:markdown id: tags:
* ???
%% Cell type:markdown id: tags:
## Overview Content
[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
%% Cell type:markdown id: tags:
### What's inside?
%% Cell type:markdown id: tags:
* High crawl frequency (daily or weekly)
  * Media sites (ORF)
  * Political parties
* Low crawl frequency (a few times per year)
  * Topic: Gender
  * Austrian domains (via nic.at)
* Event crawls (daily or weekly within a certain timespan)
  * Elections
  * Olympic Games
  * Refugee crisis 2015
  * Song Contest 2015
%% Cell type:markdown id: tags:
### Can I have a list?
%% Cell type:markdown id: tags:
* Sure, there you go:
* Media, political, gender: [https://webarchiv.onb.ac.at/data/selective.json](https://webarchiv.onb.ac.at/data/selective.json)
* Events: [https://webarchiv.onb.ac.at/data/events.json](https://webarchiv.onb.ac.at/data/events.json)
* All domains: [https://webarchiv.onb.ac.at/data/domainnames.json](https://webarchiv.onb.ac.at/data/domainnames.json)
```json
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {
        "seeds": [
          "http://abtreibung.at/"
        ],
        "group_id": 1,
        "name": "Abtreibung.at"
      },
      {
        "seeds": [
          "http://aep.at"
        ],
        "group_id": 2,
        "name": "Arbeitskreis Emanzipation Partnerschaft"
      },
```
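A minimal sketch of how such a list could be flattened into (collection, group, seed URL) rows. The sample is copied from the excerpt above and closed off with the missing brackets so that it parses; the real `selective.json` may contain further fields, and the helper name `iter_seeds` is made up for illustration.

```python
import json

# Excerpt from above, completed with closing brackets so that it is valid JSON.
SAMPLE = """
[
  {
    "id": 37,
    "name": "Frau/Gender",
    "begin": "29.11.2016",
    "groups": [
      {"seeds": ["http://abtreibung.at/"], "group_id": 1, "name": "Abtreibung.at"},
      {"seeds": ["http://aep.at"], "group_id": 2,
       "name": "Arbeitskreis Emanzipation Partnerschaft"}
    ]
  }
]
"""


def iter_seeds(collections):
    """Yield (collection name, group name, seed URL) for every seed."""
    for collection in collections:
        for group in collection.get("groups", []):
            for seed in group.get("seeds", []):
                yield collection["name"], group["name"], seed


rows = list(iter_seeds(json.loads(SAMPLE)))
for row in rows:
    print(row)
```

To run this against the live list, the string `SAMPLE` would be replaced by the downloaded contents of one of the JSON files linked above.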
%% Cell type:markdown id: tags:
### How big is the Austrian Webarchive?
%% Cell type:markdown id: tags:
* About 500 GiB of indexed text
* About 100 million HTML documents
* Raw data: 115.28 TiB uncompressed
%% Cell type:markdown id: tags:
### Where's the catch?
%% Cell type:markdown id: tags:
* Social media is currently too hard to crawl
* Limited disk space necessitates a size limit per page
  * Example: domain crawl 10 MB -> 100 MB -> 7 GB
* Limitations of public access
  * Practically every webarchive except the Internet Archive
%% Cell type:markdown id: tags:
## Overview API
[https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
%% Cell type:markdown id: tags:
### How can I access the Austrian Webarchive?
%% Cell type:markdown id: tags:
* On site at the ÖNB
* Online: [https://webarchiv.onb.ac.at](https://webarchiv.onb.ac.at)
* REST API: [https://webarchiv.onb.ac.at/api.html](https://webarchiv.onb.ac.at/api.html)
* Swagger definition: [https://webarchiv.onb.ac.at/api/swagger.json](https://webarchiv.onb.ac.at/api/swagger.json)
* Python module for easier access: [https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/blob/master/webarchiv.py) ([raw](https://labs.onb.ac.at/gitlab/labs-team/webarchive-api/raw/master/webarchiv.py?inline=false))
%% Cell type:markdown id: tags:
### Why is access via API useful?
%% Cell type:markdown id: tags:
* Individual searches may take up to 1 minute
* Sift through loads of metadata
* API-only goodies
  * Easily nominate pages with Austrian content to be saved
  * Download SVG thumbnails of rendered websites
* It's way more fun
%% Cell type:markdown id: tags:
* Make Andreas happy :)
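Because a single search may take up to a minute, a script usually has to poll until the result is ready. The following is a generic sketch of that pattern, not the code in `webarchiv.py`: `wait_for_result` and `fetch` are hypothetical names, and the simulated backend exists only to make the example runnable.

```python
import time


def wait_for_result(fetch, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Raises TimeoutError if no result arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('no result after {} seconds'.format(timeout))
        sleep(interval)


# Simulated backend (made-up data): the result is ready on the third poll.
answers = iter([None, None, {'total': 42}])
result = wait_for_result(lambda: next(answers), interval=0.01)
print(result)
```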
%% Cell type:markdown id: tags:
## Tracking
%% Cell type:markdown id: tags:
<img src="https://purepng.com/public/uploads/large/purepng.com-cute-dog-whelpdogdoggycutehoundwhelpbrownbegging-451520332429trejj.png" style="max-height:200px;" />
We'd love to properly count the unique visitors in the Webarchive backend, so we kindly ask you to **opt in to tracking by instantiating `webarchiv.WebarchivSession` with the parameter `allow_tracking=True`**.
This sends your SHA256-hashed MAC address as a fingerprint to the server on authentication. It is only ever used to count unique users.
If you leave `allow_tracking` at the default value `False`, an empty string is sent as fingerprint.
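A fingerprint of this kind could be computed roughly as follows. This is a sketch under assumptions, not the module's actual implementation, which is not shown here: it assumes the MAC address is read with `uuid.getnode()` and hashed as a lowercase hex string, and the function name `make_fingerprint` is hypothetical.

```python
import hashlib
import uuid


def make_fingerprint(allow_tracking=False):
    """Return a SHA-256 hex digest of the MAC address, or '' if tracking is off."""
    if not allow_tracking:
        return ''
    # uuid.getnode() returns the hardware address as a 48-bit integer
    # (it may fall back to a random number if no MAC address can be found).
    mac = '{:012x}'.format(uuid.getnode())
    return hashlib.sha256(mac.encode('utf-8')).hexdigest()


print(repr(make_fingerprint()))
print(make_fingerprint(allow_tracking=True))
```

Since only the digest is produced, the raw MAC address never leaves the function.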
......
@@ -129,14 +129,13 @@ class WebarchivSession:
     def fulltext_search(self, query_string, from_=None, to_=None):
         """
         Start a fulltext search query in the Webarchive.
-        The current status of running queries can be read via status_open_queries().
         :param query_string: String to search for
         :param from_: Optional earliest date bound for the search
             in the format YYYYMM.
         :param to_: Optional latest date bound for the search
             in the format YYYYMM.
-        :return: None
+        :return: HTTP Response object
         """
         params = {'q': query_string}
         if from_:
@@ -152,17 +151,66 @@ class WebarchivSession:
             self._display_http_error(e)
             print('Query for "{}" not added'.format(query_string))
+    def fulltext_search_within_domain(self, query_string, domain, from_=None, to_=None):
+        """
+        Start a fulltext seed search query in the Webarchive.
+        :param query_string: String to search for
+        :param domain: Search only within this domain name
+        :param from_: Optional earliest date bound for the search
+            in the format YYYYMM.
+        :param to_: Optional latest date bound for the search
+            in the format YYYYMM.
+        :return: HTTP Response object
+        """
+        params = {'q': query_string, 'g': domain}
+        if from_:
+            params['from'] = from_
+        if to_:
+            params['to'] = to_
+        try:
+            response = self._get(op='/search/fulltext/seed', params=params)
+            return self.waitForResponse(response)
+        except HTTPError as e:
+            self._display_http_error(e)
+    def fulltext_search_within_url(self, query_string, url, pagesize=10, from_=None, to_=None):
+        """
+        Start a fulltext capture search query in the Webarchive.
+        :param query_string: String to search for
+        :param url: Search only captures starting at this exact web address
+        :param from_: Optional earliest date bound for the search
+            in the format YYYYMM.
+        :param to_: Optional latest date bound for the search
+            in the format YYYYMM.
+        :return: HTTP Response object
+        """
+        params = {'q': query_string, 'g': url, 'pagesize': pagesize}
+        if from_:
+            params['from'] = from_
+        if to_:
+            params['to'] = to_
+        try:
+            response = self._get(op='/search/fulltext/capture', params=params)
+            return self.waitForResponse(response)
+        except HTTPError as e:
+            self._display_http_error(e)
     def wayback_search(self, query_string, from_=None, to_=None):
         """
         Start a wayback search query in the Webarchive.
-        The current status of running queries can be read via status_open_queries().
         :param query_string: String to search for
         :param from_: Optional earliest date bound for the search
             in the format YYYYMM.
         :param to_: Optional latest date bound for the search
             in the format YYYYMM.
-        :return: None
+        :return: HTTP Response object
         """
         params = {'q': query_string}
         if from_:
@@ -213,7 +261,6 @@ class WebarchivSession:
     def domain_name_search(self, query_string, page_=1, pagesize_=100):
         """
         Start a domain name search in the Webarchive.
-        The current status of running queries can be read via status_open_queries().
         :param query_string: String to search for
         :param page_: The page number parameter works with the page size parameter to control the offset of the records returned in the results. Default value is 1
@@ -237,7 +284,6 @@ class WebarchivSession:
     def histogram_search(self, query_string, interval_=3, from_=None, to_=None):
         """
         Start a domain name search in the Webarchive.
-        The current status of running queries can be read via status_open_queries().
         :param query_string: String to search for
         :param page_: The page number parameter works with the page size parameter to control the offset of the records returned in the results. Default value is 1
......
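The two new methods `fulltext_search_within_domain` and `fulltext_search_within_url` build almost identical requests: they differ only in the endpoint and in whether `g` carries a domain name or a URL. The sketch below reproduces that parameter logic outside the class so it can be tried without a session. The helper names `yyyymm` and `build_scoped_request` are hypothetical, and the query, domain and dates are example values only.

```python
from datetime import date

# Endpoints as they appear in the two new methods.
ENDPOINTS = {
    'domain': '/search/fulltext/seed',
    'url': '/search/fulltext/capture',
}


def yyyymm(d):
    """Format a date as YYYYMM, the format expected for from_ and to_."""
    return d.strftime('%Y%m')


def build_scoped_request(query_string, scope, target,
                         from_=None, to_=None, pagesize=None):
    """Return (op, params) for a full-text search limited to a domain or URL."""
    op = ENDPOINTS[scope]
    params = {'q': query_string, 'g': target}
    if pagesize is not None:
        params['pagesize'] = pagesize
    if from_:
        params['from'] = from_
    if to_:
        params['to'] = to_
    return op, params


op, params = build_scoped_request(
    'Westbalkanroute', 'domain', 'orf.at',
    from_=yyyymm(date(2015, 1, 1)), to_=yyyymm(date(2016, 12, 31)))
print(op, params)
```

Keeping the endpoint choice in a lookup table makes the shared structure of the two methods explicit; in the module itself each method calls `self._get` with its own hard-coded `op`.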