This page contains information about web archiving terminology, how the process works and its limitations, plus an exciting opportunity to help improve the Parliamentary Web Archive.
How web archiving works
Parliament works in partnership with the Internet Memory Foundation (IMF), formerly known as the European Archive, to ensure our websites are archived.
The IMF carries out the web harvest on our behalf, capturing the content and hosting a presentation copy for us, so that we can provide online access to the collection via our own website.
The IMF uses a web harvesting tool called Heritrix, which works in a very similar way to the robots used by search engines such as Google to index web pages. It crawls across the web, following every link it finds and capturing the content.
Given a seed URL (usually the homepage), Heritrix should capture every page that is linked to within a particular site.
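In outline, this kind of crawl works like a breadth-first search over links: start from the seed URL, capture each page, and queue up the links found on it, skipping anything outside the site. The sketch below is a simplified illustration of that idea, not Heritrix itself; the site structure, URLs and `crawl` function are hypothetical, and a real crawler also handles things like robots.txt, politeness delays and duplicate content detection.

```python
from collections import deque

def crawl(site, seed):
    """Breadth-first crawl from a seed URL.

    `site` maps each internal URL to the list of links found on that
    page. URLs not present in `site` (e.g. external links) are skipped,
    mirroring a crawl scoped to a single website.
    """
    captured = set()
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in captured or url not in site:
            continue  # already captured, or outside the site's scope
        captured.add(url)          # "capture" the page
        queue.extend(site[url])    # follow every link on it
    return captured

# Hypothetical site: the homepage links to two pages; one page
# carries an external link, which is out of scope and not followed.
site = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/news/item1", "http://example.org/external"],
    "/news/item1": [],
}
print(sorted(crawl(site, "/")))
```

Every page reachable by links from the seed is captured exactly once; the external URL never enters the archive, which is why such links break in archived snapshots.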
Web archiving terminology
- Web archive – a collection of previous versions of websites which can be viewed
- Live web(site) – the current version of a website which is online today
- Snapshot – a previous version of a website which has been archived and is available for viewing
- Crawling or harvesting – the way a snapshot is created and a website is archived
- Internal links – links on a webpage which lead to other pages on the same website
- External links – links on a webpage which lead to pages on other websites
- Seed URL – the main website address that the harvesting tool uses as the basis for archiving, usually the homepage of the site
- Dynamic functionality – parts of a website which are only generated in response to an action by the user (e.g. typing in a search term)
Limitations to what can be captured
External links will not be captured, since they fall outside the scope of the collection. Clicking an external link in an archived snapshot will therefore lead to an error message.
Search functionality will not work in archived snapshots. Users will be able to browse the archived sites by following links, but not to search for content.
Other content that is dynamically generated by querying a database underlying a web page is also unlikely to be captured, because the harvesting tools cannot replicate this kind of user interaction.
Other web archive collections
Many other organisations, both national and international, carry out web archiving. A few collections are listed below:
There are also many organisations collecting large-scale web archives which are not yet available for online access.