Status and Availability
Arch 1.9.2, based on Apache Nutch 1.9, was released on 18 August 2016. For information on licensing and installation, see the links on the left.
New in Arch version 1.9.2
- The PHP frontend used to put junk-looking content in the query field on results pages when an advanced query was submitted. It now leaves this field empty for advanced queries.
- Shortened the name field (1K instead of 2K) in site DB tables. The longer field produced an index key that was too long for some MySQL configurations.
- Moved to a new version numbering scheme aligned with the Apache Nutch version numbering scheme.
New in Arch version 1.91
- Fixed a bug causing problems in enforcing access permissions.
New in Arch version 1.9
- Added post-parsing pruning.
- Changed the order in which parsers are applied; Tika now comes first.
- Small bug fixes.
New in Arch version 1.9b
- Fixed a Nutch bug that effectively blocked the use of multiple parsers on a document.
- Improved scoring and fetching of dynamic content.
- Ported to Nutch 1.9.
- Improved identification of junk records and IP addresses to ignore when analysing log files.
- Improved (more scalable) identification and removal of duplicated URLs.
- Added removal of gone URLs from the Solr index.
- Small bug fixes.
New in Arch version 1.7
- Ported to Nutch 1.7.
- Added a plugin for using the H2 RDBMS.
- Added the Jetty servlet engine and made it the default Solr (index) server.
- Made per-site configuration folders optional.
- Simplified deployment for simple or small sites. Deploying Arch now takes 15 minutes.
- Small bug fixes.
New in Arch version 1.6
- Fixed bugs found in version 1.6b.
New in Arch version 1.6b
- Ported to Nutch 1.6.
- Added scanning of web pages for threats and vulnerabilities.
- Added reporting of various changes, such as new pages, scripts, and added or removed links.
- Added customizable page pruning before indexing.
- Made output compliant with Web Content Accessibility Guidelines (WCAG) 2.0 Level A.
- Small bug fixes.
New in Arch version 1.43
- Minor changes to faceting and bug fixes.
New in Arch version 1.42
- Added easy-to-customize faceted search.
- Added remote log processing.
- Added an option to delete logs after processing.
- Added an option to switch security OFF (e.g. for debugging).
- Added a DB plugin that uses Apache Derby instead of MySQL.
- Removed jars that were conflicting with Tomcat in some setups.
- Changed the default protocol plugin from protocol-httpclient to protocol-http.
New in Arch version 1.41
- Added a browser-based configuration management module (beta).
- Fixed known bugs.
New in Arch version 1.4
- Improved query precision by tuning Solr and making a couple of modifications to the standard query processing.
New in Arch version 1.4b2
- Various bug fixes.
New in Arch version 1.4b
- Ported to Nutch 1.4, a completely new architecture. See the Arch White Paper.
- Simplified configuration and restructured configuration data.
- Optimized use of RDB connections.
- Fixed many bugs caused by the big move and, very likely, added new ones.
New in Arch version 1.23
- Added compatibility with Windows and Cygwin.
New in Arch version 1.22
- Added a way to use an HTTP request to trigger "hot" re-opening of the index (to switch to the new index after recrawling).
- Fixed a bug in the reuse of old index segments in partial recrawling mode.
New in Arch version 1.21
- Ported to the latest stable release of Nutch 1.2.
- Added indexing of bookmark collections.
- Added email notifications.
- Fixed a bug in automatic switching to the new index, along with a few other minor bugs.
- Changed the default set of parsers, which significantly increased the parsing success rate.
System requirements: A Linux/Unix or Windows 7/Vista OS, Java, Tomcat. See details here.