CSIRO Arch Intranet Search Engine Home
What is CSIRO Arch?
CSIRO Arch is an open source free extension of Apache Nutch, a popular general purpose search engine that is capable of indexing billions of web pages using clusters of computers. Arch uses Nutch software and adds additional features to provide a powerful and efficient search engine that is optimized for use in corporate web environments. Such environments typically have one or more web sites, with web content provided for external readers and internal use, and one or more "intranet" sites that provide content for internal use only. Arch can be used to search both the external access and restricted access sites and produces extremely high quality search results.
Corporate Search: Can We Just Get Google?
Corporate web environments are a challenging area for modern search engines. Whilst they may include multiple web sites and millions of pages, compared to the global Web they are much smaller and this makes them easier to index. However, the smaller scale of corporate environments and the more restricted access to information also make it harder to estimate the relative importance of documents found on corporate web sites. The search methods used to search the global Web generally do not work well on a smaller scale and this leads to frustration for companies who often find that searches on their intranets are of limited use.
Arch has been specifically designed to provide very high quality searches for intranet web environments. Arch makes use of web server logs and other information that is available within an organization, but not available to external search engines, to provide excellent search results. It is robust and easy for a webmaster to install and maintain, and is extremely efficient at providing relevant and up-to-date information.
Read more in this article about Arch...
Arch Features
- Excellent search quality: Arch has solved the problem of providing good search results for enterprise web sites and intranets!
- Up to date information: Arch is very efficient at updating indexes and this ensures that the search results are up to date and relevant. Unlike most search engines, no complete 'recrawls' are done. The indexes can be updated daily, with new pages discovered automatically.
- Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.
- "Setup and forget": Arch can be installed by one person with limited webmaster experience. Once it has been installed, it requires little effort to maintain.
- 24/7 availability: Arch uses two indexes so that the search engine always has a working index. All crawling is done to a new index and switching to the new index happens only when the crawling has been completed.
- Document level security: Arch is easy to configure and can be set up so that some parts of the index are restricted to particular users. Users can find only the documents that they have permission to see, which is a must in an enterprise environment.
- Customization: Arch can be customized to specific requirements using either Java or PHP.
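The two-index approach described under "24/7 availability" can be illustrated with a minimal Java sketch. This is not Arch's actual code: the class TwoIndexSearcher, the Crawler interface and the in-memory map standing in for an index are all hypothetical names chosen for illustration. The sketch only shows the general idea that crawling writes to a separate index and that the live index is replaced only after the crawl has completed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a two-index scheme; not part of Arch or Nutch.
class TwoIndexSearcher {

    // Hypothetical crawler hook: fills the given index with term -> documents.
    interface Crawler {
        void crawlInto(Map<String, List<String>> target) throws Exception;
    }

    // The index currently serving queries. It is replaced in one atomic step.
    private final AtomicReference<Map<String, List<String>>> live;

    TwoIndexSearcher(Map<String, List<String>> initialIndex) {
        this.live = new AtomicReference<>(Map.copyOf(initialIndex));
    }

    // Queries always read whichever index is live at the moment of the call.
    List<String> search(String term) {
        return live.get().getOrDefault(term, List.of());
    }

    // Crawl into a separate index and publish it only if the crawl completes.
    // Returns true if the new index went live, false if the old one was kept.
    boolean recrawl(Crawler crawler) {
        Map<String, List<String>> fresh = new HashMap<>();
        try {
            crawler.crawlInto(fresh);
        } catch (Exception e) {
            // Incomplete crawl: discard the partial index, keep serving the old one.
            return false;
        }
        live.set(Map.copyOf(fresh));
        return true;
    }
}
```

In this sketch a search issued during a crawl still sees the old index, and a crawl that fails part-way leaves the old index in place, so there is always a working index to answer queries.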
Read more in Arch White Paper...
Status and Availability
Arch 1.4b2, based on Apache Nutch 1.4, was released on 17 January 2012. For information on licensing and installation, see the links on the left. This beta version has no known issues except those inherited from Nutch and Solr. It is reasonably stable, but more testing and tuning are required. Solr is very flexible, and it is very likely that better average search precision can be achieved.
New in Arch version 1.4b2
- Various bug fixes. See ARCH-README.txt file for details.
New in Arch version 1.4b
- Ported to Nutch 1.4 - a completely new architecture. See Arch White Paper.
- Simplified configuration and restructured configuration data.
- Optimized use of RDB connections.
- Fixed many bugs caused by the big move and, very likely, added new ones.
New in Arch version 1.23
- Added compatibility with Windows and Cygwin.
New in Arch version 1.22
- Added a way to use an HTTP request to trigger "hot" re-opening of the index (to switch to the new index after recrawling).
- Fixed a bug in the reuse of old index segments in partial recrawling mode.
New in Arch version 1.21
- Ported to the latest stable release of Nutch 1.2.
- Added indexing of bookmark collections.
- Added email notifications.
- Fixed a bug in automatic switching to new index, and a few other minor bugs.
- Changed the default set of used parsers. This significantly increased parsing success rate.
System requirements: A Linux/Unix or Windows 7/Vista OS, Java, Tomcat. See details here.
Download search engine software...



