CSIRO Arch Intranet Search Engine Home
What is CSIRO Arch?
CSIRO Arch is a free, open source enterprise search engine based on Apache Nutch, a
popular general purpose search engine that can index billions of web pages using clusters of computers.
Arch builds on Nutch and Solr and adds features that make it
a powerful and efficient search engine optimized for corporate web environments.
See the Arch White Paper for more information.
Three very good reasons to choose Arch
Arch is an open source, free software package. It is very scalable, able to service intranets of any size, and offers a set
of features that are normally available only in expensive commercial products. See the list of features below.
The quality of search results from intranet search engines is a known problem. Not for Arch! On intranets, Arch achieves the
level of search quality that keeps Google users happy on the global Web. Too good to be true? The secret is revealed in the article
"Corporate Search: Can We Just Get Google?"
It takes 15 minutes to get Arch going. Customization is also very easy and can be done in Java or PHP – whichever is easier
for you. If you can do this:
$ tar -xzf arch-1.9b-src.tar.gz
$ cd arch-1.9b
$ cd ArchHome/bin
$ vi arch    # insert some seed URLs into the Arch crawling script
then you can install Arch. Read more in the article
"An Enterprise Search Engine in 15 minutes?"
Excellent search quality: Arch has solved the problem of providing good search results for enterprise web sites and intranets!
Up to date information: Arch updates its indexes very efficiently, which keeps the search results up to date and
relevant. Unlike most search engines, Arch does not perform complete 'recrawls'. The indexes can be updated daily, with new pages discovered automatically.
Watch mode: In this mode, Arch periodically checks your web server logs and automatically adds any new links it finds to the index.
Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.
High scalability: Based on Nutch and Hadoop, Arch can run on clusters of computers and index billions of pages.
"Setup and forget": Arch can be installed by one person with limited webmaster experience. Once it has been installed,
it requires little effort to maintain.
24/7 availability: Once installed, Arch is always available.
Document level security: Arch is easy to configure and can be set up so that some content is restricted to particular users.
Users can find only the documents that they have permission to see, which is a must in an enterprise environment.
Detection and reporting: Arch will detect and report vulnerabilities, threats and changes in your site.
Clean, high quality index: Arch lets you clean your pages before indexing, removing common fragments that should not be
indexed, such as headers, footers, menus and advertisements.
Faceted search: Arch provides faceted searches "out of the box".
Customization: Arch can be customized to specific requirements using either Java or PHP.
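To illustrate the idea behind the watch mode described in the feature list, here is a minimal sketch in Python. It is not Arch's own code: the function name, the host intranet.example.org and the sample log lines are all hypothetical, and it assumes the web server writes its access log in the Common Log Format. It scans the log for successful GET requests and reports any URL that is not yet in a set of indexed URLs.

```python
import re
from urllib.parse import urljoin

# Matches the request and status fields of a Common Log Format line (assumed format).
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def new_urls_from_log(lines, base_url, indexed):
    """Return URLs seen in the log that are not yet indexed, in first-seen order."""
    found = []
    seen = set(indexed)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not look like access log entries
        if match.group("method") != "GET" or match.group("status") != "200":
            continue  # only pages that were actually served are candidates
        url = urljoin(base_url, match.group("path"))
        if url not in seen:
            seen.add(url)
            found.append(url)
    return found


# Hypothetical sample data, for illustration only.
sample_log = [
    '10.0.0.5 - - [12/Mar/2012:10:01:22 +1100] "GET /staff/index.html HTTP/1.1" 200 5120',
    '10.0.0.7 - - [12/Mar/2012:10:02:03 +1100] "GET /projects/new-report.html HTTP/1.1" 200 20480',
    '10.0.0.7 - - [12/Mar/2012:10:02:09 +1100] "GET /missing.html HTTP/1.1" 404 312',
    '10.0.0.9 - - [12/Mar/2012:10:03:40 +1100] "POST /forms/submit HTTP/1.1" 200 128',
]
already_indexed = {"http://intranet.example.org/staff/index.html"}

candidates = new_urls_from_log(sample_log, "http://intranet.example.org", already_indexed)
print(candidates)
```

In this sketch only the new report page is returned: the staff page is already indexed, the 404 response is ignored, and the POST request is skipped. A real deployment would also need to handle log rotation and check URLs against the crawl scope before adding them.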