Arch Configuration Directives

Table of Contents

Glossary

Access Control
Arch
Arch Configuration
Arch Index
Bookmarks Area
Context
Crawling Depth
Crawling Iteration
Crawling Script
Default
Description
Front-end
Front-end Profile
Index Area
Index Site
Interim Index Site
Log Processing
Loglinks Area
Parallel Indexing
Sequential Indexing
Syntax
Watch Mode

Configuration Directives

Admin.ip.addresses Directive
Allowed.ip.addresses Directive
Allowed.areas Directive
Allowed.groups Directive
Allowed.sites Directive
Allowed.users Directive
Area Directive
Auth.groups.file Directive
Auth.passwords.file Directive
Authentication.scheme Directive
Blocked.ip.addresses Directive
Capture.interval Directive
CRAWLING_DEPTH Directive
CRAWLING_MAX_URLS Directive
CRAWLING_SEED Directive
CRAWLING_THREADS Directive
Database Directive
Db.driver Directive
Default.areas Directive
Default.groups Directive
Default.sites Directive
Default.users Directive
Delete.logs Directive
Depth Directive
Depth.loglinks Directive
Enabled.area Directive
Exclude.area Directive
Exclude.loglinks Directive
Facet Directive
File.bookmarks Directive
Frontend.profile Directive
Groupsread.bookmarks Directive
Hits.threshold Directive
Log.format Directive
Log.length Directive
Log.repository Directive
Logs Directive
Ignore.in.logs Directive
Include.area Directive
Interval.area Directive
Ip.filter Directive
Max.hits.norm Directive
Max.hits.day Directive
Max.hits.ip.day Directive
Max.ip.cache Directive
Max.score Directive
Max.url.length Directive
Max.urls Directive
Max.urls.area Directive
Mail.host Directive
Mail.level Directive
Mail.password Directive
Mail.transport.protocol Directive
Mail.recipient Directive
Mail.subject Directive
Mail.user Directive
Merged.retention Directive
Parallel.indexing Directive
Permissions Directive
Prune.content.types Directive
Prune.content.types.after Directive
Prune.file.types Directive
Prune.file.types.after Directive
Remove.duplicates Directive
Root.area Directive
Security.enabled Directive
Scan.alert Directive
Scan.alert.level Directive
Scan.enabled Directive
Scan.content.types Directive
Scan.file.types Directive
Scan.ignore.bits Directive
Scan.ignore.bits.after Directive
Scan.ignore.links Directive
Scan.ignore.scripts Directive
Scan.min.script.size Directive
Scan.report.changed.forms Directive
Scan.report.changed.pages Directive
Scan.report.changed.scripts Directive
Scan.report.link.changes Directive
Scan.report.new.forms Directive
Scan.report.new.pages Directive
Scan.report.new.scripts Directive
Scan.script.edges Directive
Scan.src.content.types Directive
Scan.src.file.types Directive
Sitemap.url Directive
Solr.url Directive
Target.db Directive
Temp.dir Directive
Threads Directive
Usersread.bookmarks Directive
Watch.mode Directive

Glossary

The Glossary explains Arch key concepts. They are presented in alphabetical order, but, if you are new to Arch, it is recommended that you read them starting with Arch and following the "Next" links.

Access Control

In the Arch index, every URL is indexed together with a list of the names of users and groups that are allowed to see it. This information is placed in configuration files in the form of permissions directives, copied by Arch to an SQL database before crawling starts, and added to the URL data by the Arch indexing plugin before the data are sent to Solr for indexing.

When a search request comes to Solr, the authorisation plugin adds user information to it, i.e. the user’s name(s) and the names of the groups the user belongs to. These names are used to filter the search results, reducing them to the set the user is allowed to see.
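
As a rough illustration of the filtering step, the sketch below reduces a result list to the entries a user may see. It is not Arch or Solr code: the entry layout (the "users" and "groups" keys) and the function name are assumptions made for this example only.

```python
def visible_results(results, user_names, group_names):
    """Keep only results that list one of the user's names or groups."""
    users = set(user_names)
    groups = set(group_names)
    return [
        doc for doc in results
        if users & set(doc["users"]) or groups & set(doc["groups"])
    ]

# Hypothetical index entries carrying the permissions stored at indexing time.
results = [
    {"url": "http://www.example.com/public.html", "users": [], "groups": ["public", "staff"]},
    {"url": "http://www.example.com/hr.html", "users": ["alice"], "groups": ["hr"]},
]

for doc in visible_results(results, ["bob"], ["staff"]):
    print(doc["url"])  # only public.html is printed
```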

See also

Arch

Arch is an extension of Apache Nutch, designed for efficient and effective indexing and search of organisational web sites (intranets). Arch benefits from the power and flexibility of Nutch and adds a number of features that are essential in corporate environments, such as access control.

Arch includes the components necessary to index web sites and to search the indexed content. Apache Solr is used as the Arch search server.

See also

Arch Configuration

By default, Arch configuration is located in the arch_home/conf folder and its subfolders. The exception is the configuration of the included Jetty server, which is located in a subfolder of the Jetty home folder, arch_home/jetty. Arch users very rarely have to change the Jetty configuration.

As Nutch and Solr are components of Arch, Arch includes and is controlled by their configuration files as well.

Arch's own configuration consists of the root configuration file, which contains settings shared by all indexed web sites, and individual site configuration files, which define parameters specific to a particular site. Site parameters can override, for that site, the parameters defined in the root configuration file.

The most important configuration locations:

  • arch_home/conf folder contains Solr and Nutch configurations, and a folder with Arch configuration;
  • arch_home/conf/arch folder contains the config.txt file, which is the Arch root configuration file, and subfolders with site configuration files, each named after the site it defines;
  • Arch crawling script arch_home/bin/arch has a few parameters that can be used for quick deployment or a trial run, and that also affect the JVM working environment, such as the amount of available RAM.

See also

Arch Index

Arch index is a collection of indexed information related to pages that Arch crawler has retrieved and parsed. Text content is extracted from these pages and submitted to Solr for indexing along with metadata, such as access permissions for each page.

See also

Bookmarks Area

Bookmarks is a special area that allows adding to the index third party URLs that do not belong to any of the crawled index sites. Such URLs are provided in text files, one URL per line. The crawling depth of bookmarks area is always 1.

See also

Context

This indicates where in Arch configuration the directive makes a difference. You can place directive-like lines in Arch configuration files, but they will be ignored if Arch does not expect them there.

See also

Crawling Depth

Crawling depth in Nutch and Arch terminology means the number of crawling iterations, which is a bit confusing. However, it would be fair to say that increasing the number of iterations may increase crawling depth in the common/intuitive sense of the term.

Let's consider a hypothetical situation where we have a site consisting of 10001 pages, and the root page has 10000 links to the other pages. This site has a depth of 2 (or 1 if we don't count the root page). However, if we set Arch crawling seed to the root page, crawling depth (i.e. number of iterations) to 2 and max.urls to 5000, we will not index all URLs. On the first iteration, the root (seed) page will be fetched. The 10000 URLs will be extracted from it and 5000 of them selected to be fetched on the second iteration. So, Arch will index only 5001 of the 10001 pages in two iterations.

A seemingly trivial solution would be to increase the max.urls parameter to a very large number, but the time cost of each iteration grows non-linearly with max.urls, so this is not an acceptable solution.
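
The arithmetic of the hypothetical 10001-page site can be checked with a small simulation. This is only a sketch of the iteration model described above, not Arch or Nutch code: the function name is made up, and URLs are selected in discovery order, whereas Nutch ranks candidate URLs by score.

```python
from collections import deque

def simulate_crawl(links, seed, depth, max_urls):
    """Return the set of pages fetched after `depth` crawling iterations.

    links: dict mapping a page to the list of pages it links to.
    On each iteration at most `max_urls` known, not yet fetched URLs
    are selected and fetched, and their links become known URLs.
    """
    fetched = set()
    known = deque([seed])   # discovered but not yet fetched
    seen = {seed}
    for _ in range(depth):
        if not known:
            break           # no unprocessed known URLs left
        batch = [known.popleft() for _ in range(min(max_urls, len(known)))]
        for page in batch:
            fetched.add(page)
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    known.append(target)
    return fetched

# The site from the text: a root page linking to 10000 other pages.
site = {"root": [f"page{i}" for i in range(10000)]}

print(len(simulate_crawl(site, "root", depth=2, max_urls=5000)))  # 5001
print(len(simulate_crawl(site, "root", depth=3, max_urls=5000)))  # 10001
```

In this model a third iteration is enough to reach the remaining 5000 pages, which shows why adding iterations can be a cheaper remedy than a very large max.urls.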

See also

Crawling Iteration

Crawling iterations are a part of the crawling process. On each iteration, Arch selects a number of known but not yet accessed URLs to fetch. This number is limited by the max.urls configuration parameter (Nutch topN). Arch then fetches the selected URLs, parses them, extracts new links and indexable text content, saves these to a database and moves to the next iteration.

This process finishes when there are no unprocessed known URLs left or the number of iterations has reached the configured maximal crawling depth.

See also

Crawling Script

Arch crawling script is located in arch_home/bin folder and is called arch. It can be used for a quick start/evaluation of Arch, as well as for normal day-to-day use.

When started, among other things, it checks if there is another instance of Arch running, and exits if it finds evidence of it. Therefore, it is safe to schedule cron to start it every day, even if some runs may take longer than a day.

The crawling script starts the Arch indexer, which processes newly available logs if configured to do so, then finds all areas that are due to be indexed, and indexes them.

See also

Description

A brief description of the purpose of the directive.

See also

Default

If the directive has a default value (i.e., if you omit it from your configuration entirely, Arch will behave as though you set it to a particular value), it is described here. If there is no default value, this section should say "none".

Note that the default listed here is not necessarily the same as the value the directive takes in the default configuration files distributed with Arch.

See also

Front-end

A front-end is a gateway, most often written in PHP, that uses the Arch Solr server to search the index. Arch includes the source code of a PHP-based front-end that requires minimal customisation and can be easily deployed by IT personnel familiar with PHP. Thus, Arch can be customised by people familiar with Java, using Nutch and Solr plugins and configuration files, or by people familiar with PHP, using PHP front-ends.

Front-ends can be used to authorise access to Arch index and/or limit or filter search for other reasons, for example, to increase search effectiveness and relevance of results. They can do it by changing/augmenting queries that searchers send through them.

See also

Front-end Profile

Front-end access to the Arch index is configured using the frontend.profile directive. This directive contains the front-end name and password and, optionally, the names of users and groups that this front-end is authorised to represent. If these names are omitted, the front-end is not limited to any particular users or groups. If these names are present and the front-end sends a name that is not on the list with a request, that name is ignored.

Search results accessible via a front-end are limited to those that are visible to the users and/or groups on the list that the front-end sends with the request. This list of names must be a subset of the names listed in the frontend.profile directive.

A front-end can have access to all sites contents in the index, if its profile directive is placed in the root configuration file. If such a directive is placed in a site configuration file, the search via this front-end is limited to this site entries only.
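
The handling of names sent by a front-end can be pictured as a set intersection. The sketch below is an illustration of the rule stated above, not Arch code; the function name and the sample names are hypothetical.

```python
def effective_names(requested, profile_names):
    """Return the names from a request that the front-end may represent.

    profile_names: names listed in the frontend.profile directive.
    An empty list means the front-end is not limited to particular names.
    """
    if not profile_names:
        return set(requested)
    # Names that are not on the profile list are ignored.
    return set(requested) & set(profile_names)

print(sorted(effective_names(["alice", "mallory"], ["alice", "staff"])))  # ['alice']
print(sorted(effective_names(["bob"], [])))                               # ['bob']
```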

See also

Index Area

Index area is a subset of index site, both in terms of configuration and index contents. Index area is a unit of crawling and search in Arch. It is possible to crawl only a particular index area of a site, ignoring all other areas. It is also possible to search only a particular index area of a site, ignoring all other areas. This allows filtering search results for relevance or access control reasons.

Each index site can have as many index areas as needed, and they can be crawled with different frequencies.

Index area is defined by its name and a set of roots, include and exclude URL prefixes. A URL belongs to an area if it is a root of this area or it matches at least one of the include prefixes and does not match any of the exclude prefixes. A URL can belong to more than one area.
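
The membership rule can be written down in a few lines. This is a minimal sketch of the rule as stated above, not the Arch implementation; the function name and the root URL are assumptions, and the prefixes are taken from the exclude.area example in this document.

```python
def belongs_to_area(url, roots, includes, excludes):
    """Apply the area membership rule to a single URL."""
    if url in roots:
        return True  # a root always belongs to its area
    included = any(url.startswith(prefix) for prefix in includes)
    excluded = any(url.startswith(prefix) for prefix in excludes)
    return included and not excluded

roots = ["http://www.mycom.com/internal/index.html"]      # hypothetical root
includes = ["http://www.mycom.com/internal"]
excludes = ["http://www.mycom.com/internal/management"]

print(belongs_to_area("http://www.mycom.com/internal/news.html", roots, includes, excludes))        # True
print(belongs_to_area("http://www.mycom.com/internal/management/a.html", roots, includes, excludes))  # False
print(belongs_to_area("http://www.mycom.com/public/index.html", roots, includes, excludes))          # False
```

Because each area is tested independently, the same URL can pass this test for several areas, which matches the statement that a URL can belong to more than one area.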

See also

Index Site

Index site is a unit of configuration in Arch. Each particular web site can have related configuration file(s), web server logs and a set of indexed pages and metadata. Normally, each indexed web site has its own configuration file. These files are optional, however. In simple cases when no advanced options are necessary and sites are defined by seed URLs inserted in Arch crawling script, Arch creates interim index site configurations "on the fly".

See also

Interim Index Site

In simple cases, for example when Arch is used for trial, if no advanced options are necessary and sites are defined by seed URLs inserted in Arch crawling script, Arch creates interim index site configurations "on the fly". When site configuration files are created, they override these interim configurations.

See also

Log Processing

Arch can use web server logs to improve quality of indexing and search:

  • Based on statistics of document access extracted from logs, Arch identifies important (popular) documents and ranks them higher in search results.
  • Logs allow finding isolated documents that don't have a chain of links leading to them from crawling seed. These documents would not be found by conventional crawling algorithms based on following links to discover new pages.
  • An instance of Arch can be configured to work in watch mode, when it checks server logs for new links every few minutes, indexing detected new pages automatically and almost instantaneously.

See also

Loglinks Area

Loglinks area is a special area used to configure crawling of URLs found in web server logs and not reached by "normal" crawling process. Disabling loglinks area in site or root configuration disables log processing and use of log URLs.

See also

Parallel Indexing

In parallel indexing mode, Arch uses as its crawling seed the roots of all areas that are due to be indexed, except bookmarks. It also adds to the seed set all URLs found in logs that belong to these areas. It then crawls all these URLs in parallel, performing a number of iterations up to the maximum defined in the root configuration file or the Arch crawling script.

As all known URLs are crawled in parallel, and Nutch ensures that pages are accessed only once, there is no risk of creating a high number of duplicates, as there is when crawling loglinks areas in sequential mode.

Please also note that a combination of depth and max.urls that is sufficient to crawl areas one by one in sequential mode may be insufficient in parallel mode, because in this mode all URLs (except bookmarks) are crawled together. Their number is usually considerably higher than the number of URLs in the largest area.

See also

Sequential Indexing

In sequential indexing mode, Arch indexes all areas one by one. Area-specific configuration parameters, such as crawling depth and number of concurrent threads, override parameters defined in the root configuration file. For each area, Arch uses its configured roots as the crawling seed and then performs the number of crawling iterations configured as the crawling depth for this area.

This mode is most convenient when troubleshooting indexing. If Arch encounters a problem when indexing an area, it can be interrupted, configuration changed to address the problem, and after restarting, Arch will continue indexing, starting from the area where it was interrupted.

Arch processes the loglinks area of each site after it has processed all other areas (except bookmarks). When processing loglinks areas, Arch uses as seeds the URLs that have been found in logs but were not accessed when crawling other areas. Setting the depth of loglinks area crawling to 1 will index only these pages, but not other potentially unindexed URLs that these pages may link to. On the other hand, setting the crawling depth for loglinks areas to a high value will result in re-crawling pages that have been crawled before. It is recommended to set this depth to 2, which seems to be an acceptable compromise.

See also

Syntax

This indicates the format of the directive as it would appear in a configuration file. Syntax is extremely directive-specific, and is described in detail in the directive's definition. Generally, the directive name is followed by a "=" character and then by a series of one or more arguments separated by spaces, commas or other characters. Optional arguments are enclosed in square brackets. Where an argument can take on more than one possible value, the possible values are enclosed in braces and separated by commas. Directives which can take a variable number of arguments end in "...", indicating that the last argument is repeated.

See also

Watch Mode

An instance of Arch can be used to monitor web server log files and, if it finds new links in them, extract these links and fetch and index the pages they point to. This allows almost immediate indexing of new pages, without the need to submit them manually. Working in watch mode is not expensive; however, it requires a dedicated instance of Arch, because the same instance cannot work in watch mode and perform normal indexing operations at the same time.

Directives List

Admin.ip.addresses Directive

Description:Optional list of privileged IP addresses.
Syntax:admin.ip.addresses=subnet mask [subnet mask...]
Default:off
Context:Root configuration file

A list of IP addresses from which admin access to the Solr server is allowed. Clients from these addresses have full unfiltered access to the Solr server and are able to update and delete its contents. Note that if Arch works in a cluster configuration, the addresses of all computers in the cluster have to be on this list. Leave this parameter commented out to allow admin access from any IP address.

A general piece of advice: for trial runs, turn off security measures so that they do not cause problems. Otherwise you may spend a long time looking for the cause of a problem only to find out that, for example, you have indexed everything correctly, but your queries do not return anything just because your access permissions are too strict.

Please note that regular expressions depend on whether IPv4 or IPv6 is used. The examples below are valid for IPv4.

Example:

admin.ip.addresses = ^127\.0\.0\.1
      

See also

Allowed.ip.addresses Directive

Description:Optional white list of privileged IP addresses.
Syntax:allowed.ip.addresses=subnet mask [subnet mask...]
Default:off
Context:Root configuration file

Optional white list of privileged IP addresses. Requests coming from these addresses are let through, unless authentication is explicitly requested.

allowed.users, allowed.groups, allowed.sites and allowed.areas are automatically assigned to unauthenticated requests from IP addresses on this list.

Please note that regular expressions depend on whether IPv4 or IPv6 is used. The examples below are valid for IPv4.

Example:

allowed.ip.addresses = ^130\.155\.17[6-9]\..+ ^130\.155\.18[6-9]\..+
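
To check what such patterns match before deploying them, they can be tried out in a few lines of Python. This is only a sketch: the classification order (black list first, then white list) and the use of a search-style match are assumptions, and Arch itself evaluates the patterns with its own regular expression engine. The patterns are the examples given for allowed.ip.addresses and blocked.ip.addresses in this document.

```python
import re

ALLOWED = [r"^130\.155\.17[6-9]\..+", r"^130\.155\.18[6-9]\..+"]
BLOCKED = [r"^130\.155\.201\.106"]

def matches_any(ip, patterns):
    """True if the address matches at least one pattern."""
    return any(re.search(pattern, ip) for pattern in patterns)

def classify(ip):
    """Assumed order: the black list is checked before the white list."""
    if matches_any(ip, BLOCKED):
        return "blocked"
    if matches_any(ip, ALLOWED):
        return "allowed"
    return "default"  # neither list matches

print(classify("130.155.177.20"))   # allowed
print(classify("130.155.180.1"))    # default
print(classify("130.155.201.106"))  # blocked
```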
      

See also

Allowed.areas Directive

Description:Optional list of areas visible to unauthenticated requests coming from IP addresses on the white list.
Syntax:allowed.areas=area name [area name...]
Default:off
Context:Root configuration file

Optional list of areas visible to unauthenticated requests coming from IP addresses on the white list.

Use all or omit this directive to make all areas visible.

Example:

allowed.areas = all
      

See also

Allowed.groups Directive

Description:Optional list of group names assigned to unauthenticated requests coming from IP addresses on the white list.
Syntax:allowed.groups=group name [group name...]
Default:off
Context:Root configuration file

Optional list of group names assigned to unauthenticated requests coming from IP addresses on the white list.

Use all or omit this directive to assign all group names.

Example:

allowed.groups = staff public
      

See also

Allowed.sites Directive

Description:Optional list of sites visible to unauthenticated requests coming from IP addresses on the white list.
Syntax:allowed.sites=site name [site name...]
Default:off
Context:Root configuration file

Optional list of sites visible to unauthenticated requests coming from IP addresses on the white list.

Use all or omit this directive to make all sites visible.

Example:

allowed.sites = all
      

See also

Allowed.users Directive

Description:Optional list of user names assigned to unauthenticated requests coming from IP addresses on the white list.
Syntax:allowed.users=user name [user name...]
Default:off
Context:Root configuration file

Optional list of user names assigned to unauthenticated requests coming from IP addresses on the white list.

Use "all" or omit this directive to assign all user names.

Example:

allowed.users = staff guest
      

See also

Area Directive

Description:Declares a site area name.
Syntax:area=area name
Default:none
Context:Site configuration files

Area name. Must be alphanumeric, up to 30 characters long. A site can have a practically unlimited number of areas.

Use this name to label other area configuration parameters.

Example:

area = main
      

See also

Auth.groups.file Directive

Description:Location of groups file.
Syntax:auth.groups.file=file name
Default:none
Context:Root configuration file

Location of user groups file - a parameter required by Arch reference authentication plugin implementing Apache file based authentication scheme. You can replace this plugin by a plugin implementing authentication scheme used in your organisation.

Example:

auth.groups.file = /opt/arch-1.4/conf/arch/testGroups.txt
      

See also

Auth.passwords.file Directive

Description:Location of passwords file.
Syntax:auth.passwords.file=file name
Default:none
Context:Root configuration file

Location of passwords file - a parameter required by Arch reference authentication plugin implementing Apache file based authentication scheme. You can replace this plugin by a plugin implementing authentication scheme used in your organisation.

Example:

auth.passwords.file = /opt/arch-1.4/conf/arch/testPasswords.txt
      

See also

Authentication.scheme Directive

Description:Sets either global or site specific authentication scheme
Syntax:authentication.scheme=scheme name
Default:none
Context:Root configuration file, site configuration file

This parameter helps to find the proper authentication plugin to use for authentication in the JSP interface. Arch searches for a plugin that has an attribute scheme (declared in plugin.xml) matching the authentication.scheme parameter. If such a plugin is found, it is used for authentication.

Note: unless parameter domain is sent with request, all authentication related parameters are taken from the root configuration file and must be declared there. If domain is sent in request, it must match name of a site. All authentication related parameters will be taken from that site configuration file and access will be limited to that site data only.

Example:

authentication.scheme = file
      

See also

Blocked.ip.addresses Directive

Description:Optional black list of IP addresses.
Syntax:blocked.ip.addresses=subnet mask [subnet mask...]
Default:off
Context:Root configuration file

Optional black list of IP addresses. Requests coming from these addresses are rejected.

Please note that regular expressions depend on whether IPv4 or IPv6 is used. The examples below are valid for IPv4.

Example:

blocked.ip.addresses = ^130\.155\.201\.106
      

See also

Capture.interval Directive

Description:Time interval in seconds for which access history is retained in IP filtering.
Syntax:capture.interval=number
Default:300
Context:Root configuration file, site configuration files

This interval is used when identifying IP addresses of automatic clients to ignore. If more hits than hits.threshold have come from an IP address in a time interval shorter than capture.interval, this IP address is considered to belong to a robot and is put on a list to ignore.

Example:

capture.interval = 300
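
One way to picture the interplay of capture.interval and hits.threshold is a sliding window per IP address. The class below is a sketch of that idea under stated assumptions, not the Arch implementation: the class and method names are made up, timestamps are plain seconds, and the threshold value used in the usage lines is arbitrary.

```python
from collections import defaultdict, deque

class RobotDetector:
    """Flags IP addresses that send too many hits within a time window."""

    def __init__(self, hits_threshold, capture_interval=300):
        self.hits_threshold = hits_threshold
        self.capture_interval = capture_interval
        self.history = defaultdict(deque)  # ip -> recent hit timestamps
        self.ignored = set()               # addresses treated as robots

    def record_hit(self, ip, timestamp):
        """Record one hit; return True if `ip` is on the ignore list."""
        if ip in self.ignored:
            return True
        hits = self.history[ip]
        hits.append(timestamp)
        # Forget hits that are older than the retained interval.
        while hits and timestamp - hits[0] >= self.capture_interval:
            hits.popleft()
        if len(hits) > self.hits_threshold:
            self.ignored.add(ip)
            del self.history[ip]
            return True
        return False

detector = RobotDetector(hits_threshold=3, capture_interval=300)
for second in (0, 10, 20, 30):
    flagged = detector.record_hit("192.0.2.7", second)
print(flagged)  # True: four hits arrived within 300 seconds
```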
      

See also

CRAWLING_DEPTH Directive

Description:Defines number of crawling iterations.
Syntax:CRAWLING_DEPTH=number of iterations
Default:30, except for loglinks and bookmarks areas where it is 1.
Context:Arch crawling script

Defines how many crawling iterations to do. After deploying Arch, it is recommended to do a trial crawl with a shallow depth, e.g. 2. If everything works, run the bin/clean script, set the depth to a desired value and do a production crawl.

If this directive is present, it overrides directive depth in root configuration. It also overrides directive depth in area configuration if crawling is being performed in parallel mode.

Example:

CRAWLING_DEPTH=10
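
The precedence between CRAWLING_DEPTH, the root depth directive and a per-area depth directive can be summarised as a small function. This is one reading of the rules described in this document rather than Arch code. In particular, letting the area depth win over CRAWLING_DEPTH in sequential mode is an inference from the statement that the script overrides area depth in parallel mode. The sketch covers regular areas only, not loglinks or bookmarks.

```python
DEFAULT_DEPTH = 30  # documented default for regular areas

def effective_depth(parallel, script_depth=None, root_depth=None, area_depth=None):
    """Pick the crawling depth for a regular area (None means 'not set')."""
    if parallel:
        # Area depth is ignored in parallel mode; the script beats the root file.
        candidates = (script_depth, root_depth)
    else:
        # Assumed: in sequential mode the area setting takes precedence.
        candidates = (area_depth, script_depth, root_depth)
    for value in candidates:
        if value is not None:
            return value
    return DEFAULT_DEPTH

print(effective_depth(parallel=True, script_depth=10, root_depth=30, area_depth=5))   # 10
print(effective_depth(parallel=False, script_depth=10, root_depth=30, area_depth=5))  # 5
print(effective_depth(parallel=False, root_depth=20))                                 # 20
print(effective_depth(parallel=True))                                                 # 30
```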
      

See also

CRAWLING_MAX_URLS Directive

Description:Maximal number of URLs to fetch at each crawling iteration.
Syntax:CRAWLING_MAX_URLS=number
Default:10000
Context:Arch crawling script

Max number of urls to fetch at each crawling iteration. This is passed to Nutch as the topN parameter.

The total number of indexed URLs is limited to the crawling depth (number of crawling iterations) multiplied by max urls.

Example:

CRAWLING_MAX_URLS=10000
      

See also

CRAWLING_SEED Directive

Description:Defines seed URLs for crawling
Syntax:CRAWLING_SEED="{URL, file name} [ | {URL, file name}... ]"
Default:none
Context:Arch crawling script

This directive provides seed URLs for crawling in simple cases or when Arch is being evaluated. It makes it possible to avoid creating site configuration files: an interim site configuration is automatically created for each domain found in the URLs. If these domains overlap with domains of existing (explicitly configured) sites, an error is reported and processing stops.

If a file name is used in the directive, the file must contain seed URLs, one URL per line.

Example:

CRAWLING_SEED="http://www.example.com/index.html |\
               /var/urlsToCrawl.txt |\
               http://www.other.com" 
      

Please note the quotes and line continuation characters (\).
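
To make the structure of the value concrete, the sketch below splits a CRAWLING_SEED string into URLs and file names and lists the domains for which interim site configurations would be needed. It is an illustration only: the function names are hypothetical, and treating anything without an http or https scheme as a file name is an assumption, not documented Arch behaviour.

```python
from urllib.parse import urlparse

def parse_crawling_seed(value):
    """Split a CRAWLING_SEED value into (urls, file_names)."""
    urls, files = [], []
    for item in value.split("|"):
        item = item.strip()
        if not item:
            continue
        if urlparse(item).scheme in ("http", "https"):
            urls.append(item)
        else:
            files.append(item)  # assumed to be a file with one URL per line
    return urls, files

def seed_domains(urls):
    """Domains found in the seed URLs, one interim site per domain."""
    return sorted({urlparse(url).netloc for url in urls})

# The value from the example, as the shell sees it after joining the lines.
seed = "http://www.example.com/index.html | /var/urlsToCrawl.txt | http://www.other.com"
urls, files = parse_crawling_seed(seed)
print(urls)                # ['http://www.example.com/index.html', 'http://www.other.com']
print(files)               # ['/var/urlsToCrawl.txt']
print(seed_domains(urls))  # ['www.example.com', 'www.other.com']
```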

See also

CRAWLING_THREADS Directive

Description:Defines number of parallel threads to use for crawling.
Syntax:CRAWLING_THREADS=threads number
Default:10
Context:Arch crawling script

This directive is equivalent to Nutch fetcher.threads.fetch configuration property. It is the number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes as fetcher has one map task per node.

If this directive is present, it overrides directive threads in root configuration. It also overrides directive threads in area configuration if crawling is being performed in parallel indexing mode.

Example:

CRAWLING_THREADS=10
      

See also

Database Directive

Description:Defines the default database to use.
Syntax:database=database type
Default:H2
Context:Root configuration file, site configuration files

Defines the default database to use. Arch comes with plugins for H2, Derby and MySQL databases.

Site configurations can override this directive and store site related data in a different database.

Example:

database = H2
      

See also

Db.driver Directive

Description:Defines the db driver class.
Syntax:db.driver=db driver class
Default:org.h2.Driver
Context:Root configuration file, site configuration files

Defines the default database driver class to use. Arch comes with drivers for H2, Derby and MySQL databases.

If site configurations override the database directive, they should override the db.driver directive as well.

Example:

db.driver = org.h2.Driver
      

See also

Default.areas Directive

Description:Optional list of areas visible to unauthenticated requests coming from IP addresses not matching white or black list.
Syntax:default.areas=area name [area name...]
Default:off
Context:Root configuration file

Optional list of areas visible to unauthenticated requests coming from IP addresses not matching white or black list.

Use all or omit this directive to make all areas visible.

Example:

default.areas = all
      

See also

Default.groups Directive

Description:Optional list of group names assigned to unauthenticated requests coming from IP addresses not matching white or black list.
Syntax:default.groups=group name [group name...]
Default:off
Context:Root configuration file

Optional list of group names assigned to unauthenticated requests coming from IP addresses not matching white or black list.

Use all or omit this directive to assign all group names.

Example:

default.groups = public
      

See also

Default.sites Directive

Description:Optional list of sites visible to unauthenticated requests coming from IP addresses not matching white or black list.
Syntax:default.sites=site name [site name...]
Default:off
Context:Root configuration file

Optional list of sites visible to unauthenticated requests coming from IP addresses not matching white or black list.

Use all or omit this directive to make all sites visible.

Example:

default.sites = all
      

See also

Default.users Directive

Description:Optional list of user names assigned to unauthenticated requests coming from IP addresses not matching white or black list.
Syntax:default.users=user name [user name...]
Default:off
Context:Root configuration file

Optional list of user names assigned to unauthenticated requests coming from IP addresses not matching white or black list.

Use all or omit this directive to assign all user names.

Example:

default.users = guest
      

See also

Delete.logs Directive

Description:Switches deletion of log files after log processing on or off.
Syntax:delete.logs={on,off}
Default:off
Context:Root configuration file, site configuration files

Delete log files after processing. Switching this on is convenient for setting up automatic log processing. All you have to do is keep copying the latest log files to the location where Arch expects them. Arch will find them, process them and delete them.

Example:

delete.logs = off
      

Depth Directive

Description:Defines default number of crawling iterations for all areas, except loglinks and bookmarks
Syntax:depth=number of iterations
Default:30
Context:Root configuration file

Default crawling depth. Defines how many crawling iterations to do by default when indexing an area. Each area can override this parameter if it is indexed sequentially. After deploying Arch, it is recommended to do a trial crawl with a shallow depth, e.g. 2. If everything works, run the bin/clean script, set the depth to a desired value and do a production crawl.

Example:

depth = 30
      

See also

Depth.area Directive

Description:Defines number of crawling iterations for this area.
Syntax:depth.area name=number of iterations
Default:30
Context:Site configuration files

Defines how many crawling iterations to do when indexing this area. This directive overrides the global depth directive only when sequential crawling is being done. It is ignored when areas are crawled in parallel.

After deploying Arch, it is recommended to do a trial crawl with shallow depth, e.g. 2. If everything works, reset the index, set the depth to a desired value and do a production crawl.

Example:

depth.documentation = 30
      

See also

Depth.loglinks Directive

Description:Defines number of crawling iterations for log links
Syntax:depth.loglinks=number of iterations
Default:1
Context:Root configuration file, site loglink area configuration

Default crawling depth for links found in log files. Only effective when sequential indexing is being done. The value defined in the root configuration file can be overridden in the loglinks area configurations of individual sites.

In sequential indexing, after indexing all areas of a site, links that are found in the site logs, but have not been crawled yet, are used as starting points for another round of crawling. Setting this parameter higher than 2 is not recommended, because too many URLs that have already been crawled in previous stages will then be re-fetched, creating duplicate entries in the index. If your site has many isolated areas which are discovered only via log links, the parallel indexing mode is the recommended option. In parallel indexing, all pre-configured crawling seeds (roots) together with all links found in logs are used as starting points of crawling, with no risk of creating many duplicates, because they are crawled together and Nutch makes sure that no link is fetched twice.

Example:

depth.loglinks = 1
      

See also

Enabled.area Directive

Description:Enables or disables a site area.
Syntax:enabled.area name={on,off}
Default:on
Context:Site configuration files

Enables or disables a site area. If this directive is off, all area related directives are ignored.

Example:

enabled.main = on
      

See also

Exclude.area Directive

Description:Defines a prefix of URLs excluded from the area.
Syntax:exclude.area name=URL prefix
Default:none
Context:Site configuration files

Defines a prefix of URLs excluded from the area. A URL is included in the area if it is a root (seed) of the area, or if it matches (starts with) at least one inclusion and does not match any exclusion.

Example:

include.internal = http://www.mycom.com/internal
exclude.internal = http://www.mycom.com/internal/management
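
The membership rule above can be sketched in a few lines of Python. This is only an illustration of the rule as worded in this section, not Arch's actual code; the function name in_area and its arguments are hypothetical, and it assumes that a root always belongs to the area regardless of exclusions.

```python
def in_area(url, roots, includes, excludes):
    """Return True if url belongs to the area under the rule described above."""
    # Assumption: a root (seed) is always part of the area.
    if url in roots:
        return True
    # Prefix matching: a URL matches if it starts with the configured prefix.
    included = any(url.startswith(prefix) for prefix in includes)
    excluded = any(url.startswith(prefix) for prefix in excludes)
    return included and not excluded


includes = ["http://www.mycom.com/internal"]
excludes = ["http://www.mycom.com/internal/management"]
print(in_area("http://www.mycom.com/internal/docs/a.html", [], includes, excludes))
print(in_area("http://www.mycom.com/internal/management/b.html", [], includes, excludes))
```

With the example configuration, the first URL is in the area and the second is not, because it starts with the excluded prefix.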
      

See also

Exclude.loglinks Directive

Description:Defines a prefix of path of URLs ignored in logs.
Syntax:exclude.loglinks=path prefix
Default:none
Context:Site configuration files

Defines a prefix of path of URLs ignored when processing logs of the site. Note that URLs are also ignored if they are rejected by filters in Nutch configuration.

Note that in this directive path prefixes are used instead of URL prefixes because the scope of this directive is limited to one site.

Example:

exclude.loglinks = /images/
exclude.loglinks = /css/
exclude.loglinks = /temp/
      

See also

Facet Directive

Description:Switches On or Off faceting of sites, areas and formats.
Syntax:facet.{sites,areas,formats}={on,off}
Default:facet.sites=on, facet.areas=off, facet.formats=on
Context:Root configuration file, site configuration files

Controls faceted search. Faceting directives set for sites override root configuration if parameter domain is used in the request.

Another method to override faceting directives is to add facet=true and other Solr faceting parameters to the request. If Arch finds a facet field in the request, it ignores the configuration parameters. Set facet.sites and facet.areas to on if you have more than one site and area.

Example:

facet.sites=on
      

See also

File.bookmarks Directive

Description:Provides a file containing bookmark URLs.
Syntax:file.bookmarks=path to file
Default:none
Context:Site configuration files

Bookmark files contain bookmark URLs to index. There can be several bookmark files used in one site. Bookmark URLs can point to third party sites.

Example:

file.bookmarks = /opt/arch/conf/arch/sites/mySite/bookmarks1.txt
file.bookmarks = /opt/arch/conf/arch/sites/mySite/bookmarks2.txt
      

See also

Frontend.profile Directive

Description:Front-end id, password, sites and areas that the front-end is allowed to search and users and groups that are allowed to do search via this front-end
Syntax:frontend.profile=id password [| site name ... [| area name ... [| group name ... [| user name ... ]]]]
Default:none
Context:Root configuration file

Front-end search profile parameter defines front-end id, password, sites and areas that the front-end is allowed to search and users and groups that are allowed to do search via this front-end. The fields are separated by |. The required parameters are the id and password. The rest can be left blank.

Note: the frontend.profile parameter is expected by the Arch reference authentication plugin. It is very likely that you will want to replace it with a plugin implementing the authentication method used in your organization. Your plugin may use a different set of configuration parameters.

Note: unless parameter domain is sent with request, front-end authentication related parameters are taken from the root configuration file and must be declared there. If domain is sent in request, it must match a site name. In this case all authentication related parameters are taken from that site configuration file and search is limited to that site data only.

Please note that regular expressions depend on whether IPv4 or IPv6 is used. The examples below are valid for IPv4.

Example:

frontend.profile = global | pass1
      

See also

Groupsread.bookmarks Directive

Description:Lists names of groups allowed to see indexed bookmarks.
Syntax:groupsread.bookmarks=group name[ group name ...]
Default:public
Context:Site configuration files

Lists names of groups allowed to see indexed bookmarks.

Example:

groupsread.bookmarks = staff public
      

See also

Hits.threshold Directive

Description:Max reasonable number of hits an IP address may generate within a capture interval.
Syntax:hits.threshold=number
Default:30
Context:Root configuration file, site configuration files

If an IP address generates more than this number of hits within a capture interval, it is considered to belong to a robot and is ignored from then on.

In the example below, an IP address is blocked if there are more than 30 accesses to text documents in a 5-minute interval.

Example:

capture.interval = 300
hits.threshold = 30
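
As a rough illustration of the rule above, the sketch below flags IP addresses that exceed the threshold within one capture interval. It is not Arch's implementation: the function name find_robot_ips, the (ip, timestamp) record shape and the use of fixed, non-overlapping windows are all assumptions made for the example.

```python
from collections import Counter


def find_robot_ips(records, capture_interval=300, hits_threshold=30):
    """Return IPs with more than hits_threshold hits in any capture interval.

    records is an iterable of (ip, unix_timestamp) pairs (hypothetical shape).
    """
    counts = Counter()
    for ip, timestamp in records:
        # Assumption: intervals are fixed windows of capture_interval seconds.
        window = int(timestamp) // capture_interval
        counts[(ip, window)] += 1
    return {ip for (ip, _window), hits in counts.items() if hits > hits_threshold}


records = [("10.0.0.1", 1000 + i) for i in range(31)]
records += [("10.0.0.2", 1000 + i) for i in range(30)]
print(find_robot_ips(records))
```

Here 10.0.0.1 makes 31 hits in one window and is flagged, while 10.0.0.2 makes exactly 30 and is not, since the rule says "more than".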
      

See also

Ignore.in.logs Directive

Description:Ignore log records of requests for certain file types.

This directive is deprecated and ignored.

Ignored URLs are now defined by filters in Nutch configuration, such as the regex URL filter and the suffix URL filter.

Arch is an extension of Nutch, and most Nutch configuration options are effective in Arch.

See also

Include.area Directive

Description:Defines a prefix of URLs included in the area.
Syntax:include.area name=URL prefix
Default:none
Context:Site configuration files

Defines a prefix of URLs included in the area. A URL is included in the area if it is a root (seed) of the area, or if it matches (starts with) at least one inclusion and does not match any exclusion.

Example:

include.internal = http://www.mycom.com/internal
      

See also

Interval.area Directive

Description:Defines area re-indexing interval.
Syntax:interval.area name=NN [,sun][,mon][,tue][,wed][,thu][,fri][,sat]
Default:0
Context:Site configuration files

Re-indexing interval (in days), optionally followed by the weekdays on which re-indexing is allowed. The example below re-indexes the area approximately every 20 days, but only on weekends. The default value is 0, which means that the area will be re-crawled each time Arch is started.

Example:

interval.documentation = 20, sat, sun
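
The following sketch shows one way to read such a value and decide whether re-indexing is due. It is an interpretation for illustration only: the names parse_interval and reindex_due are hypothetical, and the decision rule (enough days elapsed, and today is an allowed weekday if any are listed) is an assumption rather than a statement of how Arch schedules crawls.

```python
from datetime import date

# Index matches date.weekday(), where Monday is 0.
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def parse_interval(value):
    """Split 'NN, sat, sun' into (days, set of allowed weekday names)."""
    parts = [p.strip().lower() for p in value.split(",") if p.strip()]
    days = int(parts[0])
    allowed = set(parts[1:])
    unknown = allowed - set(WEEKDAYS)
    if unknown:
        raise ValueError(f"unknown weekday(s): {sorted(unknown)}")
    return days, allowed


def reindex_due(value, last_indexed, today):
    """Assumed rule: interval elapsed and today allowed (if weekdays are listed)."""
    days, allowed = parse_interval(value)
    if allowed and WEEKDAYS[today.weekday()] not in allowed:
        return False
    return (today - last_indexed).days >= days


last = date(2024, 1, 1)
print(reindex_due("20, sat, sun", last, date(2024, 1, 21)))
```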
      

See also

Ip.filter Directive

Description:Switches On or Off IP filtering when log processing.
Syntax:ip.filter={on,off}
Default:on
Context:Root configuration file, site configuration files

Switches on or off attempts to identify and ignore log records caused by search engines and other non-human clients based on IP statistics. All IP based filtering can be turned off by setting this parameter to off. Robot accesses can still be filtered out based on the client type, if this information is in the logs and robots do not masquerade as browsers.

Example:

ip.filter = on
      

See also

Log.format Directive

Description:Defines plugin to use for log parsing.
Syntax:log.format=format type
Default:combined
Context:Root configuration file, site configuration files

Value in a site configuration file, if provided, overrides value in the root configuration file.

This parameter must match the format attribute (defined in the plugin.xml file) of a log parser plugin that is able to process logs of this type. The default Arch log parser works with logs in combined format.

Example:

log.format = combined
      

Log.length Directive

Description:Defines effective log length, in days.
Syntax:log.length=number
Default:365
Context:Root configuration file, site configuration files

Defines effective log length, in days. If log records are available for a longer period than defined by this directive, the older records are ignored. E.g. if we have logs for the last 10 years, but the value of log.length is 365, only the records of the latest 365 days are used to compute document scores.

Example:

log.length = 365
      

Log.repository Directive

Description:Defines location for pre-processed log files.
Syntax:log.repository=path [| file_mask [ | file mask...]]
Default:none
Context:Site configuration files

Sometimes logs need merging before they can be processed correctly, for example, if a site has both HTTP and HTTPS access. If the log.repository option is provided, logs will be read from the log directories (as specified by the logs option), merged and placed into the first directory specified by the log.repository option. After that, they will be used for "normal" processing. Note that log masks are used exactly as they are used in the logs option. In fact, if no logs option is provided and the default logs directory is empty or absent, the log.repository option will be used instead, if present, and no log pre-processing is done.

Example:

log.repository = file:///var/logs/www.atnf.csiro.au/ | ^merged_log_2019-.+ 
      

See also

Logs Directive

Description:Defines source location of log files.
Syntax:logs=path [| file_mask [ | file mask...]]
Default:none
Context:Site configuration files

Location of log files. Can be a local or remote directory with a number of file masks defined by regular expressions. If no file masks are provided, all files match. Files found in this directory must be web server log files of this site. Multiple locations can be used simultaneously.

Example:

logs = file:///var/logs/www.atnf.csiro.au/ | ^latest.+ | ^access\.2019-.+ 
logs = sftp://arch:mypass@myhost:22/var/log/www.atnf.csiro.au/ | ^access\.log-2019.+
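
To illustrate how a value of this form could be interpreted, the sketch below splits it into a location and a list of regular-expression masks, then filters file names. It is a simplified reading, not Arch's parser: parse_logs_value and select_files are hypothetical names, the masks are applied with Python's re.search (Arch itself runs on Java, whose regex semantics may differ), and splitting on the pipe character means a mask containing a literal | would not survive.

```python
import re


def parse_logs_value(value):
    """Split 'location | mask | mask ...' into (location, compiled masks)."""
    fields = [field.strip() for field in value.split("|")]
    location = fields[0]
    masks = [re.compile(mask) for mask in fields[1:] if mask]
    return location, masks


def select_files(names, masks):
    """Keep names matching at least one mask; with no masks, all files match."""
    if not masks:
        return list(names)
    return [name for name in names if any(m.search(name) for m in masks)]


location, masks = parse_logs_value(
    r"file:///var/logs/www.atnf.csiro.au/ | ^latest.+ | ^access\.2019-.+"
)
names = ["latest.log", "access.2019-01.gz", "access.2018-12.gz", "error.log"]
print(location, select_files(names, masks))
```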

      

See also

Max.hits.norm Directive

Description:Number of hits from IP to URL per day beyond which the IP is placed on the list to ignore.
Syntax:max.hits.norm=number
Default:1000
Context:Root configuration file, site configuration files

Number of hits from IP to URL per day beyond which the IP is placed on the list to ignore.

It should be used carefully because some pages, e.g. home pages, may be requested often in normal use. The default number of 1000 practically disables use of this parameter.

Example:

max.hits.norm = 1000
      

See also

Max.hits.day Directive

Description:Max number of accesses to a page per day that counts.
Syntax:max.hits.day=number
Default:5000
Context:Root configuration file, site configuration files

Max number of accesses to a page per day that counts, the rest are ignored.

This parameter should be set to a value that is the highest reasonable estimate of accesses to a page per day, excluding accesses by automatic clients.

Example:

max.hits.day = 5000
      

See also

Max.hits.ip.day Directive

Description:Max number of accesses to a page per day from a single IP address that counts.
Syntax:max.hits.ip.day=number
Default:5
Context:Root configuration file, site configuration files

Max number of accesses to a page from a single IP per day that counts, the rest are ignored.

Example:

max.hits.ip.day = 5
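
The sketch below shows how the two caps, max.hits.ip.day and max.hits.day, might combine when counting accesses. The order of application (per-IP cap first, then the per-page daily cap) and the record shape are assumptions made for the example; the function name counted_hits is hypothetical and this is not Arch's scoring code.

```python
from collections import Counter


def counted_hits(records, max_hits_ip_day=5, max_hits_day=5000):
    """Return {(day, url): hits that count} after applying both caps.

    records is an iterable of (day, url, ip) tuples (hypothetical shape).
    """
    raw = Counter(records)  # raw hits per (day, url, ip)
    totals = Counter()
    for (day, url, _ip), hits in raw.items():
        # Hits beyond the per-IP daily cap are ignored.
        totals[(day, url)] += min(hits, max_hits_ip_day)
    # Hits beyond the per-page daily cap are ignored as well.
    return {key: min(hits, max_hits_day) for key, hits in totals.items()}


records = [("2019-03-01", "/index.html", "10.0.0.1")] * 8
records += [("2019-03-01", "/index.html", "10.0.0.2")] * 2
print(counted_hits(records))
```

In this run the first address contributes 5 of its 8 hits and the second contributes 2, so 7 hits count for the page on that day.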
      

See also

Max.ip.cache Directive

Description:Max size of the in-memory IP address information cache.
Syntax:max.ip.cache=number
Default:100000
Context:Root configuration file

Max size of the in-memory IP address information cache. This cache is used when computing the list of ignored IP addresses, with the aim of counting only accesses generated by human readers and excluding accesses generated by robots. A bigger cache speeds up generation of the ignored IP list, but requires more memory.

Example:

max.ip.cache = 100000
      

See also

Max.score Directive

Description:Max document weight value to use for final weight normalisation in the DB.
Syntax:max.score=number
Default:5
Context:Root configuration file

Max document weight value that is stored with a document in Solr (Lucene) index. This parameter can be changed to tune the effect of document weights derived from logs on Solr results ranking. In our experiments, 5 was the optimal value.

Example:

max.score = 5
      

See also

Max.url.length Directive

Description:Maximal length beyond which log URLs are ignored.
Syntax:max.url.length=number
Default:300
Context:Root configuration file

Overly long URLs extracted from logs are often a sign of hacker activity and should be ignored together with the IP addresses that generated them. Use this parameter to ignore URLs longer than its value. Set to -1 to not limit the length.

Example:

max.url.length = 300
      

Max.urls Directive

Description:Default maximal number of URLs to fetch at each crawling iteration.
Syntax:max.urls=number
Default:10000
Context:Root configuration file

Default max number of urls to fetch at each crawling iteration. This is passed to Nutch as the topN parameter.

The total number of indexed URLs is limited to the crawling depth (number of crawling iterations) multiplied by max.urls.

If present in Arch crawling script, CRAWLING_MAX_URLS directive overrides this directive.

Example:

max.urls = 10000
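
As a worked example of the limit stated above, the upper bound on indexed URLs is simply the product of the two settings. The helper name max_indexed_urls is hypothetical and the snippet only restates that arithmetic.

```python
def max_indexed_urls(depth, max_urls):
    """Upper bound on indexed URLs: crawling iterations times URLs per iteration."""
    return depth * max_urls


# With the documented defaults depth = 30 and max.urls = 10000.
print(max_indexed_urls(30, 10000))  # prints 300000
```

A trial crawl with depth = 2 and the default max.urls would therefore index at most 20000 URLs.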
      

See also

Max.urls.area Directive

Description:Deprecated, ignored.
Syntax:max.urls.area name=number of URLs
Default:30
Context:Site configuration files

This directive is ignored. All areas are crawled with max URLs parameter defined in the root configuration file or the Arch crawling script.

See also

Mail.host Directive

Description:Mail server address. Required.
Syntax:mail.host=mail server address
Default:none
Context:Root configuration file, site configuration files

Sets the mail server to use to post mail.

Site configuration directives, if present, override root configuration directives.

Example:

mail.host = smtp.mycom.com
      

See also

Mail.level Directive

Description:Level of details in mail messages.
Syntax:mail.level=level
Default:INFO
Context:Root configuration file, site configuration files

There are five levels: DEBUG, INFO, WARN, ERROR, OFF. The DEBUG level is most detailed, the OFF level switches mail off.

Note that, even if mail.level is not OFF, email will not be sent if other mail related parameters are missing or invalid.

Site configuration directives, if present, override root configuration directives.

Example:

mail.level = OFF
      

Mail.password Directive

Description:Password to use if posting mail requires authentication.
Syntax:mail.password=password
Default:none
Context:Root configuration file, site configuration files

Sets the mail server password to use if posting mail requires authentication.

Site configuration directives, if present, override root configuration directives.

Example:

mail.password = My.name.is.Bond,James.Bond.
      

See also

Mail.transport.protocol Directive

Description:Mail transport protocol to use. Optional.
Syntax:mail.transport.protocol=protocol
Default:smtp
Context:Root configuration file, site configuration files

Sets the mail transport protocol to use to post mail.

Site configuration directives, if present, override root configuration directives.

Example:

mail.transport.protocol = smtp
      

See also

Mail.recipient Directive

Description:Email addresses to send messages to. Required.
Syntax:mail.recipient=address[{;,:,,}address...]
Default:none
Context:Root configuration file, site configuration files

Sets email addresses to send messages to.

Site configuration directives, if present, override root configuration directives. If this directive is present in a site configuration file, email messages related to the site are sent to the address defined in its configuration, and their text is not included in the combined message sent to the address configured in the root configuration file.

To have a separate email sent for a site, define at least the mail.recipient parameter in its configuration. The rest of the mail related parameters are optional, as long as they are defined in the global configuration file.

Example:

mail.recipient = address1@mycompany.com, address2@anothercompany.com
      

See also

Mail.subject Directive

Description:Sets the subject line for mail messages from Arch.
Syntax:mail.subject=subject line
Default:none
Context:Root configuration file, site configuration files

Sets the subject line for mail messages from Arch.

Site configuration directives, if present, override root configuration directives.

Example:

mail.subject = My message from Arch
      

See also

Mail.user Directive

Description:User name to use if posting mail requires authentication.
Syntax:mail.user=user name
Default:none
Context:Root configuration file, site configuration files

Sets the mail user name to use if posting mail requires authentication.

Site configuration directives, if present, override root configuration directives.

Example:

mail.user = bond007
      

See also

Merged.retention Directive

Description:Number of days to keep merged logs in logs repository.
Syntax:merged.retention=number
Default:-1
Context:Site configuration files

Specifies for how long to store merged logs in log repository before deleting them. -1 means no time limit.

Example:

merged.retention = 365
      

See also

Parallel.indexing Directive

Description:Switches On or Off parallel indexing
Syntax:parallel.indexing={on, off}
Default:off
Context:Root configuration file

If off, areas are crawled sequentially, one at a time. The sequential mode is recommended for troubleshooting a new installation or after a new site has been added. In this mode, if indexing fails, you can fix the problem and restart indexing. It will skip areas that have been processed successfully in the previous run. The parallel mode may significantly decrease the time of indexing. Note that bookmarks areas will be processed sequentially even if parallel processing is switched on.

In parallel indexing mode, all pre-configured crawling seeds (roots) together with all links found in logs are used as starting points of crawling.

Example:

parallel.indexing = off
      

See also

Permissions Directive

Description:Sets access permissions for a file or folder.
Syntax:permissions={f,d} | URL | groups-R/O | groups-R/W | users-R/O | users-R/W | owners | {s,i}
Default:see below
Context:Site configuration file

This parameter is used to set access permissions for a file or folder (and by default, its subfolders). These permissions take effect when affected documents are re-indexed.

In the syntax definition:

  • 'f' or 'd' stand for "folder" or "document" respectively;
  • groups-R/O – a space separated list of user groups having R/O access;
  • groups-R/W – a space separated list of user groups having R/W access;
  • users-R/O – a space separated list of users having R/O access;
  • users-R/W – a space separated list of users having R/W access;
  • owners – a space separated list of users having administrator access;
  • 's' or 'i' stand for defined or inherited permissions mode respectively.

If inherited mode is set, all user and group lists, including owners, are inherited from the parent folder. This setting practically disables the permissions directive because all permissions are inherited from the parent, which is the default mode.

By default, site root has an implicit permissions declaration like this:

f | http://mysite.com/ | public | null | guest | null | admin | s

All other folders and documents inherit permissions from their parent folders.

The example below limits access to the internal part of the site to the staff group only.

Example:

permissions = f | http://mysite.com/  | public | staff | admin | admin | admin | s 
permissions = f | http://mysite.com/internal/ | staff | staff | admin | admin | admin | s 
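
The eight pipe-separated fields described above can be unpacked as in the sketch below. It is an illustration only: the Permissions class and parse_permissions function are hypothetical names, and treating the literal null as an empty list is an assumption based on the implicit site root declaration shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Permissions:
    kind: str        # 'f' (folder) or 'd' (document)
    url: str
    groups_ro: list  # groups with R/O access
    groups_rw: list  # groups with R/W access
    users_ro: list   # users with R/O access
    users_rw: list   # users with R/W access
    owners: list     # users with administrator access
    mode: str        # 's' (defined) or 'i' (inherited)


def _names(field):
    """Split a space separated list; assumption: 'null' means an empty list."""
    field = field.strip()
    return [] if field in ("", "null") else field.split()


def parse_permissions(value):
    fields = [f.strip() for f in value.split("|")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    kind, url, g_ro, g_rw, u_ro, u_rw, owners, mode = fields
    if kind not in ("f", "d") or mode not in ("s", "i"):
        raise ValueError("invalid type or mode field")
    return Permissions(kind, url, _names(g_ro), _names(g_rw),
                       _names(u_ro), _names(u_rw), _names(owners), mode)


print(parse_permissions(
    "f | http://mysite.com/internal/ | staff | staff | admin | admin | admin | s"
))
```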
      

See also

Prune.content.types Directive

Description:Lists content types of pages to prune.
Syntax:prune.content.types=content type [| content type ...]
Default:none
Context:Root configuration file, site configuration files

Server output of pages with these content types will be pruned.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

prune.content.types = text/html | text/javascript
      

See also

Prune.content.types.after Directive

Description:Lists content types of pages to prune after textual content is extracted by parser.
Syntax:prune.content.types.after=content type [| content type ...]
Default:none
Context:Root configuration file, site configuration files

Extracted (by document parsers) text content of pages of these content types will be cleaned before indexing by removing fragments defined by the scan.ignore.bits.after parameter.

Note that files of types listed in the prune.content.types parameter are processed before parsing the document. This kind of processing is relatively easy to apply to text based files, such as HTML and PHP files. The prune.content.types.after parameter lists content types of files that are to be pruned after parsing them (extracting textual content from them).

To tune pruning, temporarily set log4j.logger.au.csiro.cass.arch.security.BasicPruner to TRACE in the conf/log4j.properties file to see what input the pruner gets.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

prune.content.types.after = application/msword | application/pdf
      

See also

Prune.file.types Directive

Description:Lists file types of pages to prune.
Syntax:prune.file.types=file type [[|] file type ...]
Default:none
Context:Root configuration file, site configuration files

Server output of pages with these file types will be pruned.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

prune.file.types = htm html asp aspx do
prune.file.types = php php3 php4 php5
      

See also

Prune.file.types.after Directive

Description:Lists file types of pages to prune after textual content is extracted by parser.
Syntax:prune.file.types.after=file type [[|] file type ...]
Default:none
Context:Root configuration file, site configuration files

Extracted (by document parsers) text content of pages of these file types will be cleaned before indexing by removing fragments defined by the scan.ignore.bits.after parameter.

Note that files of types listed in the prune.file.types parameter are processed before parsing the document. This kind of processing is relatively easy to apply to text based files, such as HTML and PHP files. The prune.file.types.after parameter lists types of files that are to be pruned after parsing them (extracting textual content from them).

To tune pruning, temporarily set log4j.logger.au.csiro.cass.arch.security.BasicPruner to TRACE in the conf/log4j.properties file to see what input the pruner gets.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

prune.file.types.after = doc docx pdf
      

See also

Remove.duplicates Directive

Description:Switches On or Off removing of index duplicates.
Syntax:remove.duplicates={on, off}
Default:on
Context:Root configuration file

If switched on, Arch will attempt to remove duplicate entries in the index. Such duplicates may be caused by use of URL aliases in indexed sites, or by crawling log links with a depth higher than one.

Arch will remove entries that have identical hash sums and occur in the same area. Index areas may overlap by design, and entries belonging to different areas are not considered duplicates even if they are identical.

Removing duplicates takes considerable time. This time can be reduced if pages have canonical URLs defined in their metadata.

Example:

remove.duplicates = on
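
The duplicate rule described above (same hash sum and same area) can be sketched as follows. This is an illustration, not Arch's implementation: the entry fields url, area and hash and the function name remove_duplicates are hypothetical, and keeping the first entry seen is an assumption, since the text does not say which copy survives.

```python
def remove_duplicates(entries):
    """Keep one entry per (area, hash) pair; assumption: the first one seen wins."""
    seen = set()
    kept = []
    for entry in entries:
        key = (entry["area"], entry["hash"])
        if key in seen:
            continue  # same content in the same area: a duplicate
        seen.add(key)
        kept.append(entry)
    return kept


entries = [
    {"url": "http://www.mycom.com/a.html", "area": "main", "hash": "abc"},
    {"url": "http://mycom.com/a.html", "area": "main", "hash": "abc"},
    {"url": "http://www.mycom.com/a.html", "area": "internal", "hash": "abc"},
]
print([e["url"] + " (" + e["area"] + ")" for e in remove_duplicates(entries)])
```

The second entry is dropped as an alias of the first, while the third is kept because it belongs to a different area.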
      

See also

Root.area Directive

Description:Defines a seed URL for crawling of the area.
Syntax:root.area name=URL
Default:none
Context:Site configuration files

Defines a seed URL for crawling of the area. Area roots (seeds) are used to start area crawling and are included in the area index. The number of roots in an area is limited only by resources.

Example:

root.main = http://www.mycom.com/index.html
root.main = http://www.mycom.com/links.html
      

See also

Security.enabled Directive

Description:Switches On or Off security checks and limitations.
Syntax:security.enabled={on, off}
Default:off
Context:Root configuration file

If set to off, this directive switches off security related checks and limitations.

It is recommended to keep security disabled until you get your search working as expected. Otherwise it may get in the way and make it harder to tell what is causing problems.

Example:

security.enabled = off
      

See also

Scan.alert Directive

Description:Defines suspicious strings to look for, where to look for them and what level of alert to raise if found.
Syntax:scan.alert=target string | alert level | text
Default:none
Context:Root configuration file, site configuration files

Raises an alert of the given level if the string is found while scanning the output of the page, its source, or both.

Each alert entry consists of three fields separated by a pipe character. The first field is the string to look for. The second field is the level of alert to raise: SAFE, UNSURE, UNSAFE or THREAT. The third field is what to scan: OUT - only output of the page, SRC - only source of the page, BOTH - both of them.

It is recommended that scanning is switched off during the first crawl, otherwise it will generate too many alerts, as every page and link will be new to it.

Example:

scan.alert = mail( | unsafe | src
scan.alert = $_REQUEST | unsafe | src
scan.alert = http://hostile.com | threat | both
      

See also

Scan.alert.level Directive

Description:Sets the minimal level of alerts to report.
Syntax:scan.alert.level={SAFE, UNSURE, UNSAFE, THREAT}
Default:UNSURE
Context:Root configuration file, site configuration files

The lowest level of alerts is SAFE. If scan.alert.level is set to SAFE, all alerts will be reported.

The highest level of alerts is THREAT. If scan.alert.level is set to THREAT, only alerts of THREAT level will be reported.

Example:

scan.alert.level= UNSURE
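
The threshold behaviour described above amounts to comparing positions in the ordered list of levels. The sketch below illustrates this; the name is_reported is hypothetical, and accepting lower-case level names is an assumption based on the lower-case values used in the scan.alert examples.

```python
# Levels ordered from lowest to highest, as listed in the syntax above.
LEVELS = ["SAFE", "UNSURE", "UNSAFE", "THREAT"]


def is_reported(alert_level, configured_level="UNSURE"):
    """An alert is reported if its level is at or above scan.alert.level."""
    return LEVELS.index(alert_level.upper()) >= LEVELS.index(configured_level.upper())


print(is_reported("unsafe"))            # reported under the default UNSURE level
print(is_reported("UNSURE", "THREAT"))  # not reported when only THREAT alerts pass
```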
      

See also

Scan.content.types Directive

Description:Lists content types of pages to scan.
Syntax:scan.content.types=content type [| content type ...]
Default:none
Context:Root configuration file, site configuration files

Pages with these content types will be scanned.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.content.types = text/html | text/javascript
      

See also

Scan.enabled Directive

Description:Switches On or Off document pruning and security scanning.
Syntax:scan.enabled={on, off}
Default:off
Context:Root configuration file

Enable or disable Arch security scanning and document cleaning related features.

Arch can monitor your site for potential threats, new and changed pages, scripts and links. You can define clues to look for and Arch will notify you when it finds something. However, security scanning has a cost because extra processing is involved. For better protection, it is desirable to scan not only output pages (such as those produced by PHP), but the source (PHP) scripts as well. If you do not want to do security scanning or are not ready to configure scanning and pruning parameters, just disable scanning.

It is recommended that scanning is switched off during the first crawl, otherwise it will generate too many alerts, as every page and link will be new to it.

Example:

scan.enabled = off
      

See also

Scan.file.types Directive

Description:Lists file types of pages to scan.
Syntax:scan.file.types=file type [[|] file type ...]
Default:none
Context:Root configuration file, site configuration files

Pages with these file types will be scanned.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.file.types = htm html asp aspx do
scan.file.types = php php3 php4 php5
      

See also

Scan.ignore.bits Directive

Description:Defines strings identifying text fragments to cut out from pages before parsing.
Syntax:scan.ignore.bits=TS fragment start string | fragment end string TE, where TS is { or [, and TE is } or ]
Default:none
Context:Root configuration file, site configuration files

Strings identifying text fragments to ignore. This can be used, for example, to avoid indexing common page fragments, such as headers and footers. Type one pair per line, separate start and end strings with a pipe character.

Each pair must be enclosed in '[' or '{' at the start and ']' or '}' at the end. '[' and ']' mean ignore the fragment, including the boundary string. '{' and '}' mean ignore the fragment, not including the boundary string. The two ends are chosen independently, so a pair may open with '[' and close with '}'.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Each one defines one fragment to ignore.

Example:

scan.ignore.bits = [ <div class="header"> | <div class="content"> }
scan.ignore.bits = [ <div class="footer"> | </html> }
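
The bracket semantics above can be illustrated with a small sketch that parses one directive value and cuts the fragment from a page. It is not Arch's pruner: parse_ignore_bits and cut_fragment are hypothetical names, and the sketch assumes that only the first matching fragment is removed and that the text is left unchanged when either boundary is missing.

```python
def parse_ignore_bits(value):
    """Return (start, end, cut_start, cut_end) for one directive value."""
    value = value.strip()
    opener, closer = value[0], value[-1]
    if opener not in "[{" or closer not in "]}":
        raise ValueError("value must start with [ or { and end with ] or }")
    start, end = (part.strip() for part in value[1:-1].split("|", 1))
    # '[' and ']' remove the boundary string too; '{' and '}' keep it.
    return start, end, opener == "[", closer == "]"


def cut_fragment(text, start, end, cut_start, cut_end):
    """Remove the first fragment between start and end (assumed behaviour)."""
    i = text.find(start)
    if i < 0:
        return text
    j = text.find(end, i + len(start))
    if j < 0:
        return text
    left = i if cut_start else i + len(start)
    right = j + len(end) if cut_end else j
    return text[:left] + text[right:]


rule = parse_ignore_bits('[ <div class="header"> | <div class="content"> }')
page = '<body><div class="header">Menu</div><div class="content">Text</div></body>'
print(cut_fragment(page, *rule))
```

For the first example rule, the header is removed together with its opening tag, while the opening tag of the content block is kept because the pair closes with '}'.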
      

See also

Scan.ignore.bits.after Directive

Description:Defines strings identifying text fragments to cut out from documents after parsing.
Syntax:scan.ignore.bits.after=TS fragment start string | fragment end string TE, where TS is { or [, and TE is } or ]
Default:none
Context:Root configuration file, site configuration files

Strings identifying text fragments to ignore in parsed pages. This can be used, for example, to avoid indexing common fragments of MS Word and PDF files. Type one pair per line, separate start and end strings with a pipe character.

Each pair must be enclosed in '[' or '{' at the start and ']' or '}' at the end. '[' and ']' mean ignore the fragment, including the boundary string. '{' and '}' mean ignore the fragment, not including the boundary string.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Each one defines one fragment to ignore.

Example:

scan.ignore.bits.after = [ Contents | page 65 ]
scan.ignore.bits.after = [ Contact | contact@our.com ]
      

See also

Scan.ignore.links Directive

Description:Links to ignore for scanning and reporting purposes.
Syntax:scan.ignore.links=URL
Default:none
Context:Root configuration file, site configuration files

The listed links will be ignored. This can be used, for example, to avoid reporting of common links, such as those occurring in headers and footers.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.ignore.links = http://www.mysite.com/contact.html
scan.ignore.links = http://www.mysite.com/home.html
      

See also

Scan.ignore.scripts Directive

Description:Script fragments containing these strings will be ignored.
Syntax:scan.ignore.scripts=string1 [| string2...]
Default:none
Context:Root configuration file, site configuration files

Script fragments containing these strings will be ignored. This can be used, for example, to avoid scanning and reporting common scripts, such as those used to generate headers and footers. Use with care, as attackers may include one of these strings in their script to hide it. Separate strings with a pipe character.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.ignore.scripts = include("header.inc"); | include("footer.inc");
scan.ignore.scripts = include("sidebar.inc"); 
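
The merging of several directives, and the reason for the warning above, can be sketched in Python. This is a simplified illustration under stated assumptions, not Arch's implementation: the helper names merge_ignore_strings and is_ignored are hypothetical, and matching is assumed to be a plain substring test.

```python
def merge_ignore_strings(lines, key="scan.ignore.scripts"):
    """Collect pipe-separated strings from every occurrence of the directive."""
    merged = []
    for line in lines:
        name, sep, value = line.partition("=")
        if not sep or name.strip() != key:
            continue
        merged.extend(s.strip() for s in value.split("|") if s.strip())
    return merged


def is_ignored(fragment, ignore_strings):
    """A fragment is skipped if it contains any of the listed strings."""
    return any(s in fragment for s in ignore_strings)


config = [
    'scan.ignore.scripts = include("header.inc"); | include("footer.inc");',
    'scan.ignore.scripts = include("sidebar.inc"); ',
]
ignore = merge_ignore_strings(config)

# A hostile script that embeds a listed string is skipped as well.
hostile = '<? include("header.inc"); run_payload(); ?>'
skipped = is_ignored(hostile, ignore)
```

Here the two directives yield three strings, and the hostile fragment is skipped because it contains one of them. Short or generic strings make this kind of hiding easier, so longer and more specific strings are the safer choice.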
      

See also

Scan.min.script.size Directive

Description:Script fragments smaller than this size are ignored.
Syntax:scan.min.script.size=size in characters
Default:on
Context:Root configuration file, site configuration files

Script fragments of size smaller than this are ignored. Use with care.

Site configuration directives, if present, override root configuration directive.

Example:

scan.min.script.size = on
      

See also

Scan.report.changed.forms Directive

Description:Switches on and off reporting changed pages with forms in them.
Syntax:scan.report.changed.forms={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting changed pages with forms in them. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.changed.forms = on
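
The effect of the alerts level on UNSURE notifications can be sketched in Python. This is an illustration only: the function name should_send is hypothetical, only the three level names mentioned in this manual section are modelled, and ranking THREAT above UNSAFE is an assumption.

```python
# Assumed order, from least to most severe.
LEVELS = ["UNSURE", "UNSAFE", "THREAT"]


def should_send(notification_level, alerts_level):
    """Send a notification only if it is at least as severe as the alerts level."""
    return LEVELS.index(notification_level) >= LEVELS.index(alerts_level)


# Change reports are of UNSURE level.
sent_at_unsure = should_send("UNSURE", "UNSURE")
sent_at_unsafe = should_send("UNSURE", "UNSAFE")
sent_at_threat = should_send("UNSURE", "THREAT")
```

Under these assumptions a change report is sent only when the alerts level is UNSURE, which matches the behaviour described above for UNSAFE and THREAT.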
      

See also

Scan.report.changed.pages Directive

Description:Switches on and off reporting changed pages.
Syntax:scan.report.changed.pages={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting changed pages. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.changed.pages = on
      

See also

Scan.report.changed.scripts Directive

Description:Switches on and off reporting changed script files.
Syntax:scan.report.changed.scripts={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting changed script files. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.changed.scripts = on
      

See also

Scan.report.link.changes Directive

Description:Switches on and off reporting changed links in pages.
Syntax:scan.report.link.changes={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting changed links in pages. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.link.changes = on
      

See also

Scan.report.new.forms Directive

Description:Switches on and off reporting new pages with forms in them.
Syntax:scan.report.new.forms={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting new pages with forms in them. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.new.forms = on
      

See also

Scan.report.new.pages Directive

Description:Switches on and off reporting new pages.
Syntax:scan.report.new.pages={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting new pages. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.new.pages = on
      

See also

Scan.report.new.scripts Directive

Description:Switches on and off reporting new script files.
Syntax:scan.report.new.scripts={on, off}
Default:on
Context:Root configuration file, site configuration files

Switches on and off reporting new script files. This notification is of UNSURE level. If your alerts level is set to UNSAFE or THREAT, change reports will not be sent.

Site configuration directives, if present, override root configuration directive.

Example:

scan.report.new.scripts = on
      

See also

Scan.script.edges Directive

Description:Defines strings to use to find starts and ends of script fragments in pages.
Syntax:scan.script.edges=script start string | script end string
Default:none
Context:Root configuration file, site configuration files

Defines strings to use to find starts and ends of script fragments in pages. Type in the start and end strings, separated by a pipe character.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.script.edges = <? | ?> 
scan.script.edges = <script | script> 
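
How a pair of edge strings delimits script fragments can be sketched in Python. This is a minimal illustration, not Arch's scanner: the function name extract_script_fragments is hypothetical, and it is assumed that each fragment runs from a start string to the nearest following end string, boundaries included.

```python
def extract_script_fragments(page, edges):
    """Return every fragment delimited by one of the (start, end) pairs."""
    fragments = []
    for start, end in edges:
        pos = 0
        while True:
            i = page.find(start, pos)
            if i < 0:
                break
            j = page.find(end, i + len(start))
            if j < 0:
                break  # unterminated fragment: stop looking for this pair
            fragments.append(page[i:j + len(end)])
            pos = j + len(end)
    return fragments


# The two pairs from the example above.
edges = [("<?", "?>"), ("<script", "script>")]
page = 'a <? echo 1; ?> b <script src="x.js"></script> c'
found = extract_script_fragments(page, edges)
```

With the two pairs from the example, the sample page yields one PHP fragment and one script element. The end string "script>" matches the tail of the closing tag, which is why the pair works for both inline and external scripts.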
      

See also

Scan.src.content.types Directive

Description:Lists content types of pages to scan source code of.
Syntax:scan.src.content.types=content type [| content type ...]
Default:none
Context:Root configuration file, site configuration files

Source code of pages with these content types will be scanned. Access to the source code of these pages must be provided via the scan.source.access.url directive.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.src.content.types = text/html | text/javascript
      

See also

Scan.src.file.types Directive

Description:Lists file types of pages to scan source code of.
Syntax:scan.src.file.types=file type [[|] file type ...]
Default:none
Context:Root configuration file, site configuration files

Source code of pages with these file types will be scanned. Access to the source code of these pages must be provided via the scan.source.access.url directive.

Site configuration directives, if present, override root configuration directives.

There can be several directives in one configuration file. Their contents are merged.

Example:

scan.src.file.types = htm html asp aspx do
scan.src.file.types = php php3 php4 php5
      

See also

Sitemap.url Directive

Description:URL of a sitemap file.
Syntax:sitemap.url=URL
Default:none
Context:Site configuration file

The URL of a file with pre-processed sitemap data that can be used as a substitute for log processing. See the Arch deployment manual for more about generating, encrypting, and making available sitemaps of remote sites.

Example:

sitemap.url = http://mySite.base.url/arch/sitemap.dat
      

See also

Solr.url Directive

Description:Address of Solr server that is being used with Arch.
Syntax:solr.url=URL
Default:http://localhost:8993/arch
Context:Root configuration file

Address of Solr server that is being used with Arch. Arch comes with its own copy of Jetty engine and Solr. It is also possible to use Solr installed on a remote computer, especially if several Arch crawlers are being used in parallel.

If the Jetty engine installed with Arch is not being used, disable starting it in the Arch crawling script.

Example:

solr.url = http://localhost:8993/arch
      

Target.db Directive

Description:Defines the database access URL.
Syntax:target.db=database access URL
Default:jdbc:h2:embedded;MODE=MySQL;CACHE_SIZE=524288
Context:Root configuration file, site configuration files

Defines the default database to use. Arch uses embedded H2 by default.

Site configurations may override this directive and use their own databases.

Example:

target.db = jdbc:mysql://localhost/arch?user=myname&password=mypassword
      

See also

Temp.dir Directive

Description:Directory for temporary data.
Syntax:temp.dir=path to directory
Default:$ARCH_HOME/temp
Context:Root configuration file

The directory where crawling data is kept temporarily before sending it to Solr. Contents are deleted after successful crawling. Make sure that this directory has plenty of free space for temporary use.

Example:

temp.dir = /opt/arch/temp
      

Threads Directive

Description:Default number of threads to use for URL fetching.
Syntax:threads=number
Default:10
Context:Root configuration file

This directive is equivalent to Nutch fetcher.threads.fetch configuration property. It is the number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes as fetcher has one map task per node.

If CRAWLING_THREADS directive is present, it overrides this directive.

In sequential indexing mode, threads.area directives in site configurations, if present, override this directive.

Example:

threads = 10
      

See also

Threads.area Directive

Description:Number of threads to use for URL fetching when crawling this area.
Syntax:threads.area name=number
Default:10
Context:Site configuration files

This directive is equivalent to Nutch fetcher.threads.fetch configuration property. It is the number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes as fetcher has one map task per node.

If CRAWLING_THREADS directive is present, it overrides this directive.

In sequential indexing mode, this directive, if present, overrides threads directive in the root configuration file. It is ignored in parallel indexing mode.

Example:

threads.documentation = 10
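
The precedence rules for the thread count can be sketched in Python. This is an illustration of the rules stated above, not Arch's code: the function name resolve_threads and the dictionary representation of the configuration files are hypothetical, and CRAWLING_THREADS is modelled as an optional argument.

```python
def resolve_threads(root, site, area, crawling_threads=None, sequential=True):
    """Pick the number of fetcher threads for one area."""
    # CRAWLING_THREADS, if present, overrides everything else.
    if crawling_threads is not None:
        return int(crawling_threads)
    # threads.<area name> applies only in sequential indexing mode.
    key = "threads." + area
    if sequential and key in site:
        return int(site[key])
    # Fall back to the root threads directive, then to the documented default.
    return int(root.get("threads", 10))


root = {"threads": "10"}
site = {"threads.documentation": "4"}

sequential_count = resolve_threads(root, site, "documentation")
parallel_count = resolve_threads(root, site, "documentation", sequential=False)
forced_count = resolve_threads(root, site, "documentation", crawling_threads=2)
```

In this sketch the documentation area is crawled with 4 threads in sequential mode, with 10 in parallel mode where the area setting is ignored, and with 2 when CRAWLING_THREADS is set.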
      

See also

Usersread.bookmarks Directive

Description:Lists names of users allowed to see indexed bookmarks.
Syntax:usersread.bookmarks=user name[ user name ...]
Default:guest
Context:Site configuration files

Lists names of users allowed to see indexed bookmarks.

Example:

usersread.bookmarks = guest admin
      

See also

Watch.mode Directive

Description:Switches watch mode on or off.
Syntax:watch.mode={on, off}
Default:off
Context:Root configuration file

Switches watch mode on or off.

Example:

watch.mode = off
      

See also