Configuration
You find the config in the ./config/ directory. It contains a JSON file crawler.config.json with several sections.
By default the project ships with an example file that you can use for your own adaptations.
Section “options”
Setup of default options … these are not specific to a website profile.
Variable | Type | Comments |
---|---|---|
analysis | key | Values for analysis |
… -> MinDescriptionLength | integer | html check - Minimum count of chars in the description |
… -> MinKeywordsLength | integer | html check - Minimum count of chars in the keywords. Remark: the importance for SEO is very low; Google does not process the keywords for its search index. If you use the ahCrawler search, hits in the keywords get a higher ranking than hits in the content. If you do not use the search engine, you can set the value to 0 (zero) to disable the check for keywords. |
… -> MinTitleLength | integer | html check - Minimum count of chars in the document title |
… -> MaxPagesize | integer | html check - Limit to show as large page (byte) |
… -> MaxLoadtime | integer | html check - Limit to show as long loading page (ms) |
auth | key | authentication for the backend; you can setup a single user only. You can disable this (quite simple) internal authentication with removing (or renaming) this section. If you need several users: disable this section and setup an authentication with apache users based on directory or location. |
… -> user | string | username |
… -> password | string | hash of wanted password using password_hash() |
crawler | key | defaults for crawling |
… -> memoryLimit | string | Memory size for CLI only; the value is a valid memory size for ini_set('memory_limit', [value]); the default is 512M |
… -> searchindex | key | defaults for search index crawler |
…… -> simultanousRequests | integer | default count of simultaneous requests when crawling pages for all projects you set up. The crawling for the search index makes http GET requests (it loads the content). The minimum number is 2. Hint: do not overload your own servers - speed is really not important for a cronjob. |
… -> ressources | key | defaults for resources crawler |
…… -> simultanousRequests | integer | default count of simultaneous requests for the resources scan for all projects you set up. The resources scan makes http HEAD requests (it does NOT load the content). The minimum number is 2. Hint: do not overload your own servers - speed is really not important for a cronjob. And: you can set HEAD requests to a much higher value than GET requests. Try 5 … and then higher values. |
… -> timeout | integer | timeout value for all crawling requests (GET and HEAD) in seconds. |
… -> userAgent | string | User agent to use for crawling. Background: a few webservers send an http error code when they detect a crawler. If you set the user agent of a “normal” web browser, the chance is higher to get a valid response. Hint: in the web gui go to the setup to take over the user agent of your currently used browser. |
database | key | define database connection |
… -> database_type | string | type of the PDO database connection; one of “sqlite” or “mysql” (so far only these two are supported). For first tests use “sqlite” - it is OK for websites with a few hundred pages and is simpler to set up. |
… -> database_file | string | sqlite only: name of the database file. default is “__DIR__/data/ahcrawl.db” (where __DIR__ will be replaced with application root directory) |
… -> database_name | string | non-sqlite only: name of the database scheme |
… -> server | string | non-sqlite only: name of database host |
… -> username | string | non-sqlite only: name of database user |
… -> password | string | non-sqlite only: password of database user |
… -> charset | string | non-sqlite only: charset; example: “utf-8” |
cache | boolean | Use a cache in the backend pages to speed them up. The cache expires with a new crawling process for the viewed website profile. This feature is work in progress and is disabled by default (false). |
debug | boolean | show debug infos in the backend pages. |
lang | string | language of the backend interface; one of “de” or “en” |
skin | string | Name of the current skin. It is a directory name in ./backend/skins/. |
menu | array | hide menu items in the backend - the key is the name of the page (have a look at the url in the address bar: ?page=[name]) - the value is one of true|false |
menu-public | array | hide menu items in the public frontend - the key is the name of the page (have a look at the url in the address bar: ?page=[name]) - the value is one of true|false |
searchindex | key | defaults for building the search index |
… -> regexToRemove | array | list of regex to remove from the html body for the search index; by default it contains regexes for html comments, script and style sections, link rel tags, nav tags and footer tags |
… -> rankingWeights | array | Define factors to weight the search results in the search form on your website. A direct match of the search term with a found word should be weighted higher than a match in the middle of a longer word. The sections are: matchWord - exact hit of a whole word; WordStart - hit at the beginning of a word; any - hit anywhere in the text. Inside each of these sections are the places where the search term is scanned; a hit in the url, for instance, should have a higher weight than a hit in the content: content … in the content; description … in the meta description; keywords … in the keywords; title … in the title tag; url … in the url. See the structural sketch below the table. |
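The nesting of the rankingWeights section is easier to see as a sketch. The following is an illustration only: the numeric weights are placeholder values (a higher number means a higher weight) and the key names are taken from the table above.

"rankingWeights": {
    "matchWord": { "content": 10, "description": 20, "keywords": 20, "title": 30, "url": 40 },
    "WordStart": { "content": 5,  "description": 10, "keywords": 10, "title": 15, "url": 20 },
    "any":       { "content": 1,  "description": 2,  "keywords": 2,  "title": 3,  "url": 4 }
}

A shortened example of the options section: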
{
"options":{
"database":{
"database_type": "sqlite",
"database_file": "__DIR__/data/ahcrawl.db"
},
"auth": {
"user": "admin",
"password": "put-md5-hash-here",
},
"lang": "en",
"crawler": {
"memoryLimit": false,
"userAgent": false,
"searchindex":{
"simultanousRequests": 2
},
"ressources":{
"simultanousRequests": 2
}
},
"searchindex": {
"regexToRemove": [
"<footer[^>]>.*?<\/footer>",
"<nav[^>]>.*?<\/nav>",
"<script[^>]*>.*?<\/script>",
"<style[^>]*>.*?<\/style>"
]
},
"analysis": {
"MinTitleLength": 20,
"MinDescriptionLength": 40,
"MinKeywordsLength": 10,
"MaxPagesize": 150000,
"MaxLoadtime": 500
}
},
(...)
}
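The example above uses sqlite. To connect to a MySQL database instead, the database section could look like the following sketch; host name, schema name and credentials are placeholders that you must replace with your own values:

"database":{
    "database_type": "mysql",
    "database_name": "ahcrawler",
    "server": "localhost",
    "username": "ahcrawler",
    "password": "secret",
    "charset": "utf-8"
}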
Section “profiles”
Setup of the websites to crawl. The first index below is an integer value that is called the profile id. The table below describes all values inside a profile.
{
"options":{
(... see above ...)
},
"profiles":{
"[id]":{
(... profile settings ...)
}
}
}
Variable | Type | Comments |
---|---|---|
label | string | A short name for the website. It is shown in the admin backend as a tab label at the top. |
description | string | description text for this website |
userpwd | string | optional setting for password protected websites with basic authentication. The syntax is [username]:[password] |
searchindex | key | definitions for the crawler of the search index |
… -> urls2crawl | array | Start urls for scan |
… -> iDepth | integer | Maximum path level to scan |
… -> iMaxUrls | integer | For initial tests: set max. count of urls to scan (0 = no limit) |
… -> include | array | Array with regex that will be applied to any detected full url in a link. The crawler adds a url if it matches one of the regex. Default: none; any url (matching the sticky url) will be followed. |
… -> includepath | array | Array with regex that will be applied to the url path of any detected link. The crawler adds a url if it matches one of the regex. Default: none; any url (matching the sticky url) will be followed. |
… -> exclude | array | Array with regex that will be applied to the url path of any detected link. The crawler skips a url if it matches one of the regex. Default: none; any url (matching the sticky url) will be followed. Remark: even if it is empty … the crawler follows several disallow options by default: disallow for agent “*” in robots.txt; disallow for agent “ahcrawler” in robots.txt; meta robots noindex and nofollow in the html head; nofollow attribute in a link |
… -> regexToRemove | array | list of regex to remove from html body for the search index; it overrides the default in the options section: options -> searchindex -> regexToRemove. |
… -> simultanousRequests | integer | count of simultaneous requests; it overrides the default in the options section: options -> crawler -> searchindex -> simultanousRequests. |
frontend | key | definitions for the frontend (search form for your website) |
… -> searchcategories | array | items for search categories based on the url path. Syntax: [key] = label of the filter; [value] = value for the sql WHERE statement, e.g. “my blog”: “/blog/%” |
… -> searchlang | array | items for the language select box in the search form, e.g. [“de”, “en”] |
resources | key | definitions for the resources crawler |
… -> simultanousRequests | integer | count of simultaneous requests; it overrides the default in the options section: options -> crawler -> ressources -> simultanousRequests |
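Putting the values together, a single profile entry could look like the following sketch. The key names follow the table above; all labels, urls and regexes are placeholder values for illustration only:

"profiles":{
    "1":{
        "label": "My website",
        "description": "Our company website",
        "searchindex":{
            "urls2crawl": [ "https://www.example.com/" ],
            "iDepth": 6,
            "iMaxUrls": 0,
            "exclude": [ "\/login\/" ],
            "simultanousRequests": 2
        },
        "frontend":{
            "searchcategories": { "my blog": "/blog/%" },
            "searchlang": [ "de", "en" ]
        }
    }
}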