Lists of URLs were imported into the crawler. Two different lists were maintained:
• List of website (home page) URLs – these covered the entire website and were used to track updates (additions/deletions) across the site.
• List of investor relations (IR) page URLs – these were used only to source documents.
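The two lists could be represented as simple mappings, as in the sketch below; the company name and URLs are illustrative placeholders, not entries from the production lists.

```python
# Hypothetical representation of the two URL lists described above.
SITE_URLS = {  # home-page URLs: crawl the whole site to track additions/deletions
    "ExampleCo": "https://www.example.com/",
}
IR_URLS = {    # investor relations (IR) page URLs: used only to source documents
    "ExampleCo": "https://www.example.com/investor-relations/",
}

print(len(SITE_URLS), len(IR_URLS))
```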
Keyword patterns were set to match the relevant links/files; the keywords could be added, removed, or changed according to the requirements.
• For example, [*.pdf] downloaded all PDF files, and [annual*.*] downloaded all documents whose file name started with the word “annual”.
• Initially, all documents were targeted and captured. The filters were then refined based on the results, so that only specific documents were targeted and downloaded from the web pages.
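The wildcard filtering described above can be sketched with Python's standard `fnmatch` module; the filter list and the link names below are illustrative assumptions, not the production configuration.

```python
from fnmatch import fnmatch

# Hypothetical filter configuration, matching the examples in the text.
FILTERS = ["*.pdf", "annual*.*"]

def matches_filters(filename: str, patterns=FILTERS) -> bool:
    """Return True if the file name matches any configured wildcard pattern."""
    name = filename.lower()
    return any(fnmatch(name, pattern.lower()) for pattern in patterns)

# Only links whose names match a filter are downloaded.
links = ["annual_report_2023.pdf", "press_release.docx", "interim.pdf"]
to_download = [link for link in links if matches_filters(link)]
print(to_download)  # ['annual_report_2023.pdf', 'interim.pdf']
```

Starting with a broad pattern such as `*.*` and then narrowing the list, as the text describes, only requires editing `FILTERS`.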
• Website URLs were scheduled by country and company; the crawler searched the web pages at the defined frequency and times.
• The schedule for each country was set according to regional working hours.
• The frequency was set to twice a day for each company website, i.e. the crawler searched the web pages every 12 hours.
• For example, Indian company websites were crawled every 12 hours, at 12 pm and 12 am, which ensured that any new document was sourced within 12 hours of publication.
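The region-aware, twice-a-day schedule above can be sketched as follows; the country table, UTC offsets, and the 12 am / 12 pm run hours are illustrative assumptions rather than the actual scheduler configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical regional settings: country -> (UTC offset in hours, local run hours).
SCHEDULE = {
    "India": (5.5, [0, 12]),  # 12 am and 12 pm IST
    "UK": (0, [0, 12]),
}

def next_run(country: str, now_utc: datetime) -> datetime:
    """Return the next crawl time (in UTC) for the given country."""
    offset_h, run_hours = SCHEDULE[country]
    tz = timezone(timedelta(hours=offset_h))
    local_now = now_utc.astimezone(tz)
    for hour in sorted(run_hours):
        candidate = local_now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > local_now:
            return candidate.astimezone(timezone.utc)
    # All of today's runs have passed; schedule the first run tomorrow.
    tomorrow = local_now + timedelta(days=1)
    candidate = tomorrow.replace(hour=min(run_hours), minute=0, second=0, microsecond=0)
    return candidate.astimezone(timezone.utc)

now = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)  # 1:30 pm IST
print(next_run("India", now))  # 2024-01-01 18:30:00+00:00 (12 am IST next day)
```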
• The crawler downloaded the documents into a specified folder.
• The user then ran the sort-and-verify option to review the downloaded documents.
• The user clicked each document link and selected the relevant document type (annual report, interim report, other document, etc.) to classify it.
• The sorted documents were then uploaded automatically to the FTP server, and a notification email was sent.
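The final upload-and-notify step could look like the sketch below, using Python's standard `ftplib` and `smtplib`; the host names, credentials, and email addresses are placeholders, not the production values.

```python
import ftplib
import smtplib
from email.message import EmailMessage
from pathlib import Path

def upload_to_ftp(files, host="ftp.example.com", user="crawler", password="secret"):
    """Upload each sorted document to the FTP server (placeholder credentials)."""
    with ftplib.FTP(host, user, password) as ftp:
        for path in files:
            with open(path, "rb") as fh:
                ftp.storbinary(f"STOR {Path(path).name}", fh)

def build_notification(files, sender="crawler@example.com", recipient="team@example.com"):
    """Build the notification email listing the uploaded documents."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(files)} document(s) uploaded"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Uploaded:\n" + "\n".join(Path(f).name for f in files))
    return msg

def send_notification(msg, host="smtp.example.com"):
    """Send the notification via a plain SMTP relay (placeholder host)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)

msg = build_notification(["reports/annual_2023.pdf", "reports/interim_q2.pdf"])
print(msg["Subject"])  # 2 document(s) uploaded
```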