Crawling FDS

Case Study:

To monitor, control & enhance document sourcing, iSN has developed an application that downloads documents from URLs and then reviews, sorts & uploads them onto the client FTP. iSN helps the client define its document strategy, and control & enhance the entire document database accordingly.

Knowledge:

"iSN Web Crawler" is the unique software to download different types of documents from specific URLs as per the defined keywords & schedule. It has also provision to review, sort downloaded documents & upload them onto the defined ftp with an auto email notification to the client. User can also monitor the updates on web URLs, which helps to track webpage amendments. The tool has a provision to monitor, measure & control the administration of document sourcing.

Process Flow:

(Process flow diagram: Import URL list → Set up keywords → Schedule & configure → Detect changes → Download documents → Reset dead/changed URLs → Detect URL movement → Sort downloaded documents → Upload to FTP → Summary report.)

Import URL List:

Lists of URLs are imported into the crawler. Two different lists are used, as sketched below:
• List of website (home page) URLs – these URLs cover the entire website and are used to track updates (additions/deletions) across the site.
• List of IR page URLs – these URLs are used only to source documents.
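
As a rough illustration of this step, the sketch below loads the two lists from one-column CSV files; the file names and the in-memory structure are assumptions, not the actual iSN Web Crawler format.

```python
# A minimal loading sketch, assuming each URL list lives in a one-column CSV
# file. The file names and the data structure are illustrative only.
import csv

def load_url_list(path):
    """Read one URL per row from a CSV file and return a de-duplicated list."""
    urls = []
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh):
            if row and row[0].strip().startswith("http"):
                urls.append(row[0].strip())
    return sorted(set(urls))

# Two separate lists, as described above (file names are assumptions).
website_urls = load_url_list("website_home_pages.csv")  # whole-site update tracking
ir_page_urls = load_url_list("ir_pages.csv")            # document sourcing only
print(len(website_urls), "website URLs,", len(ir_page_urls), "IR page URLs")
```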

Setting up Keywords:

Different keywords are set to match the relevant links/files. Keywords can be added/removed/changed according to requirements (a filtering sketch follows this list).
• For example, [*.pdf] downloads all PDF files, and [annual*.*] downloads all documents with the word “annual” in the file name.
• Initially, all documents were targeted & captured. The filters were then refined based on the results, so that only the specific document(s) of interest were targeted / downloaded from the web pages.
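
The wildcard patterns above behave like standard file-name globs, so a hedged sketch of the filter can be written with Python's fnmatch; the pattern list and link names here are illustrative.

```python
# Sketch of the keyword/wildcard filter, assuming glob-style patterns such as
# "*.pdf" and "annual*.*" are matched against linked file names.
from fnmatch import fnmatch

KEYWORD_PATTERNS = ["*.pdf", "annual*.*"]

def matches_keywords(file_name, patterns=KEYWORD_PATTERNS):
    """Return True if the file name matches any configured pattern."""
    name = file_name.lower()
    return any(fnmatch(name, pattern) for pattern in patterns)

links = ["Annual_Report_2015.pdf", "logo.png", "interim-results.pdf"]
print([link for link in links if matches_keywords(link)])
# ['Annual_Report_2015.pdf', 'interim-results.pdf']
```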

Scheduling and configurations:

• Website URLs were scheduled according to the list of countries and companies. The crawler searches the web pages at the defined frequency & time.
• The schedule for each country was set according to the regional working hours.
• The frequency was set to twice a day for each company website, i.e. the crawler searches the web pages every 12 hours.
• For example, the frequency for Indian companies is every 12 hours, at 12pm & 12am local time. This ensures that a new document is sourced within 12 hours (a scheduling sketch follows this list).
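
A minimal scheduling sketch, assuming the twice-a-day runs are aligned to 12am & 12pm in each region's local time zone; the region/time-zone mapping is an assumption.

```python
# Compute the next crawl slot for a region, assuming runs at 12am & 12pm
# local time (every 12 hours). Region names and time zones are placeholders.
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

REGION_TZ = {"India": "Asia/Kolkata", "UK": "Europe/London"}
RUN_HOURS = (0, 12)  # midnight and noon, i.e. every 12 hours

def next_run(region):
    """Return the next scheduled crawl time in the region's local time."""
    now = datetime.now(ZoneInfo(REGION_TZ[region]))
    for hour in RUN_HOURS:
        slot = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if slot > now:
            return slot
    # Both of today's slots have passed; use the first slot tomorrow.
    return (now + timedelta(days=1)).replace(hour=RUN_HOURS[0],
                                             minute=0, second=0, microsecond=0)

print("Next crawl for Indian companies:", next_run("India"))
```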

Detect Changes on Web Pages:

The crawler detects changes on web pages according to the configuration set in the tool. Detected documents are downloaded automatically into the specified folder.
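
A minimal change-detection sketch, assuming changes are found by comparing a hash of each page against the hash recorded on the previous crawl; the state file and example URL are placeholders.

```python
# Detect whether a page changed since the last crawl by hashing its content.
# STATE_FILE and the example URL are illustrative, not the tool's real layout.
import hashlib, json, os, urllib.request

STATE_FILE = "page_hashes.json"

def page_changed(url, state):
    """Fetch the page, hash it, and compare with the hash from the last run."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        digest = hashlib.sha256(resp.read()).hexdigest()
    changed = state.get(url) != digest
    state[url] = digest
    return changed

state = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
if page_changed("https://example.com/investor-relations", state):
    print("Page changed - queue its matching document links for download")
with open(STATE_FILE, "w") as fh:
    json.dump(state, fh)
```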

Reset dead / changed URLs:

The crawler also displays the connection status of each URL, highlighting those that did not connect successfully. The user can then review those URLs and make the necessary corrections to them in the interface.
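
As a rough sketch of how dead or moved URLs might be flagged for review, the check below issues an HTTP HEAD request and reports whether the URL responded, redirected, or failed; the status labels are assumptions.

```python
# Classify each URL as 'ok', 'moved' (redirected), or 'dead' (no response),
# so a user can correct the dead/changed ones. Labels are illustrative.
import urllib.error, urllib.request

def check_url(url):
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=15) as resp:
            return "moved" if resp.geturl() != url else "ok"
    except (urllib.error.URLError, ValueError):
        return "dead"

for url in ["https://example.com/", "https://example.com/old-ir-page"]:
    print(url, "->", check_url(url))
```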

URL movement detection:

Websites may add / remove URLs frequently, so the crawler is set to detect the number of URLs added / removed under each website. The user can then access those URLs to review the activity.
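
A simple way to picture this is as a set difference between the links found on two consecutive crawls; the link sets below are made-up placeholders.

```python
# Detect URLs added to or removed from a website between two crawls by
# diffing the sets of links found on each pass. Example data is fictional.
previous_links = {"https://example.com/ir", "https://example.com/annual-2014.pdf"}
current_links = {"https://example.com/ir", "https://example.com/annual-2015.pdf"}

added = current_links - previous_links
removed = previous_links - current_links
print(f"{len(added)} URL(s) added:   {sorted(added)}")
print(f"{len(removed)} URL(s) removed: {sorted(removed)}")
```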

Sorting of downloaded documents:

• The crawler downloads documents into a specified folder.
• The user then runs the sort option to verify the downloaded documents.
• The user clicks on each document link & selects the relevant document type, i.e. annual report, interim report, other document etc., to classify the documents.
• The sorted documents are uploaded automatically to the FTP server & an email notification is sent (a sketch of this step follows the list).
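
A hedged sketch of the sort-and-upload step: the reviewer's chosen document type decides the FTP folder, and a notification e-mail is sent afterwards. Host names, credentials, addresses and the folder layout are placeholders, not the client's real configuration.

```python
# Upload a classified document to an FTP folder named after its type and
# send an e-mail notification. All hosts/credentials below are placeholders.
import ftplib, os, smtplib
from email.message import EmailMessage

DOC_TYPES = {"1": "annual_report", "2": "interim_report", "3": "other"}

def upload_and_notify(local_path, doc_type):
    name = os.path.basename(local_path)
    with ftplib.FTP("ftp.example.com", "user", "password") as ftp:
        ftp.cwd(doc_type)                      # one FTP folder per document type
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {name}", fh)

    msg = EmailMessage()
    msg["Subject"] = f"New {doc_type} uploaded: {name}"
    msg["From"], msg["To"] = "crawler@example.com", "client@example.com"
    msg.set_content(f"{name} was classified as '{doc_type}' and uploaded to FTP.")
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

choice = "1"  # in the tool, this comes from the reviewer clicking a document link
upload_and_notify("downloads/Annual_Report_2015.pdf", DOC_TYPES[choice])
```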

Summary Report:

The user can monitor the status of the crawler and click the “View list” button to see the list of URLs or documents under each status type. The summary status can also be sent via email.
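
A small sketch of how such a summary might be tallied and e-mailed, assuming the crawler keeps a per-URL status log; status names and addresses are illustrative.

```python
# Count URLs/documents per status and e-mail the totals as a summary report.
# The status log and mail server below are illustrative assumptions.
import smtplib
from collections import Counter
from email.message import EmailMessage

crawl_log = [
    ("https://example.com/ir", "downloaded"),
    ("https://example.com/old-page", "dead"),
    ("https://example.com/annual-2015.pdf", "downloaded"),
]
summary = Counter(status for _, status in crawl_log)
body = "\n".join(f"{status}: {count}" for status, count in summary.items())

msg = EmailMessage()
msg["Subject"] = "iSN Web Crawler - summary report"
msg["From"], msg["To"] = "crawler@example.com", "client@example.com"
msg.set_content(body)
with smtplib.SMTP("mail.example.com") as smtp:
    smtp.send_message(msg)
print(body)
```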