The objective of the project, nicknamed "HT Jobs", was to "scrape" job openings from various Indian Government websites and other websites dedicated to job openings in the government sector in India, and to collect them into a database fronted by a web interface.
HT Jobs was built on Drupal 7.
Here, in a nutshell, is the complete workflow depicting how the whole process worked.
The application provides a list of search engines from which candidate URLs are gathered. Application-specific domains and search keywords can be supplied as queries along the lines of the samples shown below.
Three search engines, Google, Yahoo and Bing, are provided within the application by default. Other importers can be added if needed.
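For illustration only, a domain-restricted keyword query of the kind an administrator might enter could look like the following; these are hypothetical samples, not the queries actually used on the project.

  site:ssc.nic.in recruitment notification
  site:upsc.gov.in vacancy advertisement
  "government of india" jobs site:gov.in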
After crawling is done, content from all approved URLs is scraped by specifying the parsing logic (mapping) for each URL, which is handled by the URL technicians. The scraper then extracts the title and body from the pages and maps them to the relevant fields in Drupal. It is possible to edit a URL before scraping. Mapping is made available for 15 URLs in the system.
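As a rough illustration of what such a mapping looks like (the element names below are assumptions, since the real expressions depend on each source site's markup), a Feeds XPath Parser configuration for a typical listing page pairs each Drupal field with an XPath query:

  Context (base query) : //div[@id='content']
  title                : .//h1[@class='job-title']
  body                 : .//div[@class='job-description']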
A Workbench-based workflow is provided to act as a dashboard for the various roles on the site.
"Workbench" effectively provides overall improvements for managing content that Drupal does not provide out of the box. Workbench gives us three important solutions: a unified authoring dashboard, editorial access control, and a customizable moderation workflow.
Workbench Moderation is used to add moderation states beyond the Drupal defaults of "published" and "unpublished".
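As a minimal sketch, assuming the module's default state names (which may differ from the project's actual configuration), a node can also be pushed through the workflow programmatically with the API that Workbench Moderation exposes:

  <?php
  // Load an imported job node and move it into a review state.
  // The state machine names ('needs_review', 'published') are the
  // module's defaults and are used here only as examples.
  $node = node_load(123);
  workbench_moderation_moderate($node, 'needs_review');

  // Publishing is just another state transition.
  workbench_moderation_moderate($node, 'published');
  ?>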
Some of the roles used on the website are as follows (these roles correspond to the roles in any media-house publishing organization):
The main aim of the MVC architecture is to separate the business logic and application data from the presentation shown to the user.
In general, Drupal fulfills most of the criteria, including backend support, a flexible modular architecture, organization boundary support, less time and effort in implementation and customization, easy upgrades, high scalability, and strong community support. The flexible modular architecture of Drupal allows modules to be installed and removed at will, which saves resources and improves performance.
Organization boundary support allows multiple organizations to use the same code base, which also saves system resources. The result of this evaluation helped us decide to choose Drupal for the migration.
About 10 important contributed modules were used to assist us in crawling and scraping. A custom module was created mostly for templating and other general-purpose uses.
Aggregator - Aggregates syndicated content (RSS, RDF, and Atom feeds).
Feeds - Aggregates RSS/Atom/RDF feeds, imports CSV files and more.
Feeds Admin UI - Administrative UI for Feeds module.
Feeds Import - An example of a node importer and a user importer.
Feeds News - A news aggregator built with feeds, creates nodes from imported feed items. With OPML import.
Feeds XPath Parser - Parse an XML or HTML document using XPath.
Job Scheduler - Scheduler API
Rules Scheduler - Add-on to Rules for scheduling the execution of rule components using actions.
Workbench - Workbench Editorial Suite
Workbench Moderation - Provides content moderation services
Custom module - Templating and other general-purpose helpers (a sample declaration is sketched below).
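As an illustration of how such a custom module would be declared, a minimal .info file might look like the following; the machine name and the dependency list are assumptions, not the project's actual files.

  name = HT Jobs Custom
  description = Templating helpers and general-purpose glue for HT Jobs.
  core = 7.x
  dependencies[] = feeds
  dependencies[] = feeds_xpathparser
  dependencies[] = workbench_moderation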
Feeds is a pluggable system for importing or aggregating content into Drupal. Out of the box, it supports import from RSS/Atom/RDF feeds and CSV files, and it can create nodes, users, or taxonomy terms from the imported items.
The Feeds module made the crawling and scraping pieces fairly simple. Out of the box, it allowed the admin to enter URLs from Google/Bing/Yahoo and set mappings for the various fields used to pull in jobs.
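As a minimal sketch, assuming a Feeds importer with the machine name 'job_importer' (a hypothetical name, not the project's actual configuration), an import can also be kicked off from code rather than from the admin UI:

  <?php
  // Get the FeedsSource for a standalone importer and point it at a
  // URL produced by the search-engine crawl. The importer name and
  // URL are placeholders.
  $source = feeds_source('job_importer');
  $source->addConfig(array(
    'FeedsHTTPFetcher' => array('source' => 'http://example.gov.in/jobs.html'),
  ));
  $source->save();

  // Queue or run the import; Job Scheduler / cron picks it up when
  // background processing is enabled.
  $source->startImport();
  ?>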
The website being in Drupal 7 meant that some modules that were perfectly fine in Drupal 6 just didn't work in D7, or had bugs, because of the architectural changes in the D7 core.
We had to write patches for the HTML Tidy module (to handle invalid markup) and for the Menu per Role module. Getting the patches updated on Drupal.org was a challenge in itself.
The duplicate-entries issue (the same URLs being picked up by different crawlers) was resolved by writing custom code, which essentially compared the title and content to check whether the same job was being pulled in again.
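A simplified sketch of that idea is shown below; the module name, content type, and field handling are assumptions for illustration, not the project's actual code.

  <?php
  /**
   * Implements hook_node_presave().
   *
   * Leaves a freshly imported job unpublished when a node with the
   * same title and body already exists ('htjobs_dedupe' and the
   * 'job' content type are assumed names).
   */
  function htjobs_dedupe_node_presave($node) {
    // Only inspect new nodes of the job content type.
    if ($node->type != 'job' || !empty($node->nid)) {
      return;
    }
    $new_body = isset($node->body[LANGUAGE_NONE][0]['value']) ? $node->body[LANGUAGE_NONE][0]['value'] : '';

    // Find existing jobs with the same title.
    $nids = db_query(
      "SELECT nid FROM {node} WHERE type = :type AND title = :title",
      array(':type' => 'job', ':title' => $node->title)
    )->fetchCol();
    if (!$nids) {
      return;
    }

    foreach (node_load_multiple($nids) as $existing) {
      $old_body = isset($existing->body[LANGUAGE_NONE][0]['value']) ? $existing->body[LANGUAGE_NONE][0]['value'] : '';
      if ($old_body === $new_body) {
        // Same title and body: treat as a duplicate for editors to discard.
        $node->status = NODE_NOT_PUBLISHED;
        return;
      }
    }
  }
  ?>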
The rest of the content checking is done manually by the "Technician" user.
The project saved HT a lot of manual time and effort. Initially all of this data was maintained in spreadsheets, which brought problems such as data redundancy, constant manual checking of the government portals, and repeated updating of the spreadsheets.
Apart from that, and importantly, governance problems were resolved: users could access only the data that was relevant to them and could not modify or delete any critical data, even by mistake.