An example application of algorithms and data structures in the software industry

A very common question among people new or about to enter the software industry is where and how are algorithms and data structures used. Software development is mostly about applying known algorithms to a particular problem domain and tweaking it with some data structure either to fit the input space or to make some particular performance improvement due to some specificity of the problem domain.

Below is one such example from the domain of workplace analytics.

Let’s take the example of workplace analytics, a new domain where organizations are using data to find out what perks work, what attributes contribute more to employee satisfaction and performance. One example is Google finding out, based on data, that new mothers had a very high attrition rate, and Google increased the maternity leave by two months and it reduced the attrition in new mothers by 50%. Similarly, some sales focused organizations have found using workplace analytics tools, that employees not the ones working extra hours, but ones with more collaboration hours have better sales performance.

One common input to such workplace analytics tools is the organizational data, which has the hierarchy information as well, generally represented in tabular format where each row represents an employee and columns are the attributes of that employee with one column for the manager. Naturally, this data can be represented as a graph, and this has to be validated, to find if it is a single tree, or multiple trees, or if there are cycles. Why is this necessary? Generally, an organization will have one root, and a single tree. If the input is disjoint, or if there are cycles then it means that very likely the input is wrong, and the user of the tool will have to be notified of the same so that the user can make the necessary corrections and provide the input file again.

All the above validation must be done in real time, a few seconds, as in the user uploads the file and within a few seconds gets a report with all validation failures. Thankfully, as this is organizational data of people who need to be analyzed, the number of rows in the input are generally less than 100,000 and rarely in the range of a million. This a not a big data problem and can be solved in memory provided the correct algorithms and data structures are used.

The subtlety to note here is that one needs to think of this graph as a directed graph, and hence a simple two color DFS will not work, and one will have to use the three color DFS. The time complexity of the algorithm is O(V+E) where V is the number of vertices and E is the number of edges. As all uploads will have a approximately single row per employee, V=E, and time complexity becomes O(V), which is linear in terms of number of employees. For a general graph E = V^2, but due to the characteristics of the domain of this problem, the solution is linear instead of quadratic and can be done in memory in real time.

Comments

Popular posts from this blog

Performance improvement of MySQL inserts in Python using batching

Connect to MySQL 5.7 from Python using SSL

Connect to MySQL Server 5.7 from PHP 7.0 using SSL