Big Data Recruiting: Breaking Down the Hadoop Ecosystem

Intended Audience:  This post is for recruiters getting into Big Data technical projects.


What is Big Data?

Definition of Big Data: extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

As far as Big Data goes, for the purposes of sourcing and recruiting "Big Data" talent for your teams, focus on the current technology stack and ecosystem.  If your hiring manager wants to make sure potential candidates have worked in environments similar to or larger than your current setup, ask the manager for a few example questions that you can put to candidates.

1. Is your data as big as our "Big Data"?  How to screen candidates and match Big Data environments.

What are the five V’s of Big Data?

Answer: The five V's of Big Data are as follows:

  • Volume – Volume is the amount of data, which is growing at a high rate, e.g. data volumes measured in petabytes.
  • Velocity – Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
  • Variety – Variety refers to the different data types, i.e. various data formats such as text, audio, video, etc.
  • Veracity – Veracity refers to the uncertainty of available data. It arises because high data volumes bring incompleteness and inconsistency.
  • Value – Value refers to turning data into value. By turning the big data they access into value, businesses may generate revenue.

The above "What are the five V's of Big Data" is from

Tip:  Turn the Five V's information gathered from your hiring teams into pre-screen questions, and use it to build your "Target List" of potential companies to source from rather than jumping right in and looking at candidates.


2. Breaking down the ecosystem is key to finding great candidates.

The Hadoop ecosystem can be overwhelming for sourcers and recruiters who have never been exposed to HDFS. Don't make the mistake that I commonly do and get too deep in the weeds.  With every new tech stack you work with, it helps to zoom out a little and see how all the new buzzwords and acronyms fit together BEFORE you start searching for candidates and jumping on the phone with them.

Let's take a look at what I found when googling "Explain the Hadoop Ecosystem":

(Image: Hadoop ecosystem diagram for recruiters)


Now I'm a visual learner, so the fact that the above chart uses colors and groups the technologies is amazing.

Here is how I like to break complicated job requirements into a robust Boolean search string that gives you the best chance of finding all the qualified candidates.

The sourcer in you might look at the above chart and see all the valuable new keywords that can be used to search, and I'm right there with you. But hold on one moment.  Try organizing your keywords in a similar way to the above diagram.

Example of an organized Boolean Search String Stack:

Data Visualization: ("SAS Visual Analytics" OR Tableau OR "SAP Lumira" OR R OR D3.JS OR iCharts OR "Timeline JS" OR "Apache Zeppelin")

System Deployment: (Ambari OR Mesos OR Marathon OR HOYA OR BigTop OR Deploop OR "Apache Eagle" OR "Cloudera HUE" OR Myriad OR Brooklyn OR "Apache Helix" OR BuildLoop OR "SequenceIQ Cloudbreak")

Data Integration:

Service Programming:




Distributed Programming:

SQL on Hadoop:

NoSql Databases:

Data Integration:

Machine Learning:

Distributed File Systems:

Now keep in mind that the above sourcing search strings are not complete yet.  For each of the keywords used in the strings, we need to research and play around with all the synonyms and variations of these words that show up on profiles and resumes.  So I would call these my Version 1.0 strings.
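If you like to keep your stacks in a spreadsheet or a script, here is a minimal Python sketch of the same idea: one OR group per category, with room to add variations as you discover them. The function names and the variations listed ("SAS VA", "HUE") are my own illustrative assumptions, not researched synonyms, so swap in whatever you actually see on profiles.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def build_group(keywords, variations=None):
    """Join one category's keywords, plus any known variations, with OR."""
    variations = variations or {}
    terms = []
    for keyword in keywords:
        for term in [keyword, *variations.get(keyword, [])]:
            if term not in terms:  # keep the order, skip duplicates
                terms.append(term)
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


# A shortened copy of two stacks from this post.
stacks = {
    "Data Visualization": ["SAS Visual Analytics", "Tableau", "SAP Lumira"],
    "System Deployment": ["Ambari", "Mesos", "Cloudera HUE"],
}

# Hypothetical variations, for illustration only.
variations = {
    "SAS Visual Analytics": ["SAS VA"],
    "Cloudera HUE": ["HUE"],
}

for category, keywords in stacks.items():
    print(f"{category}: {build_group(keywords, variations)}")
```

Keeping each category as its own list makes it easy to move from a Version 1.0 string to a Version 2.0 string: you only touch the variations for the keyword you just researched.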

What is the Alternative?

As new sourcing tools pop up and AI search promises to start making our searches for us, it is very difficult to adjust those searches if we don't understand how they are made.

Here is an example of a boolean search string I got from one of the leading sourcing platforms when I entered "Hadoop Engineer".

(Image: screenshot of a new AI sourcing tool)


Now I love new sourcing tools just as much as the next sourcer, but take a minute to look at the automated search string this thing spit out.


(("hive" OR "hbase" OR "apache pig" OR "apache spark" ) AND ("MapReduce" OR "Hadoop" ) AND ("Cassandra" OR "big data" OR "spark" OR "pig" OR "apache storm" OR "kafka" OR "hdfs" OR "mahout" OR "vertica" ) )

Let's break this down by stacks:

("hive" OR "hbase" OR "apache pig" OR "apache spark" )

("MapReduce" OR "Hadoop" )

("Cassandra" OR "big data" OR "spark" OR "pig" OR "apache storm" OR "kafka" OR "hdfs" OR "mahout" OR "vertica" )

So this search string is telling whatever database or search engine runs it that Hive, HBase, Apache Pig, and Apache Spark are all interchangeable, and that is just not the case.  Take a look at the Hadoop ecosystem diagram again.

Sure, Hive and HBase are both "Big Data" technologies, but they serve different purposes. Hive is a SQL-like query engine, whereas HBase is a NoSQL data store, particularly for unstructured data.
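To see the mix-up at a glance, here is a rough Python sketch that splits the automated string into its OR groups and labels the terms in the first group by category. The category labels reuse the headings from my stack list, but the term-to-category mapping is my own assumption about where each tool belongs, so adjust it to match the diagram you are working from.

```python
import re

# The automated search string quoted in this post.
AUTO_STRING = (
    '(("hive" OR "hbase" OR "apache pig" OR "apache spark" ) AND '
    '("MapReduce" OR "Hadoop" ) AND '
    '("Cassandra" OR "big data" OR "spark" OR "pig" OR "apache storm" OR '
    '"kafka" OR "hdfs" OR "mahout" OR "vertica" ) )'
)

# Assumed mapping of terms to ecosystem categories (illustrative only).
CATEGORY = {
    "hive": "SQL on Hadoop",
    "hbase": "NoSQL Databases",
    "apache pig": "Distributed Programming",
    "apache spark": "Distributed Programming",
}


def split_groups(search_string):
    """Return one list of quoted terms per innermost parenthesized group."""
    groups = re.findall(r"\(([^()]*)\)", search_string)
    return [re.findall(r'"([^"]+)"', group) for group in groups]


def categories_in_group(terms, category_map):
    """List the distinct categories that the terms of one OR group fall into."""
    return sorted({category_map.get(t.lower(), "Unclassified") for t in terms})


groups = split_groups(AUTO_STRING)
print(groups[0])
print(categories_in_group(groups[0], CATEGORY))
```

If a single OR group comes back with more than one category, the string is treating tools with different jobs as if they were the same thing, which is exactly the problem with the first group here.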

So, my point is that if you don't make your own structured search strings in an organized fashion, you will be missing out.

Let me put it another way because I don't want the technology to cause confusion.

If you were building a search string to find possible lunch options, would this search string make sense?

(hamburger OR burger OR pizza OR "french fries" OR "soup" OR salad)

Nope. "french fries", salad, and soup are side dishes, so they should be put in a separate search string and then stacked.

(hamburger OR burger OR pizza) ("french fries" OR fries OR "salad" OR "soup")
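Here is a small Python sketch of why stacking matters, using the lunch example. It assumes that groups placed side by side are combined with AND, which is how most search platforms treat them, and it uses simple substring matching as a stand-in for a real search engine. The function name and sample menus are made up for illustration.

```python
def matches(text, stacked_groups):
    """True when the text contains at least one term from every group."""
    lowered = text.lower()
    return all(
        any(term.lower() in lowered for term in group)
        for group in stacked_groups
    )


# One flat group: mains and sides all treated as equals.
flat = [["hamburger", "burger", "pizza", "french fries", "soup", "salad"]]

# Two stacked groups: a main AND a side.
stacked = [
    ["hamburger", "burger", "pizza"],
    ["french fries", "fries", "salad", "soup"],
]

menu_a = "Soup of the day"
menu_b = "Burger with a side salad"

print(matches(menu_a, flat))     # the flat string accepts soup on its own
print(matches(menu_a, stacked))  # the stacked string wants a main as well
print(matches(menu_b, stacked))  # main plus side satisfies both groups
```

The flat string returns anything that mentions any one item, while the stacked string only returns results that hit every group. The same logic applies when the groups are Hadoop categories instead of lunch items.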

Hadoop Ecosystem (HDFS) Overview

