Three Literacy Tests for Hiring Data Engineers
Updated: Apr 2, 2022
Photo by Jon Tyson on Unsplash
Data is the foundation of the IT profession and testing potential hires for their data literacy is crucial. Surprisingly, or perhaps not surprisingly, data literacy follows under numerical literacy and literacy in general as not as common and one would hope.
Literacy - The Cover Letter
We can test job applicants for literacy by requiring them to write a cover letter. "Tell you about yourself and why you want to work here." It's a simple assignment and it will separate those who can write with those who can't. Now that remoteness has become the new normal (to my chagrin), the ability write out one's ideas in clear sentences is more important than it was before the pandemic.
Don't ever hire anyone that can't or won't write a cover letter, even if it's the lowest entry level position.
Numerical Literacy - Understanding 1000 days
How about testing for numerical literacy? Does the applicant understand numbers? Data is all about numbers, so we should see if they understand. Here are some questions:
What is 1000*1000?
Why ask this question? 1000 thousands is a million. A numerically literate person will be able to answer this question off the tip of their tongue, without doing calculations. Numerical literacy is about knowing what numbers mean, so starting with a thousand gives you a good start into see if the applicant understands what 1000 actually means in the real world. Some follow ons:
How many years is 1000 days?
This is a great one for data engineers. They should now 365 days in a year and so divide by 1000 and they get around 2.7 years. Date math is the foundation of data analysis.
Now let's talk about a million. Do they understand a million? Building on the 1000x1000 question, ask them how many years in a million days? They should be able to build on previous concepts, so they will say around 2700 years. That's good enough. It will let you know that they know date math - also, that they know how numbers related in the real world. 2700 years a long time and 1 million is a lot bigger than 1000. I'm amazed at how few "data engineers" actually understand this.
Now for something that tests for complexity knowledge without needing to regurgitate algorithms.
What is the log( one million )
This is a great question because they might say 6 (log of 10) correct or they might say around 10 (13 and some change) for the log_2. Both answers will tell you something about the candidate and both are good answers. Round this questioning out with a philosophical question.
What is the significance of n*log(n) algorithms?
A really good data engineer will have a fascinating answer.
Testing for Data Literacy
SQL is the language of data. To be data literate, one must know the language of SQL and how it is used. To test for data literacy, give the applicant a set of questions to answer and have them write the SQL. Review the SQL with them after they have answered every question.
Warm them up with some theory. A little will let you gauge their data sophistication.
Ask them what the difference is between a star schema and a snowflake schema. They should have some idea that a star schema involves a set of dimensions revolving around a fact and a snowflake schema is a normalized version of that with multiple facts. The answer can be fairly loosy-goosy, but if they say, "I don't know," then they don't know data. If they say something about pivot tables, then that's good enough - they know the concept if not the words.
Give them a question to see if they know how to find holes in data.
How to write a query to output the missing records for either of the two lists?
select c1.email , c2.email from customer_1 c1 full outer join customer_2 c2 on c1.email = c2.email where c1.email is null or c2.email is null
Double kudos if they use the "comma-first" notation. That will indicate that they are used to working with large numbers of columns and commenting or uncommenting as they do analysis.
How to count the distinct number of missing records for the customer_2 table?
select count(distinct c1.email) from customer_1 c1 full outer join customer_2 c2 on c1.email = c2.email where c2.email is null;
This one is a bit tricky and will require them to understand that they have the records for customer_1 and we will count the values where the customer_1 email does not join.
Suppose a mailer was only sent out to all the customers in the customer_1 list, but not the customer 2 list. I record every email sent in an email_sent table by email. How would I use a CTE to query the email_sent table for the customers in the customer_2 list that were not sent the email?
with unsent_customer_2 as ( select distinct c2.email from customer_1 c1 full outer join customer_2 c2 on c1.email = c2.email where c1.email is null ) select us2.email from unsent_customer_2 us2 left outer join unsent_customer_2 es on us2.email = es.email
Suppose I want to send coupons the top purchaser of the year and anyone who has not purchased in the year. How could I use the list of all emails to get the top purchaser and any purchasers who did not buy anything?
with all_emails as ( select distinct coalesce(c1.email, c2.email) as email from customer_1 c1 full outer join customer_2 c2 on c1.email = c2.email ) select rank() over (order by sum(coalesce( case when PURCHASE_DATE > dateadd(year, -1, current_timestamp) then amount else null end, 0)) desc) top_purchaser , sum(case when PURCHASE_DATE > dateadd(year, -1, current_timestamp) then amount else 0 end) yearly_amount , ae.email from all_emails ae left outer join purchases p on p.email = ae.email group by ae.email qualify top_purchaser = 1 or yearly_amount = 0;
A well-disciplined hiring process helps bring consistent results. It’s the most important thing a team does. A standardized hiring checklist creates a fair process that starts the integration of a new hire into the team even as they are in the interview. When the team members know well what the newcomer knows and does not know, it builds the expectation and the bonds that make for great teams.
Solutions to the SQL questions are here: https://github.com/timowlmtn/bigdataplatforms/tree/master/data/sql_test