This content originally appeared on Level Up Coding - Medium and was authored by Dakota Smith
I walk step-by-step through the process of answering a Yelp SQL interview question labeled “Hard” by StrataScratch
Introduction
StrataScratch is a platform for practicing real interview questions from companies such as Meta, Netflix, Microsoft, and many others. In this article, I’ll be walking step-by-step through the process of answering a SQL interview question from the online directory and review publisher, Yelp.
If you’re a hands-on learner, you can sign up for a free account and follow along with the coding steps here. Please note that while I do like StrataScratch for practicing SQL problems, I have no personal or professional affiliation with the company!
This article assumes a base level of understanding of SQL fundamentals, such as SELECT, FROM, and WHERE. Documentation for the more intermediate-level functions in the final query will be provided as they’re introduced.
The Question
Title: Top 5 States With 5 Star Businesses
Description: Find the top 5 states with the most 5 star businesses. Output the state name along with the number of 5-star businesses and order records by the number of 5-star businesses in descending order. In case there are ties in the number of businesses, return all the unique states. If two states have the same result, sort them in alphabetical order.
Data Schema:
Step 1: Determine the Deliverables
Probably one of the most important SQL skills you can learn is the ability to quickly parse business questions for the pieces that will end up in your final SQL query. In the real world, stakeholders will rarely tell you what to “output” in terms that perfectly match your schema, but the description above still needs to be parsed and translated into SQL.
Before we write any SQL—before even looking at the data schema, perhaps—we have to make sure we understand what’s being asked. So in the simplest terms, what is Yelp asking for?
- The state name
- The number of 5-star businesses in each state
That’s it. Yes, the answer must include only the top 5 states, and the records have to be ordered a certain way, but that’s delivery, not the deliverables. And now that we know our deliverables, we can start writing our query.
Step 2: Imagine a Perfect World
We know our query will have a SELECT and FROM statement. The data has to come from somewhere (a table), and even if we wanted every field from that table, we’d still have to select all of them (using *).
There’s only one table, yelp_business, so we know that’s where the data is coming from. We also know what we’re trying to select, thanks to figuring out the deliverables beforehand.
In this step, we’re going to build the foundation by imagining a perfect world:
SELECT state, num_businesses
FROM yelp_business;
If you’re following along or poring over the data schema, you’ll know that this is a “perfect world” because the query returns an error. That’s because num_businesses isn’t a field already present in the data. It’s what we ultimately want, but it doesn’t exist yet. We have to create it.
Step 3: Break Down the Logic
Steps 1 and 2 are crucial, but Step 3 is where concept meets execution. We’re going to take it slower from here on out.
There are three logical components to num_businesses:
- It’s a count of something
- That something is 5-star businesses
- And we need that count by state (or said more technically, aggregated on the state level)
Each of these components are factored in different places within the query. Let’s go through them one by one:
- A count of something. We can return a count of records by adding COUNT(*) to the SELECT statement
- 5-star businesses. Each business’s star rating is found in the stars field. We can filter only for businesses that have a 5-star rating using the WHERE clause: WHERE stars = 5
- Aggregating the counts by state. To instruct SQL to aggregate (or group) the counts by state, we only need a GROUP BY clause: GROUP BY state
Put together, it looks like this:
SELECT state, COUNT(*) AS num_businesses
FROM yelp_business
WHERE stars = 5
GROUP BY state;
Note that I aliased our count as num_businesses. This isn’t just for readability’s sake, though that would be a good enough reason on its own; the aliasing will be important later.
How to decide what comes next? Well, it comes down to preference. I like to get the simple stuff out of the way. It’s all too easy to forget the small, simple details after overcoming the challenging parts.
The description tells us to order the results of our query by the number of 5-star businesses (num_businesses) in descending order—i.e., most to fewest. It also tells us that in the place of a tie, the results should order alphabetically by state.
We can use an ORDER BY clause at the end of our query to achieve this. And since the alphabetization of state only occurs in the case of a tie, num_businesses will come before state:
SELECT state, COUNT(*) AS num_businesses
FROM yelp_business
WHERE stars = 5
GROUP BY state
ORDER BY num_businesses DESC, state
Next comes the tricky part. The good thing is, we already have a lot of the business logic translated into SQL. We now only need it to do one more thing: Return only the top 5 states with the most 5-star businesses.
There is a tempting pitfall to stumble into here. It’s easy at first to think: Okay, simple. Because the results are already ordered according to the number of 5-star businesses, all I have to do is add LIMIT 5 and the query will only return the top 5 states.
If we were to run this query, the answer would be incorrect. That’s because of this parameter in the description:
“In case there are ties in the number of businesses, return all the unique states.”
Which suggests that it’s possible there’s a tie between states. Upon investigating this, it turns out there’s a 3-way tie for 4th place (to see this, set your LIMIT to 6). When we limit the results to only the top 5 records, the query leaves out that third state tied for 4th.
How do we get that third state? There’s a few different ways. What you decide on will again depend partly on preference and potentially also on which SQL language you’re using. (I use MySQL.)
But let’s think about it logically first, because the bigger challenge here is truly understanding the question. When working with stakeholders in the real world, you’ll find that there’s sometimes a gap between the way they word their request and what they’re really asking for. Even in the best-case scenarios, they still most likely don’t know how to word it for you in SQL terms—if they did, they wouldn’t have to come to you.
The question is asking for the top states according to how they rank in number of 5-star businesses. So what we really need here is the ability to rank the states, and then filter on that ranking.
To do this, I’m going to use a CTE, or Common Table Expression. You could also use a subquery in the FROM statement to the same effect. It’s my opinion that CTEs sound more intimidating to beginners even though they’re essentially just a way of assigning your subquery to a variable (typically making your code easier to read.) Learning resources also tend to teach subqueries before CTEs, suggesting that the latter is more difficult. Again, it comes down to a matter of preference.
But why do we need to use either of these things? Because if you stopped at simply adding the rankings using a RANK() function, you wouldn’t be able to filter on it. Let me show you what I mean.
SELECT state, COUNT(*) AS num_businesses, RANK() OVER(ORDER BY COUNT(*) DESC) AS state_rank
FROM yelp_business
WHERE stars = 5
GROUP BY state
ORDER BY num_businesses DESC, state
This query indeed gets us one step closer, but the problem is I can’t add a AND state_rank <= 5 to the WHERE statement. That’s because the WHERE statement doesn’t recognize our calculated fields—if we tried to filter on num_businesses, it would return the same error.
What we need to do is “flatten” the state_rank field so that the WHERE statement can act on it. We can do this by putting the query above in a CTE, or Common Table Expression.
The inner-workings of RANK() and CTEs each deserve their own article, and are thus a little beyond the scope of this one. To learn more, read here about RANK() and window functions, and here about CTEs.
When we run the following query, SQL will first temporarily store the results of the CTE in memory, allowing the rest of the code to execute on that data as if it were its own table in the database.
Again—the CTE here is ultimately nothing more than the results of the above query. It’s positioned at the top of the code below:
WITH states_with_ranks AS
(SELECT state, COUNT(*) AS num_businesses, RANK() OVER(ORDER BY COUNT(*) DESC) AS state_rank
FROM yelp_business
WHERE stars = 5
GROUP BY state)
SELECT state_rank, state, num_businesses
FROM states_with_ranks
WHERE state_rank <= 5
ORDER BY num_businesses DESC, state
;
Everything after WITH through GROUP BY state) is the CTE. Everything below it is our query. See how the FROM statement in our query references the name of the CTE, states_with_ranks, rather than yelp_business?
The CTE effectively “flattens” the calculated fields by temporarily storing the results of the calculations in memory. We can then select the state_rank and num_businesses fields by name alone. And most importantly, the WHERE statement then recognizes them as fields that can be filtered on.
If you’re having trouble understanding the CTE part but understand how subqueries work, it might be helpful to see the above CTE’s exact code repositioned in a subquery inside of the FROM statement:
SELECT state_rank, state, num_businesses
FROM
(SELECT state, COUNT(*) AS num_businesses, RANK() OVER(ORDER BY COUNT(*) DESC) AS state_rank
FROM yelp_business
WHERE stars = 5
GROUP BY state
ORDER BY state, num_businesses) AS states_with_ranks
WHERE state_rank <= 5
ORDER BY num_businesses DESC, state
;
Note as well that in either case, the ORDER BY clause is moved to the very end. This ensures the results are ordered properly.
In Summary
Like anything else, the more you translate business logic into SQL, the better you get at it. This is useful in an interview of course, but it’s arguably even more important after you’ve landed the job.
There are some who believe the SQL interview should be phased out, and maybe they’re right. But that’s unlikely to happen anytime soon, and learning how to break down a problem into its composite parts is an invaluable skill no matter what language you’re using.
Have a question? Let me know in the comments. Feel free to connect with me on LinkedIn as well.
Oh, and if you have a SQL interview coming up—good luck!
How to Break Down a “Hard” SQL Interview Question was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.
This content originally appeared on Level Up Coding - Medium and was authored by Dakota Smith
Dakota Smith | Sciencx (2022-06-16T12:12:03+00:00) How to Break Down a “Hard” SQL Interview Question. Retrieved from https://www.scien.cx/2022/06/16/how-to-break-down-a-hard-sql-interview-question/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.