SQL Not Exists: Filter Out Records that Exist in a Subquery

mysql vs mysqliThe SQL language has a number of ways to filter record sets. The “where” clause in your select statements is where most people list the business rules that filter out records. You can use “JOIN” statements with SQL in them, but these are usually more difficult to read. The “NOT EXISTS” statement uses a subquery to filter out records that do not exist in the underlying subquery. This logic is sometimes hard to grasp for new SQL coders, but this article explains the logic and alternatives to the NOT EXISTS statement.

Get an introduction to SQL with a course at Udemy.com.

Subqueries and SQL Statements

A subquery is basically a query within a query. The following query is a standard, simple SQL select statement:

select * from customers where customerId = 5

The “select *” part tells the SQL database to return all columns. The asterisk in SQL indicates that you want to return all columns, but you can also specify columns individually. As a matter of fact, most database designers prefer that you specify columns, because the asterisk returns audit information that you don’t need such as creation dates, update dates and foreign key information.

The “where” clause indicates that you want to return just customers with an id of 5. You can, of course, return more than one record. You can assume from the above statement that “customerId” is the unique column. A customerId should only be one value per customer, so you can assume the above query only returns one record. However, it also depends on the database design. A good database design should have column names that identify the type of information contained in a specific column.

Instead of using the “where customerId = 5” clause, you can add a subquery. That’s where NOT EXISTS works. For instance, take the following query:

select * from customers where NOT EXISTS (select customerId from orders)

The above statement basically says “give me all customer records where the customerId does not exist in the orders table.” Therefore, the business logic is “I want to see all customers who do not have orders.” The NOT EXISTS is the opposite of the EXISTS statement, so you just reverse the business logic if you want to see the opposite results. For instance, the following statement gives you the opposite results:

select * from customers where EXISTS (select customerId from orders)

The above statement says “give me all customer records where the customer has placed an order.” You’ll probably need to use EXISTS and NOT EXISTS in different parts of your application depending on your business logic.

When you run a query such as the ones above, it should be noted that the subquery runs first. The subquery in the EXISTS and NOT EXISTS statements is the query that returns order records. This query runs first. Then, the main or “outer” query runs. In the above examples, the outer query is the select statement based on the customers table. Consider the following query:

select * from customers where NOT EXISTS (select customerId from orders where createdate >= getdate() - 30)

In this statement, a list of orders is returned based on the order date. It’s assumed that “createdate” is the column that contains the order’s creation date. Therefore, any order placed within the last 30 days is returned from the subquery. These records are then used for the outer query, which is again the select statement run against the customers table.

New to the SQL language? Take a course at Udemy.com to learn how to program for your database.

Combining NOT EXISTS with Other Where Clause Filters

The where clause can take multiple logic filters. Suppose you want to get a list of customers who placed an order, but you only want customers who have signed up within the last 60 days. For instance, you might want to know how many of your customers placed orders, but you want “new” customers. “New” is an arbitrary value, so you set that value at 60 days. You can write SQL statements that match this business logic. You can usually write SQL statements in several ways to match the business logic, but in this example you want to use the NOT EXISTS statement. The following code is a SQL statement that codes for the specified business logic:

select * from customers where EXISTS (select customerId from orders where createdate >= getdate() - 30) and NOT EXISTS (select customerid from customers where createdate <= getdate() – 60)

The above statement combines the EXISTS and NOT EXISTS statements. First, the EXISTS subquery runs. This is a subquery that gets a list of customers where orders were placed within the last 30 days. Next, the NOT EXISTS subquery runs. This subquery gets a list of customers that were created prior to 60 days ago. Since the second subquery uses the NOT EXISTS statement, the main query does a match with the NOT EXISTS subquery against the customer database, and filters out records where they exist in the subquery. Therefore, it only gets records where the customerId is not located in the NOT EXISTS subquery.

The opposite is true for the EXISTS statement. This part of the where clause keeps records that exist in the corresponding EXISTS subquery.

You can make your SQL statements much less complicated. The above statement can be written in different ways, and these ways are probably a lot less complicated.  The typical way to rewrite a SQL statement with an EXISTS or NOT EXISTS statement is with join statements. The following code is an example of rewriting the above statement:

select * from customers c join orders o on c.customerId = o.customerid where o.createdate >= getdate() – 30 and c.createdate <= getdate() – 60

This statement contains the same business logic as the previous ones. The difference is the join statement. For most database designs, your database administrator or DBA will require you to use joins instead of subqueries where appropriate. With the join, the two tables customers and orders are joined together on a primary key and foreign key. It’s assumed that the customers table “customerId” column is the primary key for the customers table, and the customerId in the orders table is the foreign key. You can assume by the query that the customerId is unique, and primary indexes are clustered indexes. You would then assume that the foreign key in the orders table is set up as an index. This design will keep your table’s performance efficient.

You might want to know the difference between the subqueries and the joins. The one major difference you have to account for is nulls. Some database designers allow for nulls, but even if they don’t, joins will be affected by the linked tables. In the join statement, if a corresponding order does not exist, the record is complete filtered out.

For instance, suppose you have a customer with an order and a customer without an order. With the join statement, you will not get the customer without an order in the record set. The reason is because the join statement does not have a corresponding order to link to, so the customer is filtered out. With the exist statement, nulls still count. For instance, if your database isn’t properly set up, you might have some null values in the orders customerId column. These records would still return when the subquery’s where clause is based on the orders creation date. Nulls would match up if you have one in the customers table and one in the orders table.

The optional way to get these null values is with a left join. A left join returns all customer records and then any linked orders records. If there is no record in the corresponding orders record, the columns for this table will return null.

The join and subquery concepts are hard to grasp, so the best way to test them is to write them in your SQL tool and run them against your database. The following code uses the left join instead of a join for your table:

select * from customers c left join orders o on c.customerId = o.customerid where o.createdate >= getdate() – 30 and c.createdate <= getdate() – 60

Technically, you should get the same data set, but the left join can help you find orphaned records, which are records that exist in the orders table with no joining customer Id. This is bad table design, and your SQL database engine should block deletion of a customer without deleting customer orders. For this reason, most database administrators avoid allowing delete queries run against any table. Instead, you have an “active” and “inactive” column. You set a column to inactive, set its audit date and then create a new record. This record becomes your active record and your stored procedures should account for these active and inactive records.

The “IN” and “NOT IN” Statements

EXISTS and NOT EXISTS are the two preferable statements used in SQL procedures. However, you can also use IN and NOT IN. These two statements also use subqueries to filter out records. Using the above EXISTS and NOT EXISTS statements, the following code switches them out for the IN and NOT IN statements:

select * from customers where IN (select customerId from orders where createdate >= getdate() - 30) and NOT IN (select customerid from customers where createdate <= getdate() – 60)

The above statement has the same business logic as the previous statements and returns the same record set for your application.

Execution plans for NOT IN and NOT EXISTS are different. However, you want to choose the plan that is the fastest but still matches your business logic. SQL Server Management Studio has an execution plan that tells you which one is most efficient and follows business logic.

Performance and speed are important factors when determining your execution plans and how you design your queries. You want queries to obtain and edit your data exactly how business logic specifies, but you don’t want to use functionality that takes too long. For instance, you want to focus your where clause on columns that contain indexes.

Indexes help order your database tables. These database objects greatly increase performance. For instance, suppose you do several queries on a customers column that contains the customerId. You’d want to create queries that run against this column, because not only is it the primary key and unique, but it also uses an index. Primary keys are automatically set as indexes, so you don’t need to manually create an index when you set a primary key. However, you need to manually set other indexes on your tables.

For instance, suppose your customer table contains a signup date column and you use this column in many of your queries. You can set an index on this column to speed up queries. You also want to join on these columns. Joins should use indexed columns, because these speed up queries as well.

Using indexes and execution plans, you can keep your queries streamlined and avoid slowing down your database. If you have several queries that run against your database tables, it can take several seconds to return a record set. This might not seem like a lot of time, but seconds matter when you create websites such as ecommerce stores. Users are more likely to bounce when it takes too long to load a web page. For this reason, search engines such as Google also use speed as a metric to rank your pages.

The easiest way to determine the best statement is to write down your business logic. Does EXIST make sense? Does NOT EXIST make sense? Make sure you thoroughly test your queries before you upload them to your production database. Check your tables for indexes and any data integrity issues. A database administrator will help you, but unforeseen issues happen when you don’t design your tables properly.

Udemy.com can help you with your table design and understanding the SQL language. The SQL language is similar across platforms, but there are slight differences between SQL Server, MySQL and Oracle. Take a beginner SQL course at Udemy.com to familiarize yourself with the language.