SQL: Delete Duplicate Rows but Keep One, with Code Examples

When working with large datasets in SQL, it's common to find duplicate rows in your tables. Duplicates waste storage, slow down queries, and skew results such as counts and sums. That's why it's important to know how to delete duplicate rows but keep one.

In this article, we'll review different methods to remove duplicate rows in SQL using practical code examples. The examples are written for PostgreSQL; some of them, such as DISTINCT ON, are PostgreSQL-specific and need adapting before they run on MySQL.

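Before diving into the examples, let's define what we mean by "duplicate rows." Simply put, duplicate rows are those that contain identical values in all data columns. An auto-generated key such as id is different for every row, so we ignore it when comparing rows. For instance, consider the following table.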

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  email TEXT NOT NULL,
  age INTEGER NOT NULL
);

If we insert the following rows into this table:

INSERT INTO users (name, email, age) VALUES
('John', 'john@example.com', 35),
('Mary', 'mary@example.com', 28),
('John', 'john@example.com', 35),
('Alex', 'alex@example.com', 42),
('Mary', 'mary@example.com', 30),
('Alex', 'alex@example.com', 42);

We'll obtain the following table, which includes some duplicate rows. Rows 1 and 3 are duplicates of each other, and so are rows 4 and 6. Rows 2 and 5 are not duplicates, because the age differs.

 id | name |      email       | age
----+------+------------------+-----
  1 | John | john@example.com |  35
  2 | Mary | mary@example.com |  28
  3 | John | john@example.com |  35
  4 | Alex | alex@example.com |  42
  5 | Mary | mary@example.com |  30
  6 | Alex | alex@example.com |  42

To remove the duplicates, we need to keep one of the identical rows and delete the others. That's exactly what we'll show you how to do now.
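Before deleting anything, it can help to see which groups actually contain duplicates. The following is a minimal sketch against the users table defined above; the column alias copies is only an illustrative name.

```sql
-- List each (name, email, age) combination that occurs more than once.
SELECT name, email, age, COUNT(*) AS copies
FROM users
GROUP BY name, email, age
HAVING COUNT(*) > 1;
```

With the sample data, this should report the John and Alex rows with two copies each, and neither of the Mary rows.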

Method 1: Using DISTINCT and GROUP BY

The first method uses DISTINCT ON to preview the unique rows and GROUP BY to decide which row of each group to keep.

SELECT DISTINCT ON (name, email, age) * FROM users;

The DISTINCT ON clause, a PostgreSQL extension, specifies which columns should be taken into account to determine unique rows. In our example, we've specified name, email, and age to obtain distinct combinations of those attributes. The query returns one row for each unique combination. Without an ORDER BY clause, which row of each group is returned is unpredictable, so don't rely on it being the first one inserted.
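To make the choice of row explicit, an ORDER BY can be added. This is a small sketch for PostgreSQL; the DISTINCT ON columns have to come first in the ORDER BY, and the trailing id decides which row of each group is returned.

```sql
-- Return the row with the lowest id from each (name, email, age) group.
SELECT DISTINCT ON (name, email, age) *
FROM users
ORDER BY name, email, age, id;
```

Sorting by id DESC instead would return the row with the highest id from each group.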

To delete the duplicate rows, we identify the rows we want to keep and delete everything else. In PostgreSQL, a subquery can select the id to keep from each group, and DELETE removes every row whose id is not in that list. Here's how.

DELETE FROM users
WHERE id NOT IN (
  SELECT min(id)
  FROM users
  GROUP BY name, email, age
);

This query identifies the rows to keep by applying the MIN aggregate function to the id column within each group, so the row with the lowest id in each group of identical values survives. The DELETE then removes every row whose id is not returned by the subquery. Once we run this query, we end up with the following table.

 id | name |      email       | age
----+------+------------------+-----
  1 | John | john@example.com |  35
  2 | Mary | mary@example.com |  28
  4 | Alex | alex@example.com |  42
  5 | Mary | mary@example.com |  30
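On MySQL, a DELETE whose subquery reads from the same table it deletes from is rejected (error 1093). A common workaround, sketched here, is to wrap the subquery in a derived table so MySQL materializes the list of ids first. The names keep_id and keepers are illustrative aliases, not part of the original example.

```sql
-- MySQL variant: the inner query is materialized as the derived table "keepers".
DELETE FROM users
WHERE id NOT IN (
  SELECT keep_id FROM (
    SELECT MIN(id) AS keep_id
    FROM users
    GROUP BY name, email, age
  ) AS keepers
);
```

The logic is the same as in the PostgreSQL version: keep the lowest id per group and delete the rest.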

Method 2: Using ROW_NUMBER and Common Table Expressions (CTE)

The second method uses the ROW_NUMBER function and Common Table Expressions (CTE) to assign sequential numbers to the rows and then delete the duplicate rows.

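WITH numbered_rows AS (
  SELECT id, name, email, age,
    ROW_NUMBER() OVER (PARTITION BY name, email, age ORDER BY id) AS r_id
  FROM users
)
-- PostgreSQL cannot delete from a CTE directly, so delete from users by id.
DELETE FROM users
WHERE id IN (SELECT id FROM numbered_rows WHERE r_id > 1);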

This query creates a CTE called numbered_rows that includes all columns from the users table plus a calculated column r_id. The ROW_NUMBER function numbers the rows that share the same values in the columns specified in the PARTITION BY clause, starting again at 1 for each group and counting in id order. In our example, numbered_rows contains the following rows.

 id | name |      email       | age | r_id
----+------+------------------+-----+------
  1 | John | john@example.com |  35 |    1
  3 | John | john@example.com |  35 |    2
  2 | Mary | mary@example.com |  28 |    1
  5 | Mary | mary@example.com |  30 |    1
  4 | Alex | alex@example.com |  42 |    1
  6 | Alex | alex@example.com |  42 |    2

The DELETE statement then deletes the rows where the r_id is greater than one, which are the duplicates. After running this query, the table will look like this.

 id | name |      email       | age
----+------+------------------+-----
  1 | John | john@example.com |  35
  2 | Mary | mary@example.com |  28
  4 | Alex | alex@example.com |  42
  5 | Mary | mary@example.com |  30
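Because a DELETE is hard to undo, one cautious way to run it in PostgreSQL is inside a transaction, with a RETURNING clause that shows what was removed. The following is a sketch of that pattern, not the only way to do it.

```sql
BEGIN;

WITH numbered_rows AS (
  SELECT id,
    ROW_NUMBER() OVER (PARTITION BY name, email, age ORDER BY id) AS r_id
  FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM numbered_rows WHERE r_id > 1)
RETURNING id, name, email, age;

-- Inspect the returned rows, then run exactly one of:
-- COMMIT;    -- make the deletion permanent
-- ROLLBACK;  -- undo it
```

Until COMMIT runs, other sessions still see the original rows, and ROLLBACK restores them.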

Method 3: Using DISTINCT and a Temporary Table

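The third method avoids a subquery in the DELETE. Instead, it copies one row per duplicate group into a temporary table, empties the original table, and loads the saved rows back.

BEGIN;
-- Keep the lowest id from each (name, email, age) group.
CREATE TEMPORARY TABLE temp_users AS
SELECT DISTINCT ON (name, email, age) * FROM users
ORDER BY name, email, age, id;
-- Empty the original table, then reload the deduplicated rows.
TRUNCATE users;
INSERT INTO users SELECT * FROM temp_users;
DROP TABLE temp_users;
COMMIT;

This code creates a temporary table called temp_users that stores one row per unique combination, using the same DISTINCT ON query as in Method 1 with an ORDER BY added so that the lowest id is kept. It then truncates users and copies the saved rows back. Note that we reload the original table rather than dropping it and renaming temp_users: a temporary table stays temporary after a rename and disappears when the session ends, and CREATE TABLE AS does not copy the primary key, column defaults, or indexes.

The downside of this approach is that it rewrites the whole table and locks it exclusively while it runs, which can be disruptive on a large or busy table. TRUNCATE also fails if other tables reference users through foreign keys. Also, keep in mind that DISTINCT ON is specific to PostgreSQL, and the syntax and behavior of temporary tables vary between database systems.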

Conclusion

In this article, we've explored different methods to delete duplicate rows while keeping one. The examples use PostgreSQL syntax, and the same ideas carry over to MySQL with some adjustments. By applying these techniques to your own projects, you'll be able to keep your data clean and your queries accurate.

The following sections give more background on the concepts used in the methods above.

DISTINCT and GROUP BY

The DISTINCT clause removes duplicates from the result set of a SELECT query by returning only unique rows. It applies to the whole list of selected columns, not to a single column. For instance, consider the following query:

SELECT DISTINCT name, email FROM users;

This query returns only unique combinations of the name and email columns from the users table.

On the other hand, the GROUP BY clause is used to group rows that have the same values in one or more columns and apply an aggregate function to each group. For example, consider the following query:

SELECT name, count(*) as num_users FROM users GROUP BY name;

This query groups the rows by the name column and counts the number of occurrences of each name. Run against the original six rows, the result will look like this (the row order is not guaranteed without an ORDER BY):

 name | num_users
------+-----------
 John |         2
 Mary |         2
 Alex |         2

In Method 1 above, DISTINCT ON let us preview one row per combination of name, email, and age, while GROUP BY with MIN(id) selected the row to keep in each group. Every row outside that set was deleted, which left the row with the lowest id for each unique combination.

ROW_NUMBER and CTEs

The ROW_NUMBER() function assigns sequential numbers to the rows within a partition of a result set. The PARTITION BY clause specifies the column(s) used to partition the rows, while the ORDER BY clause specifies the column(s) used to sort the rows within each partition.
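The ORDER BY inside the OVER clause is what decides which row in each group gets number 1, and therefore which row would be kept. As an illustrative sketch, sorting by id in descending order numbers the newest row first:

```sql
-- r_id = 1 now marks the row with the highest id in each group.
SELECT id, name, email, age,
  ROW_NUMBER() OVER (PARTITION BY name, email, age ORDER BY id DESC) AS r_id
FROM users
ORDER BY name, email, age, r_id;
```

With the original six sample rows, r_id = 1 falls on ids 3 and 6 for the duplicated John and Alex rows, and on ids 2 and 5 for the two distinct Mary rows.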

In Method 2 above, we used ROW_NUMBER() in a Common Table Expression (CTE) to number the rows within each group of identical name, email, and age values. The first row in each group gets r_id = 1, so the rows with r_id greater than 1 are the duplicates to delete.

Using a CTE allowed us to define a named temporary result set that can be referenced multiple times in the same query. This helps us to simplify and organize our SQL code, especially when dealing with complex queries.

Temporary Tables

Temporary tables are tables that exist only during the lifetime of a session or a transaction and are automatically dropped when the session or transaction ends. They are useful for storing intermediate results or for managing data that is not part of the permanent schema.
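In PostgreSQL, the lifetime of a temporary table can be tied to a transaction with ON COMMIT DROP. The sketch below uses a hypothetical table name, scratch_users, purely to show the behavior.

```sql
BEGIN;

-- scratch_users is a hypothetical name; it exists only until COMMIT.
CREATE TEMPORARY TABLE scratch_users ON COMMIT DROP AS
SELECT name, email, age FROM users;

-- The table can be queried like any other while the transaction is open.
SELECT COUNT(*) FROM scratch_users;

COMMIT;

-- After COMMIT the table no longer exists, so this would fail:
-- SELECT COUNT(*) FROM scratch_users;
```

Without ON COMMIT DROP, a temporary table lasts until the session ends or the table is dropped explicitly.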

In Method 3 above, we used a temporary table to hold one row per duplicate group, created and filled in a single step with CREATE TEMPORARY TABLE ... AS SELECT DISTINCT ON. The important caveat is that a temporary table remains temporary even if it is renamed, so the deduplicated rows must end up in a permanent table before the session ends.

Using a temporary table allowed us to avoid using a subquery to delete the duplicates and instead simplified the query into several steps.

Conclusion

In summary, removing duplicates from a table is an essential part of data management in SQL databases. There are various methods for identifying and deleting duplicates, and the choice depends on the specific SQL syntax and database engine used. Understanding the concepts of DISTINCT, GROUP BY, ROW_NUMBER, CTEs, and Temporary Tables can help you write efficient and effective SQL queries. With practice, you'll become proficient in managing duplicates in your SQL databases.

Popular questions

Q1: What is the purpose of deleting duplicate rows from a SQL table?
A1: The purpose of deleting duplicate rows from a SQL table is to remove redundant data that can potentially cause problems such as slower performance, inaccurate results, and increased database storage.

Q2: What is the difference between DISTINCT and GROUP BY in SQL?
A2: DISTINCT and GROUP BY can both eliminate duplicates from a result set, but they work differently. DISTINCT removes duplicate rows from the entire result set, while GROUP BY collapses rows that share values in one or more columns into groups and lets you apply aggregate functions to each group.

Q3: What is a Common Table Expression (CTE) in SQL?
A3: A Common Table Expression (CTE) is a named temporary result set defined with a WITH clause at the start of a SELECT, INSERT, UPDATE, or DELETE statement. It exists only for the duration of that statement and is useful for simplifying complex queries and improving code readability and maintainability.

Q4: How does the ROW_NUMBER() function work in SQL?
A4: The ROW_NUMBER() function assigns a sequential number to each row within a result set based on the specified PARTITION BY and ORDER BY clauses. It is commonly used to identify duplicate rows by assigning a unique number to each occurrence of a group of identical values.

Q5: What are Temporary Tables in SQL and how are they useful?
A5: Temporary Tables in SQL are tables that only exist for the duration of a session or transaction and are automatically dropped when the session or transaction ends. They are useful for managing intermediate results, temporary data, or data that is not part of the permanent schema, and can help simplify complex SQL queries.

