How to Delete Duplicate Rows in SQL

Updated on 02/02/2025

Table of Contents

Overview
Deleting Duplicate Rows in SQL
Methods to Identify Duplicate Rows
How Do You Eliminate Duplicate Rows in SQL?
Conclusion
FAQs

When developing objects in SQL Server, we should adhere to a few recommended practices. To guarantee data integrity and performance, a table should, for instance, have primary keys, identity columns, clustered and non-clustered indexes, and constraints. Even when we adhere to best practices, problems like duplicate rows might still occur. Imported data, for example, often lands in intermediate (staging) tables first, and any duplicate rows should be removed there before the data is loaded into the production tables.

This tutorial explores several methods for finding and removing duplicate rows from a SQL table. The examples follow SQL Server syntax; the same ideas apply to Oracle and MySQL, although some statements need adjusting for those systems.

Overview

Duplicate rows in SQL tables can arise for various reasons, such as data entry errors, software glitches, or inconsistencies in data integration processes. Regardless of the cause, it's crucial to implement strategies to identify and eliminate these duplicates to maintain data integrity and optimize performance.

Duplicate rows not only clutter the database but also lead to inefficiencies in querying and processing data. For instance, a table containing duplicate customer records may skew analytical reports or cause errors in billing systems. Moreover, redundant data occupies unnecessary storage space, impacting the overall performance of database operations, especially in large-scale systems.

In this article, we'll explore various methods to tackle duplicate rows in SQL tables, ranging from basic querying techniques to more advanced manipulation strategies. By understanding these methods, database administrators and developers can effectively clean up their databases and streamline data management processes. Let's delve into the details of each approach.

Deleting Duplicate Rows in SQL

Duplicate records in a SQL Server table can be very problematic. Duplicate data can cause an order to be processed more than once and can make reports inaccurate. SQL Server offers several ways to handle duplicate records, and the right one depends on whether the table has a unique key:

1. With Unique constraints

When a table has a unique index or key (for example, on an identity column), that key distinguishes rows that are otherwise duplicates, so it can be used to find the extra copies and delete them. The duplicates can be identified with self-joins, the RANK() function, a comparison against the maximum (or minimum) key value in each group, or NOT IN logic.

2. Without Constraints

When a table has no unique index, there is no column that tells identical rows apart, which makes deleting duplicates more difficult. In this case, a common table expression (CTE) combined with the ROW_NUMBER() function can number the rows within each group of duplicates so that all but one can be deleted.
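As a rough illustration of the first case, where a unique key is available, the sketch below deletes every row for which an otherwise identical row with a smaller key exists. It is only a sketch under stated assumptions: it runs against an in-memory SQLite database through Python's built-in sqlite3 module rather than SQL Server, and the function name and the three sample rows are hypothetical.

```python
import sqlite3


def delete_repeats_keep_lowest_id():
    """Remove repeated (name, department) rows, keeping the lowest employee_id."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real server
    conn.execute(
        "CREATE TABLE employees ("
        "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [(1, "John", "Sales"), (2, "Alice", "Marketing"), (3, "John", "Sales")],
    )
    # A row counts as a duplicate if another row has the same values and a smaller key.
    conn.execute(
        """
        DELETE FROM employees
        WHERE EXISTS (
            SELECT 1
            FROM employees AS earlier
            WHERE earlier.name = employees.name
              AND earlier.department = employees.department
              AND earlier.employee_id < employees.employee_id
        )
        """
    )
    remaining = conn.execute(
        "SELECT employee_id, name, department FROM employees ORDER BY employee_id"
    ).fetchall()
    conn.close()
    return remaining
```

Calling delete_repeats_keep_lowest_id() returns the rows with employee_id 1 and 2; the second 'John' row (employee_id 3) is removed because row 1 carries the same values.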

Methods to Identify Duplicate Rows

Identifying duplicate rows is the first step toward managing them. Which rows count as duplicates depends on the columns being compared. Here, we'll explore three common methods, each with an example and the SQL query that implements it.

1. Using COUNT() and GROUP BY Clause:

The COUNT() function returns the number of rows in a group (COUNT(*)) or the number of non-NULL values of a column or expression. Combined with the GROUP BY clause, which groups rows on one or more columns, it shows how many times each value or combination of values occurs. Any group with a count greater than 1 contains duplicates.

Example:

Consider a table named 'employees' with the following structure:

employee_id	name	department
1	John	Sales
2	Alice	Marketing
3	John	Sales
4	Bob	HR
5	Alice	Marketing

In this example, we aim to identify duplicate rows based on the 'name' column.

To identify duplicate rows using COUNT() and GROUP BY, we can execute the following SQL query:

SELECT name, COUNT(*)
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;

The SELECT statement retrieves the 'name' column and counts the occurrences of each unique name.
The GROUP BY clause groups the rows based on the 'name' column.
The HAVING clause filters the groups to include only those with a count greater than 1, indicating duplicates.

Result:

name	COUNT(*)
John	2
Alice	2

The query output shows that both 'John' and 'Alice' appear more than once in the 'employees' table, indicating the presence of duplicate records based on the 'name' column.
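The same pattern extends to more than one column by listing every column in both the SELECT list and the GROUP BY clause. The following is a minimal sketch, assuming an in-memory SQLite database driven from Python's sqlite3 module; the function name is hypothetical and the rows are the 'employees' sample from above.

```python
import sqlite3


def find_duplicate_groups():
    """Return (name, department, occurrences) for every combination that repeats."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for illustration
    conn.execute(
        "CREATE TABLE employees ("
        "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [
            (1, "John", "Sales"),
            (2, "Alice", "Marketing"),
            (3, "John", "Sales"),
            (4, "Bob", "HR"),
            (5, "Alice", "Marketing"),
        ],
    )
    # Group on both columns; HAVING keeps only the groups with more than one row.
    rows = conn.execute(
        """
        SELECT name, department, COUNT(*) AS occurrences
        FROM employees
        GROUP BY name, department
        HAVING COUNT(*) > 1
        ORDER BY name
        """
    ).fetchall()
    conn.close()
    return rows
```

For the sample data this returns one row for ('Alice', 'Marketing') and one for ('John', 'Sales'), each with a count of 2. 'Bob' is absent because his group holds a single row.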

2. Utilizing self-joins to compare rows within the same table

This method involves performing a self-join operation on a table, allowing us to compare rows within the same table based on specific criteria. By joining the table to itself and specifying the conditions for comparison, we can identify duplicate rows effectively.

Example:

Suppose we have a table named 'orders' with columns 'order_id', 'customer_id', and 'order_date'. Let's say the table contains duplicate orders placed by the same customer on the same date, and we want to identify them. We can use the following SQL query:

SELECT o1.order_id, o1.customer_id, o1.order_date
FROM orders o1
INNER JOIN orders o2 ON o1.customer_id = o2.customer_id
AND o1.order_date = o2.order_date
AND o1.order_id <> o2.order_id;

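In this query, we perform a self-join on the 'orders' table, aliasing it as 'o1' and 'o2'. The ON clause matches rows that share the same 'customer_id' and 'order_date' while excluding pairs with the same 'order_id', so that a row is never matched with itself. Each duplicate row is returned once for every other row it matches, so if a group can contain more than two copies, add DISTINCT to the SELECT list to list each row only once.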
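To see how the self-join behaves when a group holds three copies, the sketch below runs the query with and without DISTINCT. It assumes an in-memory SQLite database accessed through Python's sqlite3 module; the function name and the sample rows (including a third order for customer 1001 on 2023-01-15) are hypothetical.

```python
import sqlite3

# Shared tail of the self-join query; only the SELECT keyword in front differs.
JOIN_BODY = """
    o1.order_id, o1.customer_id, o1.order_date
    FROM orders o1
    INNER JOIN orders o2
        ON o1.customer_id = o2.customer_id
       AND o1.order_date = o2.order_date
       AND o1.order_id <> o2.order_id
    ORDER BY o1.order_id
"""


def self_join_duplicates():
    """Return (rows without DISTINCT, rows with DISTINCT) for the sample orders."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for illustration
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 1001, "2023-01-15"),
            (2, 1002, "2023-01-15"),
            (3, 1001, "2023-01-15"),
            (4, 1001, "2023-01-16"),
            (5, 1001, "2023-01-15"),  # hypothetical third copy of the same order
        ],
    )
    plain = conn.execute("SELECT " + JOIN_BODY).fetchall()
    unique = conn.execute("SELECT DISTINCT " + JOIN_BODY).fetchall()
    conn.close()
    return plain, unique
```

With three copies in the group, the plain query returns six rows (each of orders 1, 3, and 5 matched against the other two), while the DISTINCT version returns each of the three orders once.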

3. Applying subqueries to identify rows with matching attributes

Subqueries can be used to identify duplicate rows by comparing rows with matching attributes. By using a subquery to check for the existence of other rows with the same attributes, we can pinpoint duplicates efficiently.

Example:

Consider a table named 'products' with columns 'product_id', 'product_name', and 'price'. Let's say the table contains duplicate products based on their names, and we want to identify them. We can use the following SQL query:

SELECT product_id, product_name, price
FROM products p1
WHERE EXISTS (
SELECT 1
FROM products p2
WHERE p1.product_name = p2.product_name
AND p1.product_id <> p2.product_id
);

In this query, we use a subquery within the WHERE clause to check for other products with the same name but different IDs (product_id). The main query selects the corresponding rows from the 'products' table if such products exist.
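The EXISTS query can be tried out on a small data set as follows. This is a sketch only: it assumes an in-memory SQLite database driven from Python's sqlite3 module, the function name is hypothetical, and the five product rows are sample data chosen for illustration.

```python
import sqlite3


def find_products_with_shared_names():
    """Return every product whose name also appears on another product_id."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for illustration
    conn.execute(
        "CREATE TABLE products ("
        "product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [
            (1, "Laptop", 800),
            (2, "Mouse", 20),
            (3, "Laptop", 850),
            (4, "Keyboard", 50),
            (5, "Mouse", 25),
        ],
    )
    # The correlated subquery looks for a different row with the same name.
    rows = conn.execute(
        """
        SELECT product_id, product_name, price
        FROM products p1
        WHERE EXISTS (
            SELECT 1
            FROM products p2
            WHERE p1.product_name = p2.product_name
              AND p1.product_id <> p2.product_id
        )
        ORDER BY product_id
        """
    ).fetchall()
    conn.close()
    return rows
```

The result lists both 'Laptop' rows and both 'Mouse' rows. Unlike the self-join, EXISTS returns each row at most once no matter how many partners it has.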

How Do You Eliminate Duplicate Rows in SQL?

1. Using ROW_NUMBER() with DELETE

The ROW_NUMBER() function in SQL assigns a unique sequential integer to each row within a partition of a result set. By leveraging this function and the DELETE statement, we can remove duplicate rows based on specific criteria.

Example:

Consider a table named 'students' with columns 'student_id', 'name', and 'score'. Let's say the table contains duplicate student names, and we want to remove them while retaining the row with the highest score.

Student_id	Name	Score
1	Alice	85
2	Bob	90
3	Alice	75
4	Bob	85
5	Charlie	95

We can use the ROW_NUMBER() function to achieve this.

WITH RankedStudents AS (
SELECT student_id, name, score,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY score DESC) AS RowNum
FROM students
)
DELETE FROM RankedStudents
WHERE RowNum > 1;

In this query, a Common Table Expression (CTE) named 'RankedStudents' numbers the rows of the 'students' table:

The PARTITION BY clause splits the rows by the 'name' column, so that row numbers are assigned separately for each student name.
The ORDER BY clause sorts the rows within each partition by 'score' in descending order, so the row with the highest score for each name receives row number 1.
The DELETE statement then removes every row whose row number is greater than 1, which leaves only the highest-scoring row for each name.

If two rows for the same name have the same score, which of them receives row number 1 is not defined unless a further column is added to the ORDER BY clause.

Deleting through a CTE in this way is SQL Server syntax: the delete is applied to the underlying 'students' table. Other database systems may not accept a DELETE that targets a CTE, in which case the same ranking can be placed in a subquery that selects the 'student_id' values to delete.

Result after executing the query

After executing the query, the duplicate rows with lower scores for each student name will be removed, leaving only the row with the highest score for each student name intact.

Student_id	Name	Score
1	Alice	85
2	Bob	90
5	Charlie	95
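Where deleting directly from a CTE is not available, one possible variant ranks the rows in a subquery and deletes by key instead. The sketch below assumes an in-memory SQLite database (version 3.25 or later, which supports window functions) accessed through Python's sqlite3 module; the function name is hypothetical and the rows are the 'students' sample from above.

```python
import sqlite3


def keep_highest_score_per_name():
    """Delete all but the highest-scoring row for each student name."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite 3.25+ for ROW_NUMBER()
    conn.execute(
        "CREATE TABLE students ("
        "student_id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?)",
        [(1, "Alice", 85), (2, "Bob", 90), (3, "Alice", 75), (4, "Bob", 85), (5, "Charlie", 95)],
    )
    # Rank rows per name in a subquery, then delete the keys ranked below first place.
    conn.execute(
        """
        DELETE FROM students
        WHERE student_id IN (
            SELECT student_id
            FROM (
                SELECT student_id,
                       ROW_NUMBER() OVER (PARTITION BY name ORDER BY score DESC) AS row_num
                FROM students
            ) AS ranked
            WHERE row_num > 1
        )
        """
    )
    remaining = conn.execute(
        "SELECT student_id, name, score FROM students ORDER BY student_id"
    ).fetchall()
    conn.close()
    return remaining
```

On the sample data the rows with student_id 3 and 4 are deleted, leaving Alice with 85, Bob with 90, and Charlie with 95.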

2. Using DISTINCT with DELETE

To illustrate this method, let's consider a scenario where we have a table named 'orders' with columns 'order_id', 'customer_id', and 'order_date'. We want to remove duplicate orders based on the combination of 'customer_id' and 'order_date', ensuring that only one order remains for each unique combination.

Before deletion, the 'orders' table may look like this:

order_id	customer_id	order_date
1	1001	2023-01-15
2	1002	2023-01-15
3	1001	2023-01-15
4	1001	2023-01-16
5	1003	2023-01-16

Here's the query:

DELETE FROM orders
WHERE order_id NOT IN (
SELECT MIN(order_id)
FROM orders
GROUP BY customer_id, order_date
);

DELETE has no DISTINCT option of its own, so the distinct combinations are found with GROUP BY in a subquery. The subquery returns the lowest 'order_id' for each combination of 'customer_id' and 'order_date', and the outer statement deletes every row whose 'order_id' is not in that list. In other words, it keeps only the lowest-numbered order for each unique combination.

After executing the deletion query, the 'orders' table will be updated as follows:

order_id	customer_id	order_date
1	1001	2023-01-15
2	1002	2023-01-15
4	1001	2023-01-16
5	1003	2023-01-16

Duplicate orders with the same 'customer_id' and 'order_date' combination have been removed, leaving only one order for each unique combination. This ensures data integrity and avoids redundancy in the 'orders' table.
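One way to check that a deletion of this kind leaves exactly one order per ('customer_id', 'order_date') pair is to run it on a copy of the sample data. The sketch below keeps the lowest 'order_id' in each group; it assumes an in-memory SQLite database driven from Python's sqlite3 module, and the function name is hypothetical.

```python
import sqlite3


def delete_duplicate_orders():
    """Keep the lowest order_id per (customer_id, order_date) and return what remains."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for illustration
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 1001, "2023-01-15"),
            (2, 1002, "2023-01-15"),
            (3, 1001, "2023-01-15"),
            (4, 1001, "2023-01-16"),
            (5, 1003, "2023-01-16"),
        ],
    )
    # The subquery lists one surviving key per group; everything else is deleted.
    conn.execute(
        """
        DELETE FROM orders
        WHERE order_id NOT IN (
            SELECT MIN(order_id)
            FROM orders
            GROUP BY customer_id, order_date
        )
        """
    )
    remaining = conn.execute(
        "SELECT order_id, customer_id, order_date FROM orders ORDER BY order_id"
    ).fetchall()
    conn.close()
    return remaining
```

Only order 3 is removed, since it repeats the customer and date of order 1. Orders 2 and 5 stay because their combinations occur once.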

3. Using Temporary Tables

Consider a scenario where we need to remove duplicate rows from a table named 'products' based on the 'product_name' column. The table initially contains the following rows:

product_id	product_name	price
1	Laptop	800
2	Mouse	20
3	Laptop	850
4	Keyboard	50
5	Mouse	25

Query using Temporary Tables

-- Create a temporary table to store distinct rows
CREATE TABLE #TempTable (
product_id INT PRIMARY KEY,
product_name VARCHAR(255),
price DECIMAL(10, 2)
);

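-- Insert one row per product_name (the row with the lowest product_id).
-- SELECT DISTINCT over all columns would keep every row here,
-- because product_id already makes each row unique.
INSERT INTO #TempTable (product_id, product_name, price)
SELECT product_id, product_name, price
FROM products
WHERE product_id IN (
SELECT MIN(product_id)
FROM products
GROUP BY product_name
);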

-- Truncate the original table
TRUNCATE TABLE products;

-- Insert rows from the temporary table back into the original table
INSERT INTO products (product_id, product_name, price)
SELECT product_id, product_name, price
FROM #TempTable;

-- Drop the temporary table
DROP TABLE #TempTable;

Result after executing the SQL query for removing duplicate records

After executing the above sequence of SQL commands, the 'products' table will contain only unique rows with no duplicate product names.

product_id	product_name	price
1	Laptop	800
2	Mouse	20
4	Keyboard	50
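The temporary-table approach can be rehearsed on a throwaway database before it is used on real data. The following is a sketch under stated assumptions: it uses an in-memory SQLite database through Python's sqlite3 module, so a TEMP table and DELETE take the place of SQL Server's #TempTable and TRUNCATE TABLE, and the names kept_products and rebuild_without_duplicate_names are hypothetical.

```python
import sqlite3


def rebuild_without_duplicate_names():
    """Rebuild 'products' so that each product_name appears once (lowest product_id)."""
    # isolation_level=None puts the connection in autocommit mode,
    # so the transaction below is controlled explicitly with BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE products ("
        "product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "Laptop", 800), (2, "Mouse", 20), (3, "Laptop", 850),
         (4, "Keyboard", 50), (5, "Mouse", 25)],
    )
    # Stage one row per product_name in a temporary table.
    conn.execute(
        """
        CREATE TEMP TABLE kept_products AS
        SELECT product_id, product_name, price
        FROM products
        WHERE product_id IN (SELECT MIN(product_id) FROM products GROUP BY product_name)
        """
    )
    # Empty and reload the original table inside one transaction,
    # so a failure part-way through does not leave the table empty.
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM products")
        conn.execute(
            "INSERT INTO products (product_id, product_name, price) "
            "SELECT product_id, product_name, price FROM kept_products"
        )
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    conn.execute("DROP TABLE kept_products")
    remaining = conn.execute(
        "SELECT product_id, product_name, price FROM products ORDER BY product_id"
    ).fetchall()
    conn.close()
    return remaining
```

Wrapping the empty-and-reload step in a transaction is a design choice made for this sketch: if the reload fails, the rollback restores the original rows.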

Conclusion

Deleting duplicate rows in SQL is a crucial task in database management to ensure data accuracy and optimize performance. We can efficiently identify and remove duplicate records from tables by employing techniques such as DISTINCT with DELETE, ROW_NUMBER() with DELETE, and temporary tables. Implementing these methods improves data quality and enhances the overall efficiency of SQL queries and operations.

FAQs

How do I remove duplicate rows in SQL based on two columns?

To remove duplicate rows based on two columns, group by those two columns in a subquery, keep one key value per group (for example, the minimum ID), and delete the rows whose key is not in that set. ROW_NUMBER() with PARTITION BY on the two columns works as well.

How do I delete duplicate entries in SQL but keep one?

To delete duplicate entries but keep one, use the ROW_NUMBER() function to number the rows within each group of duplicates, then delete every row whose number is greater than 1. In MySQL, ROW_NUMBER() is available from version 8.0 onward.

How do I remove duplicate rows from a table in SQL without Rowid?

You can delete duplicate rows in SQL without Rowid by using self-joins, subqueries, or temporary tables to identify and remove duplicates based on the desired criteria.

Does removing duplicates remove the entire row?

Yes. A DELETE statement always removes entire rows; it cannot remove individual column values. When removing duplicates, only the extra copies are deleted, and one row from each group of duplicates is retained.

Which command is used to remove duplicates?

The command used to remove duplicates in SQL is the DELETE statement, often combined with other SQL functions or clauses to specify the criteria for identifying duplicates.

Which command is used to duplicate the selected data?

To duplicate selected data in SQL, you can use the INSERT INTO statement with a SELECT subquery to copy data from an existing table, or from the result of a query, into another table.

Rohan Vats

