How to Delete Duplicate Rows in SQL

Updated on 19/07/2024

When developing objects in SQL Server, we should follow a few recommended practices. To protect data integrity and performance, a table should, for instance, have a primary key, identity columns where appropriate, clustered and non-clustered indexes, and constraints. Even when we follow these practices, duplicate rows can still occur, for example when data arrives through an import into intermediate (staging) tables. We want to remove the duplicates from those intermediate tables before loading the data into the production tables.

This tutorial explores several methods for finding and removing duplicate rows in SQL. The examples use SQL Server syntax; other systems such as Oracle and MySQL support similar techniques, though some statements need adjusting.

Overview

Duplicate rows in SQL tables can arise for various reasons, such as data entry errors, software glitches, or inconsistencies in data integration processes. Regardless of the cause, it's crucial to implement strategies to identify and eliminate these duplicates to maintain data integrity and optimize performance.

Duplicate rows not only clutter the database but also lead to inefficiencies in querying and processing data. For instance, a table containing duplicate customer records may skew analytical reports or cause errors in billing systems. Moreover, redundant data occupies unnecessary storage space, impacting the overall performance of database operations, especially in large-scale systems.

In this article, we'll explore various methods to tackle duplicate rows in SQL tables, ranging from basic querying techniques to more advanced manipulation strategies. By understanding these methods, database administrators and developers can effectively clean up their databases and streamline data management processes. Let's delve into the details of each approach.

Deleting Duplicate Rows in SQL

Duplicate records in a SQL Server table can be very problematic. Orders may be handled more than once due to duplicate data, leading to inaccurate reporting outcomes. SQL Server offers numerous ways to handle duplicate records in a table, depending on the situation. They are as follows:

1. With Unique constraints

When a table has a unique index (for example, on an ID column), you can use that index to tell otherwise identical rows apart: first identify the duplicate data, then delete the extra records. Identification can be done with self-joins, the RANK function, ordering the data and keeping the maximum value, or NOT IN logic.

2. Without Constraints

When a table has no unique index, nothing distinguishes one copy of a duplicate row from another, which can make deleting duplicates challenging. Using a common table expression (CTE) in conjunction with the ROW_NUMBER() function, you can number the rows within each group of duplicates and delete every row except the first.

Methods to Identify Duplicate Rows

Identifying duplicate rows in SQL is the initial step toward managing them effectively. Various techniques can be employed to identify duplicate records based on specific criteria. Here, we'll explore three common methods, each with an example and an SQL query that demonstrates it.

1. Using COUNT() and GROUP BY Clause:

The COUNT() function in SQL returns the number of rows in a group (COUNT(*)) or the number of non-NULL values of a column or expression. When combined with the GROUP BY clause, it allows us to group rows based on one or more columns and compute an aggregate for each group. This combination lets us identify duplicates by counting how many rows share the same values in the selected columns.

Example:

Consider a table named 'employees' with the following structure:

employee_id | name  | department
------------|-------|-----------
1           | John  | Sales
2           | Alice | Marketing
3           | John  | Sales
4           | Bob   | HR
5           | Alice | Marketing

In this example, we aim to identify duplicate rows based on the 'name' column.

To identify duplicate rows using COUNT() and GROUP BY, we can execute the following SQL query:

    SELECT name, COUNT(*)
    FROM employees
    GROUP BY name
    HAVING COUNT(*) > 1;

  • The SELECT statement retrieves the 'name' column and counts the occurrences of each unique name.
  • The GROUP BY clause groups the rows based on the 'name' column.
  • The HAVING clause filters the groups to include only those with a count greater than 1, indicating duplicates.

Result:

name  | COUNT(*)
------|---------
John  | 2
Alice | 2

The query output shows that both 'John' and 'Alice' appear more than once in the 'employees' table, indicating the presence of duplicate records based on the 'name' column.
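To try this query without a database server, the sketch below loads the same sample rows into an in-memory SQLite database through Python's built-in sqlite3 module. The `occurrences` alias and the ORDER BY clause are additions for readable, deterministic output; they are not part of the query above.

```python
import sqlite3

# In-memory SQLite database holding the sample 'employees' rows from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "John", "Sales"),
        (2, "Alice", "Marketing"),
        (3, "John", "Sales"),
        (4, "Bob", "HR"),
        (5, "Alice", "Marketing"),
    ],
)

# Group by name and keep only the groups that contain more than one row.
duplicate_names = conn.execute(
    """
    SELECT name, COUNT(*) AS occurrences
    FROM employees
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY name
    """
).fetchall()

print(duplicate_names)  # [('Alice', 2), ('John', 2)]
```

To treat rows as duplicates only when several columns match, list all of those columns in both the SELECT list and the GROUP BY clause.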

2. Utilizing self-joins to compare rows within the same table

This method involves performing a self-join operation on a table, allowing us to compare rows within the same table based on specific criteria. By joining the table to itself and specifying the conditions for comparison, we can identify duplicate rows effectively.

Example:

Suppose we have a table named 'orders' with columns 'order_id', 'customer_id', and 'order_date'. Let's say the table contains duplicate orders placed by the same customer on the same date, and we want to identify them. We can use the following SQL query:

    SELECT o1.order_id, o1.customer_id, o1.order_date
    FROM orders o1
    INNER JOIN orders o2 ON o1.customer_id = o2.customer_id
                         AND o1.order_date = o2.order_date
                         AND o1.order_id <> o2.order_id;

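In this query, we perform a self-join on the 'orders' table, aliasing it as 'o1' and 'o2'. The ON clause matches rows that share the same 'customer_id' and 'order_date' but have different 'order_id' values, so every row that has at least one duplicate partner is returned. If a combination occurs three or more times, each of its rows appears once per matching partner; adding DISTINCT to the SELECT list removes those repeats.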
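The self-join can be exercised with a small in-memory SQLite database. This is a minimal sketch: the sample rows are invented for illustration, and DISTINCT and ORDER BY are added so each duplicated order is listed once in a stable order.

```python
import sqlite3

# Hypothetical sample data: orders 1 and 3 share customer_id and order_date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 1001, "2023-01-15"),
        (2, 1002, "2023-01-15"),
        (3, 1001, "2023-01-15"),
        (4, 1001, "2023-01-16"),
        (5, 1003, "2023-01-16"),
    ],
)

# Join the table to itself: same customer and date, but a different order_id.
duplicate_orders = conn.execute(
    """
    SELECT DISTINCT o1.order_id, o1.customer_id, o1.order_date
    FROM orders o1
    INNER JOIN orders o2 ON o1.customer_id = o2.customer_id
                         AND o1.order_date = o2.order_date
                         AND o1.order_id <> o2.order_id
    ORDER BY o1.order_id
    """
).fetchall()

print(duplicate_orders)  # [(1, 1001, '2023-01-15'), (3, 1001, '2023-01-15')]
```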

3. Applying subqueries to identify rows with matching attributes

Subqueries can be used to identify duplicate rows by comparing rows with matching attributes. By using a subquery to check for the existence of other rows with the same attributes, we can pinpoint duplicates efficiently.

Example:

Consider a table named 'products' with columns 'product_id', 'product_name', and 'price'. Let's say the table contains duplicate products based on their names, and we want to identify them. We can use the following SQL query:

    SELECT product_id, product_name, price
    FROM products p1
    WHERE EXISTS (
        SELECT 1
        FROM products p2
        WHERE p1.product_name = p2.product_name
        AND p1.product_id <> p2.product_id
    );

In this query, we use a subquery within the WHERE clause to check for other products with the same name but different IDs (product_id). The main query selects the corresponding rows from the 'products' table if such products exist.
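As a quick check of the EXISTS approach, the sketch below runs it against an in-memory SQLite database. The sample products and the ORDER BY clause are assumptions made for this illustration.

```python
import sqlite3

# Hypothetical sample data: 'Laptop' and 'Mouse' each appear twice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Laptop", 800),
        (2, "Mouse", 20),
        (3, "Laptop", 850),
        (4, "Keyboard", 50),
        (5, "Mouse", 25),
    ],
)

# Keep a row only if another row has the same name but a different id.
rows = conn.execute(
    """
    SELECT p1.product_id, p1.product_name, p1.price
    FROM products p1
    WHERE EXISTS (
        SELECT 1
        FROM products p2
        WHERE p1.product_name = p2.product_name
          AND p1.product_id <> p2.product_id
    )
    ORDER BY p1.product_id
    """
).fetchall()

duplicate_ids = [row[0] for row in rows]
print(duplicate_ids)  # [1, 2, 3, 5]
```

The 'Keyboard' row (product_id 4) is left out because no other row shares its name.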

How Do You Eliminate Duplicate Rows in SQL?

1. Using ROW_NUMBER() with DELETE

The ROW_NUMBER() function in SQL assigns a unique sequential integer to each row within a partition of a result set. By leveraging this function and the DELETE statement, we can remove duplicate rows based on specific criteria.

Example:

Consider a table named 'students' with columns 'student_id', 'name', and 'score'. Let's say the table contains duplicate student names, and we want to remove them while retaining the row with the highest score. 

student_id | name    | score
-----------|---------|------
1          | Alice   | 85
2          | Bob     | 90
3          | Alice   | 75
4          | Bob     | 85
5          | Charlie | 95

We can use the ROW_NUMBER() function to achieve this.

    -- SQL Server syntax: deleting through the CTE removes the underlying rows from 'students'
    WITH RankedStudents AS (
        SELECT student_id, name, score,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY score DESC) AS RowNum
        FROM students
    )
    DELETE FROM RankedStudents
    WHERE RowNum > 1;

In this query, we first use a Common Table Expression (CTE) named 'RankedStudents' to assign row numbers to each row within a partition defined by the 'name' column, ordered by 'score' in descending order. The ROW_NUMBER() function ensures that the row with the highest score for each student name gets assigned a row number of 1.

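Then, we use the DELETE statement to delete rows through the 'RankedStudents' CTE where the row number is greater than 1, indicating duplicate rows. In SQL Server, deleting through a CTE like this removes the corresponding rows from the underlying 'students' table. Other database systems, such as MySQL and PostgreSQL, do not accept a CTE as the DELETE target and need the row numbers computed in a subquery instead.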

  • The ROW_NUMBER() function assigns a unique row number to each row within a partition defined by the 'name' column.
  • The PARTITION BY clause partitions the result set by the 'name' column, ensuring that row numbers are assigned separately for each student name.
  • The ORDER BY clause orders the rows within each partition by the 'score' column in descending order, ensuring that the row with the highest score for each student name gets assigned a row number of 1.
  • The DELETE statement deletes rows from the 'RankedStudents' CTE where the row number is greater than 1, effectively removing duplicate rows while retaining the one with the highest score for each student name.

Result after executing the query

After executing the query, the duplicate rows with lower scores for each student name will be removed, leaving only the row with the highest score for each student name intact.

student_id | name    | score
-----------|---------|------
1          | Alice   | 85
2          | Bob     | 90
5          | Charlie | 95
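Deleting directly through a CTE is not available everywhere, so here is a hedged sketch of the same idea in a form SQLite accepts: the row numbers are computed in a subquery and the DELETE targets the matching 'student_id' values. It assumes SQLite 3.25 or newer (needed for window functions) and uses Python's built-in sqlite3 module; the `row_num` and `ranked` names are chosen for this example.

```python
import sqlite3

# Sample 'students' rows from the text, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        (1, "Alice", 85),
        (2, "Bob", 90),
        (3, "Alice", 75),
        (4, "Bob", 85),
        (5, "Charlie", 95),
    ],
)

# Number the rows per name, highest score first, then delete everything
# except row number 1 in each group.
conn.execute(
    """
    DELETE FROM students
    WHERE student_id IN (
        SELECT student_id
        FROM (
            SELECT student_id,
                   ROW_NUMBER() OVER (PARTITION BY name ORDER BY score DESC) AS row_num
            FROM students
        ) AS ranked
        WHERE row_num > 1
    )
    """
)

remaining = conn.execute(
    "SELECT student_id, name, score FROM students ORDER BY student_id"
).fetchall()
print(remaining)  # [(1, 'Alice', 85), (2, 'Bob', 90), (5, 'Charlie', 95)]
```

If two rows for the same name have equal scores, ROW_NUMBER() picks one of them arbitrarily; adding a second ORDER BY column such as 'student_id' makes the choice deterministic.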


2. Using DELETE with GROUP BY and NOT IN

To illustrate this method, let's consider a scenario where we have a table named 'orders' with columns 'order_id', 'customer_id', and 'order_date'. We want to remove duplicate orders based on the combination of 'customer_id' and 'order_date', ensuring that only one order remains for each unique combination.

Before deletion, the 'orders' table may look like this:

order_id | customer_id | order_date
---------|-------------|-----------
1        | 1001        | 2023-01-15
2        | 1002        | 2023-01-15
3        | 1001        | 2023-01-15
4        | 1001        | 2023-01-16
5        | 1003        | 2023-01-16

Here's the query:

    DELETE FROM orders
    WHERE order_id NOT IN (
        SELECT MIN(order_id)
        FROM orders
        GROUP BY customer_id, order_date
    );

This query groups the rows by 'customer_id' and 'order_date' and finds the smallest 'order_id' in each group. It then deletes every row whose 'order_id' is not one of those minimum values. In other words, it keeps only the order with the lowest 'order_id' for each unique combination of 'customer_id' and 'order_date'. Note that MySQL rejects a DELETE whose subquery reads directly from the table being modified; there, the subquery has to be wrapped in a derived table.

After executing the deletion query, the 'orders' table will be updated as follows:

order_id | customer_id | order_date
---------|-------------|-----------
1        | 1001        | 2023-01-15
2        | 1002        | 2023-01-15
4        | 1001        | 2023-01-16
5        | 1003        | 2023-01-16

Duplicate orders with the same 'customer_id' and 'order_date' combination have been removed, leaving only one order for each unique combination. This ensures data integrity and avoids redundancy in the 'orders' table.
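The deletion can be verified on the sample 'orders' rows with an in-memory SQLite database. This is a minimal sketch that keeps the row with the lowest 'order_id' in each ('customer_id', 'order_date') group; which row to keep is a choice made for this example.

```python
import sqlite3

# Sample 'orders' rows from the text: orders 1 and 3 are duplicates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 1001, "2023-01-15"),
        (2, 1002, "2023-01-15"),
        (3, 1001, "2023-01-15"),
        (4, 1001, "2023-01-16"),
        (5, 1003, "2023-01-16"),
    ],
)

# Keep the lowest order_id per (customer_id, order_date); delete the rest.
cursor = conn.execute(
    """
    DELETE FROM orders
    WHERE order_id NOT IN (
        SELECT MIN(order_id)
        FROM orders
        GROUP BY customer_id, order_date
    )
    """
)
deleted_count = cursor.rowcount

remaining_orders = conn.execute(
    "SELECT order_id, customer_id, order_date FROM orders ORDER BY order_id"
).fetchall()
print(deleted_count)     # 1
print(remaining_orders)  # orders 1, 2, 4 and 5 remain
```

Running the inner SELECT on its own before the DELETE is a cheap way to confirm which rows will be kept.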

3. Using Temporary Tables

Consider a scenario where we need to remove duplicate rows from a table named 'products' based on the 'product_name' column. Suppose the table contains the following rows:

product_id | product_name | price
-----------|--------------|------
1          | Laptop       | 800
2          | Mouse        | 20
3          | Laptop       | 850
4          | Keyboard     | 50
5          | Mouse        | 25

Query using Temporary Tables

    -- Create a temporary table to hold the rows we want to keep
    CREATE TABLE #TempTable (
        product_id INT PRIMARY KEY,
        product_name VARCHAR(255),
        price DECIMAL(10, 2)
    );

    -- Insert one row per product name (the one with the lowest product_id).
    -- SELECT DISTINCT on all three columns would not remove anything here,
    -- because product_id makes every row distinct.
    INSERT INTO #TempTable (product_id, product_name, price)
    SELECT p.product_id, p.product_name, p.price
    FROM products p
    WHERE p.product_id = (
        SELECT MIN(p2.product_id)
        FROM products p2
        WHERE p2.product_name = p.product_name
    );

    -- Truncate the original table
    TRUNCATE TABLE products;

    -- Insert rows from the temporary table back into the original table
    INSERT INTO products (product_id, product_name, price)
    SELECT product_id, product_name, price
    FROM #TempTable;

    -- Drop the temporary table
    DROP TABLE #TempTable;

Result after executing the query

After executing the above sequence of SQL commands, the 'products' table will contain one row per product name, namely the row with the lowest 'product_id' for each name.

product_id | product_name | price
-----------|--------------|------
1          | Laptop       | 800
2          | Mouse        | 20
4          | Keyboard     | 50
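The temporary-table workflow can be sketched in SQLite as well. Two things differ from the SQL Server script and are assumptions of this example: SQLite has no TRUNCATE TABLE, so an unfiltered DELETE is used, and the temporary table (here called `products_dedup`, a name chosen for illustration) is created with CREATE TEMP TABLE ... AS SELECT.

```python
import sqlite3

# Sample 'products' rows from the text, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Laptop", 800),
        (2, "Mouse", 20),
        (3, "Laptop", 850),
        (4, "Keyboard", 50),
        (5, "Mouse", 25),
    ],
)
conn.commit()

# 'with conn' commits on success and rolls back if a statement raises.
with conn:
    # Step 1: copy the rows to keep (lowest product_id per name) aside.
    conn.execute(
        """
        CREATE TEMP TABLE products_dedup AS
        SELECT p.product_id, p.product_name, p.price
        FROM products p
        WHERE p.product_id = (
            SELECT MIN(p2.product_id)
            FROM products p2
            WHERE p2.product_name = p.product_name
        )
        """
    )
    # Step 2: empty the original table (SQLite's stand-in for TRUNCATE).
    conn.execute("DELETE FROM products")
    # Step 3: copy the kept rows back, then drop the temporary table.
    conn.execute(
        """
        INSERT INTO products (product_id, product_name, price)
        SELECT product_id, product_name, price FROM products_dedup
        """
    )
    conn.execute("DROP TABLE products_dedup")

remaining_products = conn.execute(
    "SELECT product_id, product_name, price FROM products ORDER BY product_id"
).fetchall()
print(remaining_products)  # [(1, 'Laptop', 800), (2, 'Mouse', 20), (4, 'Keyboard', 50)]
```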

Conclusion

Deleting duplicate rows in SQL is a crucial task in database management to ensure data accuracy and optimize performance. We can efficiently identify and remove duplicate records from tables by employing techniques such as ROW_NUMBER() with DELETE, DELETE with a GROUP BY subquery, and temporary tables. Implementing these methods improves data quality and enhances the overall efficiency of SQL queries and operations.

FAQs

  1. How do I remove duplicate rows in SQL based on two columns?

To remove duplicate rows based on two columns, treat each combination of the two columns as a group: either use ROW_NUMBER() with PARTITION BY on both columns and delete every row numbered higher than 1, or GROUP BY both columns and delete the rows whose key is not the minimum (or maximum) of its group.

  2. How do I delete duplicate entries in SQL but keep one?

To delete duplicate entries but keep one, use the ROW_NUMBER() function to number the rows within each group of duplicates, then delete the rows with a row number greater than 1. The ORDER BY inside ROW_NUMBER() decides which row is kept.

  3. How do I remove duplicate rows from a table in SQL without Rowid?

You can delete duplicate rows in SQL without Rowid by using self-joins, subqueries, or temporary tables to identify and remove duplicates based on the desired criteria.

  4. Does removing duplicates remove the entire row?

Yes. The DELETE statement always removes whole rows, never individual columns. When removing duplicates, it deletes the extra copies in full and leaves one row from each group of duplicates, along with all rows that were already unique.

  5. Which command is used to remove duplicates?

The command used to remove duplicates in SQL is the DELETE statement, often combined with other SQL functions or clauses to specify the criteria for identifying duplicates.

  6. Which command is used to duplicate the selected data?

To duplicate selected data in SQL, you can use the INSERT INTO statement with a SELECT query to copy rows from an existing table, or from any query result, into another table.

Rohan Vats
