C# Dealing with duplicates

This content originally appeared on DEV Community and was authored by Karen Payne

Introduction

Learn how to contend with duplicate data using ISet and HashSet. Code samples range from working with simple arrays and collections of mocked data to reading an Excel worksheet.

By following along with the provided code, a developer can then consider using what is shown in various operations, such as reading incoming data and then adding that data to an internal data source like a database.

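Note that these methods are, in general, a better fit than Distinct, because a set enforces uniqueness as items are added rather than filtering duplicates out afterwards.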

Definitions

ISet is an interface located in the System.Collections.Generic namespace, designed to represent a collection of unique elements, ensuring no duplicates are stored. The primary aim of ISet is to facilitate the management of collections where the uniqueness of each element is paramount, providing efficient methods for set operations like union, intersection, and difference.

A HashSet is a collection of unique elements that uses a hash table for storage, which makes lookups faster than in list-like collections that must be searched element by element. Adding and removing elements is also, on average, a constant-time operation. However, a HashSet does not guarantee any particular ordering of its elements, and elements cannot be accessed by index.

Frozen collections (which are used in several of the code samples) are optimized for collections that are accessed frequently and whose keys and values do not need to change after creation. They live in the System.Collections.Frozen namespace, available from .NET 8. These collections are a bit slower to create, but read operations are faster.
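As a rough illustration of the create-once, read-many pattern described above, here is a minimal sketch. The class and method names (FrozenSetDemo, Build, IsKnown) are made up for this example and are not part of the article's source code; it assumes .NET 8 or later.

```csharp
using System.Collections.Frozen;
using System.Collections.Generic;

// Hypothetical helper, not taken from the article's repository.
public static class FrozenSetDemo
{
    // Pays the creation cost once; duplicate input values collapse to one entry.
    public static FrozenSet<string> Build(IEnumerable<string> values)
        => values.ToFrozenSet();

    // Lookups are the operation a frozen set is optimized for.
    public static bool IsKnown(FrozenSet<string> known, string value)
        => known.Contains(value);
}
```

Calling Build with the values "US", "CA", "US" gives a set of two items. The result cannot be changed afterwards, which is why the later samples do all their adding and removing on a HashSet before calling ToFrozenSet.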

Examples

All code samples use mocked data with a small data set for ease of following along, except the final example, which reads data from an Excel file using the ExcelMapper NuGet package. See also C# Excel read/write on the cheap for more on ExcelMapper.

Two models are used. Both implement INotifyPropertyChanged, which is not needed for ensuring that data is not duplicated.

Source code

Example 1

When adding new items to an ISet, many developers first check whether the item is already contained in the set, as shown below.

ISet<int> set = new HashSet<int> { 1, 2, 3 };

int[] array = [3, 4, 5];

foreach (var item in array)
{
    // ReSharper disable once CanSimplifySetAddingWithSingleCall
    if (!set.Contains(item))
    {
        set.Add(item);
    }
}

But there is no need to check whether an item exists. The Add method will not add an item that is already in the set; it simply returns false.

Revised code

ISet<int> set = new HashSet<int> { 1, 2, 3 };

int[] array = [3, 4, 5];

foreach (var item in array)
{
    set.Add(item);
}
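Because Add returns true for a new item and false for one that is already present, the return value can be used to find out which incoming values were rejected. A minimal sketch of that idea follows; the names SetAddDemo and AddAll are invented for this illustration.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not taken from the article's repository.
public static class SetAddDemo
{
    // Adds every incoming value to the set and returns the values that
    // were rejected because they were already present.
    public static List<int> AddAll(ISet<int> set, IEnumerable<int> incoming)
    {
        var rejected = new List<int>();

        foreach (var item in incoming)
        {
            // ISet<T>.Add returns false when the item is already in the set.
            if (!set.Add(item))
            {
                rejected.Add(item);
            }
        }

        return rejected;
    }
}
```

With the set { 1, 2, 3 } and the array [3, 4, 5] from above, the set ends up holding five values and the returned list contains only 3.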

Example 2

Moving on to a more realistic scenario.

We have a model where not all properties are needed to determine duplicates, e.g. the primary key should not be included, only FirstName, LastName and BirthDate. The best course is to implement IEquatable<Person>, where Equals and GetHashCode use only the properties that define a duplicate.

public class Person : INotifyPropertyChanged, IEquatable<Person>
{
    private int _id;
    private string _firstName;
    private string _lastName;
    private DateOnly _birthDate;

    public int Id
    {
        get => _id;
        set
        {
            if (value == _id) return;
            _id = value;
            OnPropertyChanged(nameof(Id));
        }
    }

    public string FirstName
    {
        get => _firstName;
        set
        {
            if (value == _firstName) return;
            _firstName = value;
            OnPropertyChanged(nameof(FirstName));
        }
    }

    public string LastName
    {
        get => _lastName;
        set
        {
            if (value == _lastName) return;
            _lastName = value;
            OnPropertyChanged(nameof(LastName));
        }
    }

    public DateOnly BirthDate
    {
        get => _birthDate;
        set
        {
            if (value.Equals(_birthDate)) return;
            _birthDate = value;
            OnPropertyChanged(nameof(BirthDate));
        }
    }

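    // Two people are duplicates when name and birth date match; Id is ignored.
    public bool Equals(Person? compareTo) 
        => compareTo is not null &&
           FirstName == compareTo.FirstName && 
           LastName == compareTo.LastName && 
           BirthDate == compareTo.BirthDate;

    // Keep object.Equals consistent with IEquatable<Person>.Equals.
    public override bool Equals(object? obj) => Equals(obj as Person);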

    public override int GetHashCode() 
        => HashCode.Combine(FirstName, LastName, BirthDate);

    public event PropertyChangedEventHandler? PropertyChanged;

    protected virtual void OnPropertyChanged(string propertyName)
    {
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
    }
    public override string ToString() => $"{FirstName,-12}{LastName}";
}

The following set represents existing data (there is no need for a large data set). Note that the third item repeats the first, so the HashSet constructor keeps only two people.

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

And for simplicity, add two new items using mocked data that in a real application might be an import from a file, database or web service.

As shown in the int example, only Frank Adams is added here; Karen Payne is rejected per the IEquatable<Person> definition, even though her Id differs.

private static FrozenSet<Person> PeopleAdd()
{
    ShowExecutingMethodName();

    var peopleSet = PeopleData();

    peopleSet.Add(new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
        BirthDate = new DateOnly(1966, 3, 4) });
    peopleSet.Add(new() { Id = 4, FirstName = "Karen", LastName = "Payne",
        BirthDate = new DateOnly(1956, 9, 24) });

    return peopleSet.ToFrozenSet();
}

Note
A FrozenSet can be expensive to create with ToFrozenSet, but it is efficient for read operations.

Result

Shows result for adding
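The outcome can be checked with a trimmed-down sketch. The Member class below is a hypothetical stand-in for Person without the INotifyPropertyChanged plumbing, and MemberDemo.Build is an invented helper; only the equality logic mirrors the article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down model: Id is deliberately left out of equality.
public sealed class Member : IEquatable<Member>
{
    public int Id { get; init; }
    public string FirstName { get; init; } = "";
    public string LastName { get; init; } = "";
    public DateOnly BirthDate { get; init; }

    public bool Equals(Member? other)
        => other is not null &&
           FirstName == other.FirstName &&
           LastName == other.LastName &&
           BirthDate == other.BirthDate;

    public override bool Equals(object? obj) => Equals(obj as Member);

    public override int GetHashCode()
        => HashCode.Combine(FirstName, LastName, BirthDate);
}

public static class MemberDemo
{
    public static HashSet<Member> Build()
    {
        var members = new HashSet<Member>
        {
            new() { Id = 1, FirstName = "Karen", LastName = "Payne",
                BirthDate = new DateOnly(1956, 9, 24) },
            new() { Id = 2, FirstName = "Sam", LastName = "Smith",
                BirthDate = new DateOnly(1976, 3, 4) }
        };

        // New person: added.
        members.Add(new() { Id = 3, FirstName = "Frank", LastName = "Adams",
            BirthDate = new DateOnly(1966, 3, 4) });

        // Different Id, same name and birth date: rejected as a duplicate.
        members.Add(new() { Id = 4, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956, 9, 24) });

        return members;
    }
}
```

Build returns three members: Karen Payne, Sam Smith and Frank Adams. The second Karen Payne never enters the set because her hash code and Equals result match the first one.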

Example 3

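In this example we will introduce UnionWith, which adds every element of another collection to the set and skips any element that is already present.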

Base data.

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

Add items with UnionWith.

var peopleSet = PeopleData();

peopleSet.UnionWith([
    new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
        BirthDate = new DateOnly(1956,9,24)},
    new() { Id = 2, FirstName = "Sam", LastName = "Smith", 
        BirthDate = new DateOnly(1976,3,4) },
    new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
        BirthDate = new DateOnly(1966,3,4) },
    new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
        BirthDate = new DateOnly(1956,9,24) }
]);

Result

Shows result from UnionWith
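UnionWith does not report how many items it added, but comparing Count before and after the call gives that number. A small sketch, using an invented helper named UnionWithDemo.Merge:

```csharp
using System.Collections.Generic;

// Hypothetical helper, not taken from the article's repository.
public static class UnionWithDemo
{
    // Merges incoming items into the existing set and returns
    // how many of them were actually new.
    public static int Merge<T>(ISet<T> existing, IEnumerable<T> incoming)
    {
        var before = existing.Count;

        // Items already in the set, and repeats within incoming, are ignored.
        existing.UnionWith(incoming);

        return existing.Count - before;
    }
}
```

Applied to the data above, the set starts with two people and the incoming collection contributes only Frank Adams, so Merge would return 1.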

Example 4

In this example we will introduce ExceptWith, which removes from the current set every element that is also in the specified collection. This method is an O(n) operation, where n is the number of elements in the other parameter.

Base data

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

Using ExceptWith

private static FrozenSet<Person> PeopleExceptWith()
{
    ShowExecutingMethodName();

    var peopleSet = PeopleData();

    peopleSet.ExceptWith([
        new() { Id = 2, FirstName = "Sam", LastName = "Smith", 
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
            BirthDate = new DateOnly(1966,3,4) },
    ]);


    return peopleSet.ToFrozenSet();
}

Result

Shows results for using ExceptWith
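ExceptWith changes the set in place and silently ignores items that are not in it (Frank Adams in the sample above). When it is useful to see what would be removed before committing, one option is to intersect a copy of the set with the removal list. This is only a sketch; ExceptWithDemo and WouldRemove are invented names.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not taken from the article's repository.
public static class ExceptWithDemo
{
    // Returns the items that ExceptWith would remove, without
    // modifying the original collection.
    public static HashSet<T> WouldRemove<T>(IEnumerable<T> current, IEnumerable<T> toRemove)
    {
        // Work on a copy so the caller's set stays untouched.
        var copy = new HashSet<T>(current);

        // Keep only the items that appear in both collections.
        copy.IntersectWith(toRemove);

        return copy;
    }
}
```

For the data above, the preview would contain only Sam Smith, since Frank Adams is not in the base set.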

Example 5

In this example, data is read from Excel as shown below.

Excel sheet with two duplicate rows

The model below is used by ExcelMapper to read the above worksheet.

The Equals method compares the properties Company (string), Country (string) and JoinDate (DateOnly) to decide whether two rows are duplicates.

public partial class Customers : INotifyPropertyChanged, IEquatable<Customers>
{
    public int Id { get; set; }

    public string Company { get; set; }

    public string ContactType { get; set; }

    public string ContactName { get; set; }

    public string Country { get; set; }

    public DateOnly JoinDate { get; set; }

    public event PropertyChangedEventHandler? PropertyChanged;

    protected virtual void OnPropertyChanged(string propertyName)
    {
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
    }

    public bool Equals(Customers other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return Company == other.Company && Country == other.Country && JoinDate.Equals(other.JoinDate);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((Customers)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = (Company != null ? Company.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ (Country != null ? Country.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ JoinDate.GetHashCode();
            return hashCode;
        }
    }
}

When reading the sheet, the first row, which defines the column names, is not read as data.

  • Create an instance of ExcelMapper.
  • Read the worksheet using ExcelMapper to a list.
  • Feed the above list to a HashSet, which is then used as an ISet.
  • Validate that no duplicates were added.

private static async Task ReadFromExcel()
{

    ShowExecutingMethodName1();

    const string excelFile = "ExcelFiles\\Customers.xlsx";
    ExcelMapper excel = new();

    var customers = (await excel.FetchAsync<Customers>(excelFile, nameof(Customers)))
        .ToList();

    AnsiConsole.MarkupLine($"[cyan]Read {customers.Count}[/]");

    /*
     * There are two duplicates so the next count is two less
     */
    ISet<Customers> customersSet = new HashSet<Customers>(customers);
    AnsiConsole.MarkupLine($"[cyan]Afterwards {customersSet.Count}[/]");

    List<Customers> customersList = [.. customersSet];

}

Result
Read 92
Afterwards 90
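The counts show that two rows were dropped, but not which ones. One way to list the offending rows is to group the original list by the item itself, which relies on the same Equals and GetHashCode implementation the HashSet uses. A sketch under that assumption, with invented names DuplicateReport and DuplicateCounts:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not taken from the article's repository.
public static class DuplicateReport
{
    // Returns each item that occurs more than once, with its occurrence count.
    // Grouping by the item uses the type's own equality definition.
    public static Dictionary<T, int> DuplicateCounts<T>(IEnumerable<T> rows)
        where T : notnull
        => rows
            .GroupBy(row => row)
            .Where(group => group.Count() > 1)
            .ToDictionary(group => group.Key, group => group.Count());
}
```

Passing the customers list read from Excel would return the rows that appear more than once, which can then be logged or shown to the user before the data is stored.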

Summary

With the provided code samples, preventing duplicates when adding to an existing collection becomes easy. Other operations, such as removing items, are covered as well.

A word of advice: if an operation involves a large dataset that will eventually be pushed to a database, consider using functionality in the database, e.g. SQL Server MERGE or a unique index.


Karen Payne | Sciencx (2024-07-21T14:49:24+00:00) C# Dealing with duplicates. Retrieved from https://www.scien.cx/2024/07/21/c-dealing-with-duplicates/