Removing duplicate images

Enhance datasets by detecting and eliminating duplicate and near-duplicate images

Duplicate or near-duplicate images can introduce bias in deep learning models. Encord Active can identify these images so that they can be removed from a dataset. Removing redundant samples improves data quality and can, in turn, improve model performance.

This workflow uses the Image Singularity quality metric to identify duplicate and near-duplicate images.

Image singularity

The Image Singularity metric evaluates every image in the dataset and assigns each one a uniqueness score that indicates how distinct it is from the other images.

  • The uniqueness score falls within the [0,1] range. A higher score indicates a greater level of image uniqueness.
  • A score of zero means that at least one identical image exists in the dataset. In a group of N identical images, N-1 are assigned a score of zero and one keeps a non-zero score, so excluding the zero-score images removes the redundant copies while retaining one.
  • Near-duplicate images are labeled as Near-duplicate image and are presented side by side in the Explorer's grid view. This setup simplifies the decision-making process when selecting which image to keep and which one to remove.
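
The scoring rules above can be illustrated with a minimal sketch. It is not Encord Active's implementation: it assumes each image is represented by a small embedding vector, uses the distance to the nearest non-identical image as the score, and clips that distance to [0,1]. The function name and the sample vectors are hypothetical.

```python
import math

def uniqueness_scores(embeddings):
    """Toy uniqueness score in [0, 1] (an assumption, not Encord's algorithm)."""
    scores = []
    for i, current in enumerate(embeddings):
        # An identical image earlier in the list makes this one a redundant copy.
        if any(current == earlier for earlier in embeddings[:i]):
            scores.append(0.0)
            continue
        # Otherwise score by the distance to the nearest non-identical image.
        distances = [math.dist(current, other) for other in embeddings if other != current]
        scores.append(min(1.0, min(distances)) if distances else 1.0)
    return scores

# Three identical vectors, one near-duplicate of them, and one distinct image.
embeddings = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.03, 0.04), (0.9, 0.9)]
scores = uniqueness_scores(embeddings)
```

In this sketch, two of the three identical images receive a score of zero, the first copy and its near-duplicate receive a small positive score, and the distinct image receives the maximum score.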

Walkthrough

Go to the Data tab within the Summary page and pick the Image Singularity quality metric from the drop-down menu in the Metric Distribution section.

Distribution of data based on Image Singularity scores

The chart displays the distribution of data based on the Image Singularity scores. The example image illustrates a project containing around 200 duplicate images.

Proceed to the Explorer page and choose the Image Singularity quality metric from the Order by drop-down. This menu is positioned above the natural language search bar and enables data to be organized according to the chosen criteria.

Ordering data by Image Singularity

Choose any sample and click its corresponding Similar items button. This action will display images similar to the selected one, including any duplicates if they exist.
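
How the Similar items button ranks images is not described here. One common approach to this kind of search is a nearest-neighbour lookup over image embeddings, sketched below under that assumption. The function names, file names, and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar_items(query_id, embeddings, top_k=3):
    """Return the ids of the top_k images most similar to query_id."""
    query = embeddings[query_id]
    ranked = sorted(
        (other for other in embeddings if other != query_id),
        key=lambda other: cosine_similarity(query, embeddings[other]),
        reverse=True,
    )
    return ranked[:top_k]

embeddings = {
    "cat_1.jpg": (1.0, 0.0, 0.2),
    "cat_1_copy.jpg": (1.0, 0.0, 0.2),  # exact duplicate of cat_1.jpg
    "cat_2.jpg": (0.9, 0.1, 0.3),
    "truck.jpg": (0.0, 1.0, 0.1),
}
neighbours = similar_items("cat_1.jpg", embeddings, top_k=2)
```

With these sample vectors, the exact duplicate ranks first and the visually close image second, which mirrors how duplicates surface at the top of a similarity search.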

Displaying similar images based on the similarity search query

Removing duplicate images

To remove duplicate images from a dataset, users can tag the images to keep and then create a subset of the dataset that contains no duplicates.

  1. Access the Explorer page and ensure that a data metric is chosen in the Group by drop-down. This step ensures that the Explorer's grid view shows data items.
  2. Tag all images with a data tag, such as non-duplicate images, by utilizing the SELECT ALL button followed by the TAG button. This operation is known as bulk tagging. Afterwards, click the CLEAR SELECTION button to reset the selection.
  3. Select the Image Singularity quality metric within the FILTERS button. Adjust the range slider for this metric so that it covers only a score of zero, that is [0,0], which is the score assigned to duplicate images. This step uses the standard filter.
  4. Click the SELECT ALL button to choose all image duplicates. Then, utilize the TAG button to remove the non-duplicate images tag from this subset. Upon completion, click both the RESET FILTERS and CLEAR SELECTION buttons to reset the selections. As a result, the subset labeled with the non-duplicate images tag will now exclusively consist of images that are not duplicated.
  5. Choose the Data Tags option within the FILTERS button. Ensure that only the non-duplicate images tag is selected.
  6. Click the CREATE PROJECT SUBSET button and follow the provided instructions to generate a project containing exclusively non-duplicate images.
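
The logic behind these steps can be sketched in a few lines of code. This is an illustration of the tag, filter, and untag sequence only, not a call into Encord Active. It assumes the scores are available as a mapping from image id to Image Singularity score and that duplicates are the images with a score of exactly zero, as described in the Image singularity section. All names are hypothetical.

```python
TAG = "non-duplicate images"

def non_duplicate_subset(scores):
    """Mirror the tag / filter / untag steps on a mapping of image id -> score."""
    # Bulk tagging: every image starts with the tag.
    tags = {image_id: {TAG} for image_id in scores}
    # Filter to the duplicates (score of zero) and remove the tag from them.
    for image_id, score in scores.items():
        if score == 0.0:
            tags[image_id].discard(TAG)
    # Filter by the tag: only images that still carry it go into the subset.
    return [image_id for image_id, image_tags in tags.items() if TAG in image_tags]

scores = {"a.jpg": 0.42, "a_copy.jpg": 0.0, "b.jpg": 0.87, "a_copy2.jpg": 0.0}
subset = non_duplicate_subset(scores)
```

Tagging everything first and then removing the tag from the duplicates means the final filter only needs a single positive condition, which is why the workflow ends by filtering on the tag rather than on the score.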

Incorporating this workflow into dataset management strategies can significantly enhance data quality, eliminate redundancies, and contribute to more accurate model training and evaluation.

Removing near-duplicate images

An example of near-duplicate image pairs detected with Encord Active

Near-duplicate images are pairs in which one image differs only slightly from the other, for example through a shift, blur, or distortion. Like exact duplicates, they are redundant and should usually be removed from the dataset. Unlike exact duplicates, a decision is needed about which sample to keep and which to discard. These images have scores slightly above 0 and are displayed next to one another in the grid view, which makes them easy to compare.

To proceed:

  1. Tag all images with a data tag, using the SELECT ALL button followed by the TAG button. Afterwards, click the CLEAR SELECTION button to reset the selection.
  2. To focus on images with very low uniqueness scores, select the Image Singularity quality metric within the FILTERS button and adjust the range slider for this metric to cover the range [0,0.05].
  3. Review the images and remove the tag from those that should be excluded from the project.
  4. Follow steps 5 and 6 of the Removing duplicate images section to filter by the tag and create the project subset.
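
The range filter in step 2 can be sketched as follows. The snippet only shortlists candidates for review; choosing which image of each pair to keep remains a manual decision. It assumes the scores are available as a mapping from image id to Image Singularity score, and the function and file names are hypothetical.

```python
def review_candidates(scores, low=0.0, high=0.05):
    """Return (image_id, score) pairs inside [low, high], lowest score first."""
    in_range = [(image_id, score) for image_id, score in scores.items() if low <= score <= high]
    return sorted(in_range, key=lambda pair: pair[1])

scores = {
    "street.jpg": 0.61,
    "street_blur.jpg": 0.02,
    "street_shift.jpg": 0.04,
    "dup.jpg": 0.0,
}
candidates = review_candidates(scores)
```

Because the range starts at zero, exact duplicates appear in the same shortlist as near-duplicates and can be reviewed in the same pass.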

With these actions, users can efficiently manage near-duplicate images and improve dataset quality.