imagededup

Finding duplicate images made easy!

Image Deduplicator (imagededup)

imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.

This package provides functionality to make use of hashing algorithms that are particularly good at finding exact
duplicates as well as convolutional neural networks which are also adept at finding near duplicates. An evaluation
framework is also provided to judge the quality of deduplication for a given dataset.
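
To illustrate why hashing suits exact duplicates, here is a minimal sketch of difference hashing on tiny grayscale grids. It is not the package's implementation: the function names `dhash_bits` and `hamming_distance` and the sample grids are made up for illustration, and a real perceptual hash would first resize the image to a small grayscale grid.

```python
from typing import List


def dhash_bits(gray: List[List[int]]) -> int:
    """Toy difference hash: one bit per horizontally adjacent pixel pair.

    A bit is 1 when the left pixel is brighter than its right neighbour.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Hypothetical 4x5 grayscale grids standing in for downscaled images.
original = [
    [10, 20, 30, 40, 50],
    [50, 40, 30, 20, 10],
    [10, 50, 10, 50, 10],
    [90, 80, 85, 70, 75],
]
# Same picture with uniformly raised brightness: gradients are unchanged.
brighter = [[min(255, p + 25) for p in row] for row in original]
# Mirrored picture: most gradients flip.
mirrored = [list(reversed(row)) for row in original]

h_original = dhash_bits(original)
h_brighter = dhash_bits(brighter)
h_mirrored = dhash_bits(mirrored)

print(hamming_distance(h_original, h_brighter))
print(hamming_distance(h_original, h_mirrored))
```

Because the hash only records brightness gradients, a uniform brightness change leaves it untouched, while a stronger transformation such as mirroring changes many bits. This is the kind of case where a learned representation such as a CNN tends to be the better choice.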

Detailed documentation of the functionality provided by the package can be found at: https://idealo.github.io/imagededup/

imagededup is compatible with Python 3.6+ and runs on Linux, macOS and Windows.
It is distributed under the Apache 2.0 license.

⚙️ Installation

There are two ways to install imagededup:

  • Install imagededup from PyPI (recommended):
pip install imagededup

⚠️ Note: The TensorFlow 1.15 and TensorFlow >=2.1 releases include GPU support by default.
Before that, the CPU and GPU packages were separate. If you have GPUs, you should
install a TensorFlow version with GPU support, especially when you use the CNN method to find duplicates,
as it is much faster. See the TensorFlow guide for more
details on how to install it for older versions of TensorFlow.

  • Install imagededup from the GitHub source:
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install
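
After either installation route, a quick check like the following can confirm that the package is visible to the current interpreter. This is only a sketch: the helper name `installed_version` is made up for illustration, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("imagededup")
print(f"imagededup {version}" if version else "imagededup is not installed")
```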

Quick Start

To find duplicates in an image directory using perceptual hashing, the following workflow can be used:

  • Import perceptual hashing method
from imagededup.methods import PHash
phasher = PHash()
  • Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
  • Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
  • Plot duplicates obtained for a given file (e.g. 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')

The output is a plot of the given file alongside the duplicates found for it.

The complete code for the workflow is:

from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# Plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
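
Once the duplicates dictionary is available, a common next step is deciding which files to delete. The sketch below assumes the dictionary maps each filename to a list of its duplicate filenames. The helper `files_to_remove` and the sample data are hypothetical and not part of the package. The package appears to offer its own `find_duplicates_to_remove` method for this purpose, so check the documentation before relying on a hand-rolled version.

```python
from typing import Dict, List, Set


def files_to_remove(duplicate_map: Dict[str, List[str]]) -> List[str]:
    """Keep the first file of each duplicate group and list the rest for removal.

    Files are visited in sorted order so the result is deterministic.
    """
    keep: Set[str] = set()
    remove: Set[str] = set()
    for filename in sorted(duplicate_map):
        if filename in remove:
            # Already marked as a duplicate of a file we decided to keep.
            continue
        keep.add(filename)
        for duplicate in duplicate_map[filename]:
            if duplicate not in keep:
                remove.add(duplicate)
    return sorted(remove)


# Hypothetical result in the assumed shape: filename -> list of duplicates.
example_duplicates = {
    "a.jpg": ["b.jpg", "c.jpg"],
    "b.jpg": ["a.jpg"],
    "c.jpg": ["a.jpg"],
    "d.jpg": [],
}

to_remove = files_to_remove(example_duplicates)
print(to_remove)
```

Here `a.jpg` is kept as the representative of its group, `b.jpg` and `c.jpg` are listed for removal, and `d.jpg` is untouched because it has no duplicates.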

For more examples, refer to the examples in the repository.

For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/

⏳ Benchmarks

Detailed benchmarks on speed and classification metrics for the different methods are provided in the documentation.
Generally speaking, the following conclusions can be drawn:

  • CNN works best for near duplicates and datasets containing transformations.
  • All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
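
To make the idea of classification metrics for deduplication concrete, the sketch below computes pair-level precision and recall from a ground-truth mapping and a retrieved mapping. It is a simplified illustration, not the package's evaluation framework: the functions `to_pairs` and `precision_recall`, the pair-level definition and the sample mappings are all assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def to_pairs(mapping: Dict[str, List[str]]) -> Set[FrozenSet[str]]:
    """Convert a filename -> duplicates mapping into a set of unordered pairs."""
    pairs: Set[FrozenSet[str]] = set()
    for filename, duplicates in mapping.items():
        for duplicate in duplicates:
            if duplicate != filename:
                pairs.add(frozenset((filename, duplicate)))
    return pairs


def precision_recall(
    ground_truth: Dict[str, List[str]],
    retrieved: Dict[str, List[str]],
) -> Tuple[float, float]:
    """Pair-level precision and recall of retrieved duplicates."""
    true_pairs = to_pairs(ground_truth)
    found_pairs = to_pairs(retrieved)
    correct = true_pairs & found_pairs
    precision = len(correct) / len(found_pairs) if found_pairs else 0.0
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical data: two true duplicate pairs, one found and one false match.
ground_truth = {
    "a.jpg": ["b.jpg"],
    "b.jpg": ["a.jpg"],
    "c.jpg": ["d.jpg"],
    "d.jpg": ["c.jpg"],
}
retrieved = {
    "a.jpg": ["b.jpg", "c.jpg"],
    "b.jpg": ["a.jpg"],
    "c.jpg": ["a.jpg"],
    "d.jpg": [],
}

precision, recall = precision_recall(ground_truth, retrieved)
print(precision, recall)
```

Treating duplicates as unordered pairs avoids counting the same match twice when both files list each other.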

Contribute

We welcome all kinds of contributions.
See the Contribution guide for more details.

Citation

Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}

Maintainers

See LICENSE for details.

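Key metrics

Overview
  • Name and owner: idealo/imagededup
  • Primary language: Python
  • Languages: Python (4 languages in total)
  • License: Apache License 2.0

Owner activity
  • Created: 2019-04-05 12:10:54
  • Last pushed: 2025-05-15 14:54:38
  • Last commit: 2025-05-07 21:37:02
  • Releases: 9
  • Latest release: v03.3
  • First release: v0.1.0

User engagement
  • Stars: 5.4k
  • Watchers: 63
  • Forks: 465
  • Commits: 531
  • Issues: 131
  • Open issues: 37
  • Pull requests: 64
  • Open pull requests: 8
  • Closed pull requests: 27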