Topic: Accidental reposts

Posted under General

I know the system is set up to prevent reposts, but sometimes it happens anyway... I also don't think it catches the same image at different sizes, but we want the bigger ones anyway.

A year ago, I uploaded post #114873, which was identical to post #27247 in size, color, everything (it was only available on transfur.com). For whatever reason, the site didn't catch it. (I later noticed and got it removed.)

I'm wondering if this was simply a glitch in the system, a side effect of a recent update, or just how it normally functions.

Updated

The system is set up to DETER reposts, not prevent them.
Even if a post looks identical to another in every single way, there can still be differences in the image data that cause it to not be picked up by the dupe detector.

Updated by anonymous

Riversyde said:
The system is set up to DETER reposts, not prevent them.
Even if a post looks identical to another in every single way, there can still be differences in the image data that cause it to not be picked up by the dupe detector.

I see. Well, thanks for clarifying that.

Updated by anonymous

The MD5 hash is computed over the entire file, so it reflects not only the pixels in the image but any other data stored alongside them: JPG EXIF data, embedded creation timestamps, tags (if added by a program such as Picasa), and so on. We use MD5 hashes to detect dupes because it works 95% of the time and it's extremely simple to code. The ones that slip through are easy enough to delete by hand.
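
In sketch form, that check is just a hash over the raw file bytes; something like this in Python (illustrative only, not the site's actual code):

    import hashlib

    def file_md5(path):
        # Hash the raw file bytes: pixel data, EXIF, embedded metadata
        # and all. A one-byte difference anywhere in the file produces
        # a completely different digest.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

Two uploads only collide in the dupe check if every byte matches, which is why a stripped EXIF block or a simple re-save is enough to slip past it.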

Updated by anonymous

This makes me wonder if there's a way to just ignore the file wrapper and check just the pixels, catching that remaining 5%. After all, the MD5 function can be applied to any string, right?

Updated by anonymous

Adrian_Blazevic said:
This makes me wonder if there's a way to just ignore the file wrapper and check just the pixels, catching that remaining 5%. After all, the MD5 function can be applied to any string, right?

I guess you could render the pic into a buffer and apply the md5 algorithm to that buffer.
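
Something like this, assuming an image decoder such as Pillow is available (a sketch of the idea, not tested against every format the site accepts):

    import hashlib
    from PIL import Image  # Pillow

    def pixel_md5(path):
        # Decode the image and hash only the pixel buffer, ignoring
        # file headers and metadata. Converting to RGBA normalizes
        # different bit depths and palette modes to one layout.
        img = Image.open(path).convert("RGBA")
        return hashlib.md5(img.tobytes()).hexdigest()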

Updated by anonymous

Munkelzahn said:
I guess you could render the pic into a buffer and apply the md5 algorithm to that buffer.

If that were easy to do, it would be even easier to skip the render and md5sum the raw image data after the file header.

Granted, that wouldn't detect identical images uploaded in different file formats. But rendering has exactly the same problem whenever at least one of the two formats uses lossy compression (*cough*JPGs*cough*): the decoded pixels won't match exactly. Doing the render would only additionally catch identical images that are both represented in lossless formats, or a lossless conversion of a lossy original, and those are less common cases than comparing two JPGs, or a JPG created from a lossless original.

md5sum'ing the image payload of the file would still catch cases where someone nuked the headers or metadata but left the image data itself alone and in the same format as before.
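
For PNG, for instance, a minimal sketch of "hash the payload, skip the wrapper" just walks the chunk list and hashes the IDAT data (illustrative only; JPG would need its own marker-walking logic):

    import hashlib
    import struct

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def png_payload_md5(path):
        # Hash only the concatenated IDAT chunks (the compressed image
        # data), skipping the signature, header, text/time chunks and
        # anything else someone might have edited.
        h = hashlib.md5()
        with open(path, "rb") as f:
            if f.read(8) != PNG_SIGNATURE:
                raise ValueError("not a PNG file")
            while True:
                head = f.read(8)
                if len(head) < 8:
                    break
                length, ctype = struct.unpack(">I4s", head)
                data = f.read(length)
                f.read(4)  # skip the per-chunk CRC
                if ctype == b"IDAT":
                    h.update(data)
                if ctype == b"IEND":
                    break
        return h.hexdigest()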

In any event, trying to improve the detection will still only catch a small number of outliers, since most dupes have material changes in the image: the artist's sig is erased, watermarks are removed, or the image is resized; or someone was simply a douche and turned the compression down to $#@! because a 3000x3000 image looks perfectly fine on their 10" CRT at 640x480 at 30% JPG quality, and it's only 50KB! Such a savvy and considerate use of image editing software! (Some hyperbole may exist in that last example, for the sake of entertainment.)

Updated by anonymous

True. It would be interesting to see image-similarity algorithms put to use, but I don't know how computationally intensive or trade-secret those are.

Come to think of it, Google Images lists e621.net images. Maybe one could search by image and filter to show only e621 results.

Updated by anonymous

The only way I know how to implement this would mean breaking out OpenCV. It's certainly a widely studied topic in computer science.

This is probably what we want: http://phash.org/

It is a hash with an associated comparison function that returns the estimated similarity of two images given their hashes (so, not a cryptographic hash!). The comparison function is simple enough that it could be implemented in some databases, and the hashing itself seems able to process a dozen images a second on a relatively old computer here.

The program is quite impressive: it seems to resist compression, resizing, slight rotation, and watermarks, which covers pretty much every reason we would mark an image as a duplicate.
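
To give a feel for how this family of hashes works, here is a toy average hash in Python with Pillow. It is much cruder than pHash's DCT-based hash, but the comparison step is the same idea: count the differing bits (the Hamming distance) and treat a small distance as a probable duplicate.

    from PIL import Image  # Pillow

    def average_hash(path, size=8):
        # Toy perceptual hash: shrink to a size x size grayscale
        # thumbnail, then set one bit per pixel depending on whether
        # it is brighter than the mean. Resizing discards the fine
        # detail that compression, watermarks, etc. tend to change.
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > mean else 0)
        return bits

    def hamming_distance(a, b):
        # Number of differing bits; 0 means identical hashes.
        return bin(a ^ b).count("1")

    # e.g. hamming_distance(average_hash("a.png"), average_hash("b.jpg")) <= 5
    # would flag the pair as probable duplicates.

That comparison is also why it could live inside a database: it is just a bit count over the XOR of two integers.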

Updated by anonymous

Adrian_Blazevic said:
True. It would be interesting to see image-similarity algorithms put to use, but I don't know how computationally intensive or trade-secret those are.

Come to think of it, Google Images lists e621.net images. Maybe one could search by image and filter to show only e621 results.

One can do that now by adding "site:e621.net" to the beginning of the search.

Updated by anonymous
