Topic: JPG IPTC-tag

Posted under General

I was wondering if the booru system (or whatever its correct name might be) can transfer booru tags into the IPTC tags of JPG pictures.
It'd be pretty useful to have categories and the artist stored right in the image file.

Updated by VulpesFoxnik

ktkr

Former Staff

I say no:
1. Storing info inside the file itself automatically changes its hash, leading to a large increase in duplicates (esp. when booru tags change) on this board and on other boards as well.
2. Any original info that was in the IPTC tag will be lost.
3. This will require extra programming for every format (jpg, gif, png, flash).
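
To illustrate point 1, here is a minimal sketch of why writing tags into a file breaks hash-based duplicate detection. The byte strings are stand-ins (not a real JPEG, and not how IPTC data is actually laid out), and `md5_hex` is just a helper name made up for this example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Stand-in bytes for an image file; this is NOT a real JPEG.
original = b"\xff\xd8 fake image payload \xff\xd9"

# Simulate embedding tags by inserting extra bytes into the file.
tagged = original[:2] + b"[tags: fox artist_name]" + original[2:]

print(md5_hex(original))
print(md5_hex(tagged))
print(md5_hex(original) == md5_hex(tagged))  # False
```

The picture itself would look identical, but any check that compares MD5 sums now treats the two files as different.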

Updated by anonymous

ktkr said:
I say no:
1. Storing info inside the file itself automatically changes its hash, leading to an increase in duplicates (esp. when booru tags change) on this board and on other boards as well.
2. Any original info that was in the IPTC tags will be lost.
3. This will require extra programming for every format (jpg, gif, png, flash).

I've thought of this for my own personal database. The only way to compensate for it is to store the data in the EXIF/IPTC tags, along with the artist and the original file name. Not to mention you're right about some formats using completely different methods of reading and writing them. The old metadata can be imported if it exists, with the system's own metadata appended after a clear marker string.

Flash, however, poses the most serious problem. I don't think it supports metadata at all. The only other way of storing such data is with a text file, either concatenated at the end of the file or simply kept alongside it as a separate text file with the same name, in this case the file's MD5 sum.
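
A rough sketch of the second option (a separate text file named after the file's MD5 sum), assuming one tag per line. The function names and the `.txt` extension are choices made for this example, not part of any existing tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sidecar_path(file_path: Path) -> Path:
    """Path of the text file that sits next to file_path, named <md5>.txt."""
    digest = hashlib.md5(file_path.read_bytes()).hexdigest()
    return file_path.with_name(digest + ".txt")


def write_tags(file_path: Path, tags: list) -> Path:
    """Store one tag per line in the sidecar file; the media file is untouched."""
    path = sidecar_path(file_path)
    path.write_text("\n".join(tags) + "\n", encoding="utf-8")
    return path


def read_tags(file_path: Path) -> list:
    """Load the tags back from the sidecar file."""
    return sidecar_path(file_path).read_text(encoding="utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    swf = Path(tmp) / "animation.swf"
    swf.write_bytes(b"stand-in bytes, not a real Flash file")

    before = hashlib.md5(swf.read_bytes()).hexdigest()
    sidecar = write_tags(swf, ["fox", "artist_name"])
    loaded = read_tags(swf)
    after = hashlib.md5(swf.read_bytes()).hexdigest()

print(loaded)           # ['fox', 'artist_name']
print(before == after)  # True
```

Because the media file is never modified, its hash stays the same, which is the property the concatenation approach would lose.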

I would also like to point out that MD5 is not a good hash function. Collisions do happen from time to time, although my understanding is that one file must be much larger than the other.

Updated by anonymous

Thanks for your input guys

VulpesFoxnik said:
I've thought of this for my own personal database. The only way to compensate for it is to store the data in the EXIF/IPTC tags, along with the artist and the original file name. Not to mention you're right about some formats using completely different methods of reading and writing them. The old metadata can be imported if it exists, with the system's own metadata appended after a clear marker string.

Flash, however, poses the most serious problem. I don't think it supports metadata at all. The only other way of storing such data is with a text file, either concatenated at the end of the file or simply kept alongside it as a separate text file with the same name, in this case the file's MD5 sum.

I would also like to point out that MD5 is not a good hash function. Collisions do happen from time to time, although my understanding is that one file must be much larger than the other.

(I've heard png has a tag with information fields similar to IPTC, but it's not widely supported (e.g. Irfanview doesn't support it).)
edit: okay, this guy says no: http://tech.kateva.org/2006/04/why-png-sucks-its-metadata-stupid.html
apparently png lets you define custom fields but nobody has figured out how to get it to work

gif doesn't have anything like it either.
Personally I don't think it's needed for flash because given the small amount of worthwhile animations out there you'll find the one you're looking for on your hdd in no time.

Since the majority of pictures are stored as jpg I don't think anyone would complain if the tag-transfer was only implemented for those files. That is if the hash/duplicate problem can be bypassed somehow but I'm not familiar with the technicalities..

Updated by anonymous

I don't think that e621 should modify the contents of image files, since it changes the hash, and makes duplicates significantly harder to spot without fuzzy-matching.

Using the filename of an image to store the tags is a much better idea, since it doesn't change the hash of the file.
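
A minimal sketch of the filename idea. The obvious catch is the filename length limit; the 255-byte figure below is an assumption (it is a common limit on many filesystems, but check your own), and `tags_to_filename` is a made-up helper name.

```python
MAX_NAME_BYTES = 255  # assumed limit; common on many filesystems, not universal


def tags_to_filename(tags, extension, max_bytes=MAX_NAME_BYTES):
    """Pack tags into one filename, in order, until the next tag would not fit.

    Returns (filename, dropped_tags). Everything from the first tag that
    does not fit onwards is dropped.
    """
    kept = []
    for index, tag in enumerate(tags):
        candidate = " ".join(kept + [tag]) + extension
        if len(candidate.encode("utf-8")) > max_bytes:
            return " ".join(kept) + extension, list(tags[index:])
        kept.append(tag)
    return " ".join(kept) + extension, []


name, dropped = tags_to_filename(["fox", "solo", "artist_name"], ".jpg")
print(name)     # fox solo artist_name.jpg
print(dropped)  # []

many_tags = ["tag%03d" % i for i in range(100)]
long_name, long_dropped = tags_to_filename(many_tags, ".jpg")
print(len(long_name.encode("utf-8")), len(long_dropped))  # 255 64
```

Renaming leaves the file's contents and hash alone, but heavily tagged posts will not fit in a single filename.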

Updated by anonymous

Kitsu~ said:
I don't think that e621 should modify the contents of image files, since it changes the hash, and makes duplicates significantly harder to spot without fuzzy-matching.

Using the filename of an image to store the tags is a much better idea, since it doesn't change the hash of the file.

That'd be a fast and risk-free solution; unfortunately it only allows for a limited number of tags. But being the tag-fetishist that I am, I might be the only one who has a problem with that.

fuzzy-matching is not an option because it's too much work i suppose?

Updated by anonymous

SpanKing said:
That'd be a fast and risk-free solution; unfortunately it only allows for a limited number of tags. But being the tag-fetishist that I am, I might be the only one who has a problem with that.

fuzzy-matching is not an option because it's too much work i suppose?

It's not a perfect way of detecting duplicates, especially with black and white images.

Updated by anonymous

Kitsu~ said:
It's not a perfect way of detecting duplicates, especially with black and white images.

It's also processor-intensive, more so than a simple MD5, and it still requires human oversight.

Updated by anonymous

VulpesFoxnik said:
It's also processor-intensive, more so than a simple MD5, and it still requires human oversight.

Actually, it isn't as intensive as you might think. In my image collection application, I can compare an image to 200k other images in less than 0.001 seconds, although it is mostly useless with non-colour images.

Updated by anonymous

Kitsu~ said:
Actually, it isn't as intensive as you might think. In my image collection application, I can compare an image to 200k other images in less than 0.001 seconds, although it is mostly useless with non-colour images.

What image comparison program are you using? I've been using findimagedup, and it takes forever.

Updated by anonymous

VulpesFoxnik said:
What image comparison program are you using? I've been using findimagedup, and it takes forever.

I built my own in PHP. It resizes an image to 3x3 and stores each pixel's red, green and blue values in MySQL. Then it just compares one image to the database by checking for images with similarly coloured pixels.

It works surprisingly well despite its simplicity.
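
For anyone curious, the approach can be sketched roughly like this. It is Python rather than PHP, keeps everything in memory instead of MySQL, and works on plain nested lists of (r, g, b) tuples instead of decoded image files. The function names and the `tolerance` value are assumptions for illustration, not the actual application's code.

```python
def signature(pixels, grid=3):
    """Average an image down to grid x grid cells.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    The image must be at least grid x grid pixels.
    Returns a flat list of grid*grid mean (r, g, b) tuples.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(tuple(sum(p[c] for p in block) / len(block) for c in range(3)))
    return cells


def distance(sig_a, sig_b):
    """Largest per-channel difference between two signatures."""
    return max(abs(a - b) for ca, cb in zip(sig_a, sig_b) for a, b in zip(ca, cb))


def is_similar(sig_a, sig_b, tolerance=10):
    """Treat two images as likely duplicates if every cell is within tolerance."""
    return distance(sig_a, sig_b) <= tolerance


# A 6x6 test image, a slightly altered copy, and a clearly different image.
base = [[(x * 40, y * 40, 100) for x in range(6)] for y in range(6)]
near = [[(r + 3, g, b) for (r, g, b) in row] for row in base]
other = [[(255 - r, g, b) for (r, g, b) in row] for row in base]

print(is_similar(signature(base), signature(near)))   # True
print(is_similar(signature(base), signature(other)))  # False
```

In a database-backed version, the nine cells would presumably become columns and the tolerance check a range query on each of them.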

Updated by anonymous

Kitsu~ said:
I built my own in PHP. It resizes an image to 3x3 and stores each pixel's red, green and blue values in MySQL. Then it just compares one image to the database by checking for images with similarly coloured pixels.

It works surprisingly well despite its simplicity.

That sounds nice, so this could actually be implemented into this site? In case there's more than one result the site could display a list of them.. If 9 pixels aren't accurate enough a 4*4 array shouldn't need that much additional processing power either.

@Vulpes
If you need a good duplicate finder, check out Visipics

Updated by anonymous

SpanKing said:
That sounds nice, so this could actually be implemented into this site? In case there's more than one result the site could display a list of them.. If 9 pixels aren't accurate enough a 4*4 array shouldn't need that much additional processing power either.

@Vulpes
If you need a good duplicate finder, check out Visipics

The software that runs the e621 site, Jikanbako, is written in Ruby. Although the language differences between Ruby and PHP are most likely minor, PHP is not thread safe, is relatively unsuited for complex web servers, and is generally a poor language to write this in because of its poor handling of variables. (No offense Kitsu~, all of my stuff is written in bash, when I know I should probably convert to Ruby or Python.)

And I only run Linux. It looks like Visipics is Windows software, and honestly I don't want to be running in 32-bit mode for this process.

Updated by anonymous

How about imagemagick's compare tool then?

Updated by anonymous

Jazz said:
How about imagemagick's compare tool then?

I wasn't aware it had one, however it does make sense since it has Perl plug-ins... I'm not sure if ruby does or not, it may have to use pipe communication to operate with both bash and ruby.

Honestly, all I've used imagemagick for is to resize images. (I was normalizing my desktop pictures.)

Updated by anonymous

VulpesFoxnik said:
No offense Kitsu~, all of my stuff is written in bash, when I know I should probably convert to ruby or python

None taken. I ditched PHP over a month ago, and I am rewriting it in Python now.

I am a Linux user too, IIRC DigiKam has a fuzzy dupe checker.

Updated by anonymous

A 3*3*(256^3) pixel comparator that stores colours is going to be pretty awful for a database the size of e621's. Many, many false positives, amongst greyscale stuff especially.

If you're looking to implement a fuzzy checker, you want to be thinking outside raw RGB and pixels. Break the fucker down into its component wavelets, and store just the most important bits: http://grail.cs.washington.edu/projects/query/mrquery.pdf for example: this is the one used in ImgSeek. pywt and numpy, with a little PIL do the job for me.
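
A toy sketch of the wavelet idea, to show what "store just the most important bits" can look like. It is pure Python on a tiny greyscale matrix rather than pywt/numpy/PIL, it uses an unnormalised Haar transform, and it assumes a square image whose side is a power of two. The function names and the `keep` parameter are made up here; this is not ImgSeek's code and leaves out the weighting and colour handling a real implementation needs.

```python
def haar_1d(values):
    """Full 1D Haar decomposition using plain averages and differences."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_2d(matrix):
    """Decompose every row, then every column of the result."""
    rows = [haar_1d(row) for row in matrix]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def signature(matrix, keep=4):
    """Return (overall average, set of (y, x, is_positive)) for the
    `keep` largest-magnitude non-zero detail coefficients."""
    coeffs = haar_2d(matrix)
    size = len(coeffs)
    ranked = sorted(
        (
            (abs(coeffs[y][x]), y, x)
            for y in range(size)
            for x in range(size)
            if (y, x) != (0, 0) and coeffs[y][x] != 0
        ),
        reverse=True,
    )[:keep]
    return coeffs[0][0], {(y, x, coeffs[y][x] > 0) for _, y, x in ranked}


image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [50, 50, 120, 120],
    [50, 50, 120, 120],
]
brighter = [[v + 5 for v in row] for row in image]  # same picture, slightly lighter
mirrored = [row[::-1] for row in image]             # flipped left to right

avg_a, sig_a = signature(image)
avg_b, sig_b = signature(brighter)
avg_m, sig_m = signature(mirrored)

print(avg_a, avg_b)       # 95.0 100.0
print(sig_a == sig_b)     # True
print(len(sig_a & sig_m)) # 1
```

A uniform brightness change only moves the overall average, so the detail signature still matches exactly, while the mirrored image shares just one of the three stored coefficients.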

Updated by anonymous

Anomynous said:
A 3*3*(256^3) pixel comparator that stores colours is going to be pretty awful for a database the size of e621's. Many, many false positives, amongst greyscale stuff especially.

If you're looking to implement a fuzzy checker, you want to be thinking outside raw RGB and pixels. Break the fucker down into its component wavelets, and store just the most important bits: http://grail.cs.washington.edu/projects/query/mrquery.pdf for example: this is the one used in ImgSeek. pywt and numpy, with a little PIL do the job for me.

ImgSeek is what I currently use. However, it's slow and bulky, and it doesn't work on my screen because it uses hard-coded font sizes, owing to the age of the Python it was written in.

Updated by anonymous
