Topic: [Feature] Non-destructively embed e621 metadata into uploaded images

Posted under Site Bug Reports & Feature Requests

Requested feature overview description.
When a user uploads or edits a picture, embed metadata into the original file with the post's information, such as tags, description, sources, upload date, revision date, and post ID, each field prefixed with "from_e621.net: ". The metadata would be updated with the new tags on each edit.
Why would it be useful?
*Develops redundancies for links and information in case any of the original sources go offline
*Makes it easier for users to organise personal galleries saved on a hard disk
*Supports the idea of an Internet archive, where all images can be saved and sourced for the future
What part(s) of the site page(s) are affected?
Every image post.
Notes:
Granted, the site has an obligation to be an archive and can't mess with the image too much. But if we are being an archive, then if the site or the post goes down for any reason, the user should still have a saved record of what existed on that post before it went down.

Metadata doesn't impact image quality, only takes a few bytes to add on, and the non-destructive aspect means that anything the artist embedded prior to the e621 upload will remain intact.
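As a rough illustration of "non-destructive" embedding, here is a sketch for PNG specifically: the format is a sequence of chunks, so a tEXt chunk can be inserted after the IHDR header without touching any existing chunk (the function name, keyword, and tag string below are all made up for the example; the site's actual tooling is unknown).

```python
import struct
import zlib

def add_png_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Insert a tEXt chunk immediately after IHDR, leaving every
    existing chunk (including any metadata the artist embedded) intact."""
    assert png[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    # The first chunk is always IHDR: 4-byte length, 4-byte type,
    # the data itself, then a 4-byte CRC.
    ihdr_len = struct.unpack(">I", png[8:12])[0]
    ihdr_end = 8 + 4 + 4 + ihdr_len + 4
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    chunk = struct.pack(">I", len(payload)) + b"tEXt" + payload
    # The chunk CRC covers the type bytes plus the payload.
    chunk += struct.pack(">I", zlib.crc32(b"tEXt" + payload))
    return png[:ihdr_end] + chunk + png[ihdr_end:]

# Minimal 1x1 PNG header for demonstration (no image data chunks).
ihdr_data = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
demo = (b"\x89PNG\r\n\x1a\n"
        + struct.pack(">I", 13) + b"IHDR" + ihdr_data
        + struct.pack(">I", zlib.crc32(b"IHDR" + ihdr_data)))
tagged = add_png_text_chunk(demo, "from_e621.net", "tags: canine solo")
```

The image data is untouched, so quality is unaffected and only the chunk's few dozen bytes are added; JPEG would need the same idea expressed as an APP1/EXIF or comment segment instead.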

I don't have a specific implementation in mind or an idea of how this would affect the website's efficiency. I also think .jpg has a limitation where metadata fields can't be too long or they get truncated. I understand this isn't helpful, but bear in mind I'm not a programmer (big red warning flag there).

Other limitations:
*Images saved to the user's drive will have slightly out-of-date metadata as a post changes, but this is a universal issue, seeing as images on a disk don't automatically update when their posts do online
*Would be tricky to program so the data doesn't destroy itself, overwrite the original metadata, or accumulate messy, redundant data (which is why it's a valuable skill to be able to make these things)
*Would add a few hundred bytes to the file size of the image, which is negligible compared to the size of the images themselves.
*Not sure if this is feasible for videos

Updated by savageorange

Siral_Exan said:
Are you referring to tag, description, and note history?

Oh dear, not the ENTIRE history, but what's currently visible on the page when the user saves an image.

To be honest, I think it would be infeasible to comfortably add in note data into an image.

Updated by anonymous

Any alteration to the file itself would alter its MD5 checksum, so -1

Also, I'm in general against altering the original images in any way unless necessary, e.g. file format conversion. You can always derive an altered version from the original source, but you can't recreate the original source from altered versions.
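To make the checksum objection concrete, here is a small demonstration (with made-up byte strings standing in for a real image) that appending even a short metadata footer changes the MD5 entirely:

```python
import hashlib

original = b"\x89PNG fake image bytes"
tagged = original + b"\nfrom_e621.net: tags=canine solo"

# Appending even a short metadata string produces an entirely different
# digest, so an 'md5:...' search or a hash-based dedupe tool would no
# longer recognise the saved copy as the same post.
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(tagged).hexdigest())
```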

Updated by anonymous

Personally I'd love to have multiple download options: one could have the current tags and sources added as metadata, another the pure file as it is currently shown, another could have the original filename intact, etc. etc.
Though I have absolutely no idea how this could be implemented or how much it would cost in resources to keep up.

Updated by anonymous

Mario69 said:
Any alteration to the file itself would alter its MD5 checksum, so -1

Also, I'm in general against altering the original images in any way unless necessary, e.g. file format conversion. You can always derive an altered version from the original source, but you can't recreate the original source from altered versions.

What is the checksum used for in the current infrastructure? Maybe the data could be added after the image has been uploaded, in some way that doesn't alter the sum.

NotMeNotYou said:
Personally I'd love to {...}

I predict the big problems would be 1. writing a script that fits the design goals of non-destructive and flexible metadata embedding (a one-time issue), 2. making sure the script doesn't result in higher bandwidth / CPU usage, 3. making sure the user knows how to use the thing.

Ideally it would only need one image. I wonder if it's possible to take an already cached image and add metadata to it once the user selects a download option. But that's only a concern for multiple download options - a single download format with the metadata embedded in every image would avoid these issues.
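The "assemble on download" idea could look something like this in-memory sketch (the storage dict, post ID, and tag string are all stand-ins; the real site's storage layer is unknown). The key property is that the archived original, and therefore its MD5, never changes:

```python
import hashlib

# Illustrative stand-ins for the site's file storage and tag data.
STORE = {1234: b"\xff\xd8 original jpeg bytes \xff\xd9"}
TAGS = {1234: "canine solo outdoors"}

def download(post_id: int, variant: str = "original") -> bytes:
    """Serve the archived original untouched; build the tagged
    variant per request by appending a metadata footer."""
    data = STORE[post_id]
    if variant == "tagged":
        data += ("\nfrom_e621.net: tags=" + TAGS[post_id]).encode()
    return data

# The stored file (and therefore its MD5) is never modified:
assert hashlib.md5(download(1234)).digest() == hashlib.md5(STORE[1234]).digest()
```

The trade-off is exactly the CPU/bandwidth concern above: the tagged variant either gets rebuilt per request or cached alongside the original.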

Updated by anonymous

fewrahuxo said:
What is the checksum used for in the current infrastructure?

Search ('md5:foobar') and duplicate post detection.

Offline duplicate-detection tools (e.g. fdupes, fslint, rmlint) use the same mechanism and would likewise be affected, whether they use MD5, SHA-1, SHA-256, or any other hash.

I've had a discussion with the author of some e621-related tool on these forums, about a similar idea. I can dig it up if you want.

(Personally I prefer using an independent tagging solution -- TMSU -- as that doesn't cause the complications that come with modifying files)

EDIT: Some info about the metadata situation with various file formats here

There are a number of formats (e.g. jpg, gif, png, mp4, pdf) where it is possible to simply append arbitrary bytes to the end of the file. Assuming you are OK with changing the checksum, you can, say, concatenate a zipfile containing your metadata onto the image file.
(A zipfile is nice because it's a generic container format with compression and built-in error detection (CRC checks), and it's pretty much universally readable. And it doesn't care if there is 'junk data' coming before the zipfile data.)
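A minimal sketch of that concatenation trick, using Python's standard library (the filenames and tag strings are invented for the example):

```python
import io
import zipfile

def append_metadata_zip(image_bytes: bytes, metadata: dict) -> bytes:
    """Concatenate a zip archive holding the metadata onto the image.
    JPEG/PNG/GIF decoders stop at their own end-of-image marker and
    simply ignore the trailing bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in metadata.items():
            zf.writestr(name, text)
    return image_bytes + buf.getvalue()

def read_metadata_zip(blob: bytes) -> dict:
    """Zip readers locate the archive from the end of the file, so the
    image bytes in front are treated as an ignorable prefix."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name).decode() for name in zf.namelist()}

blob = append_metadata_zip(b"\xff\xd8 fake jpeg \xff\xd9",
                           {"tags.txt": "wolf solo"})
```

Any standard unzip tool can then extract the metadata straight out of the downloaded image file.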

The downside of that is if you modify the file at all (even just editing EXIF data), the concatenated data won't be preserved.

The upside I guess is that it would be probably the most 'universally applicable' approach (and wouldn't require special tools to access the data)

Updated by anonymous

Is there an anti-duplicate solution that doesn't rely on checksums?

Updated by anonymous

Well, yes: image-search-type algorithms, where the 'fingerprint' is calculated from the actual image pixels.
However:
a) this is *resource intensive* (doing it every time someone tries to upload a new post, to check whether it would be a duplicate, isn't likely to be very practical. Though there are strategies to ameliorate this if you just want to find -exact- matches.)
b) It only works on certain file formats (for example, it wouldn't work on SWF, as it's interactive and therefore 'the content' is too difficult to determine -- solving the halting problem would be a prerequisite!).
c) It doesn't work well for animations (GIF, WEBM). This may be solvable but AFAIK no research has been done into it.
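For a feel of what a pixel-based fingerprint looks like, here is a toy "average hash" in plain Python (a deliberately simplified stand-in for the real perceptual-hash algorithms sites use; the image is just a list of rows of 0-255 grayscale values):

```python
def average_hash(pixels, hash_size=8):
    """Downsample to hash_size x hash_size by nearest-neighbour
    sampling, then emit one bit per cell: brighter than the mean or
    not. Visually similar images give hashes with a small Hamming
    distance even when their bytes (and thus their MD5s) differ."""
    h, w = len(pixels), len(pixels[0])
    flat = [pixels[r * h // hash_size][c * w // hash_size]
            for r in range(hash_size) for c in range(hash_size)]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (p > mean)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# A 16x16 image (white left half, black right half), plus a copy with
# a tiny byte-level change that is visually near-identical:
img = [[255 if c < 8 else 0 for c in range(16)] for r in range(16)]
tweaked = [row[:] for row in img]
tweaked[0][0] = 250
```

Unlike an MD5 lookup, every comparison here costs a decode plus a resample, which is the resource-intensity point above.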

IMO the reason many websites use MD5 for search (and some for deduplication) is
a) it's pretty fast, compared to other hashing algorithms.
b) Like any generic hash algorithm, it's equally applicable to every possible file format.

Updated by anonymous
