Topic: An Idea Concerning source files.

Posted under General

Today I had a thought, a thought that might* change the way I browse my favorites on my phone.

I use e621dl to download my favorites onto my phone/tablet.

Let's say I wanted to know who drew a piece of art: I would have to rely on either memory or signatures. Thinking about that, I remembered that mp3 files contain tags and artist information.

What if we somehow implemented a similar information-keeping ability in JPEG files? (Forget about PNG, apparently.)

I would love to see, with one tap, everything from that artist or character, or rating, etc.

Using the current database of information, we could sync the files using a single index file that identifies each file by its hash code. A simple and small server download, honestly... imagine the possibilities of an SDK or API for that...
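A minimal sketch of that index idea, assuming the index is a JSON object keyed by each file's MD5 (the file layout, the field names and the function names here are made up for illustration, not anything e621 or e621dl provides):

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 of a file's bytes, read in blocks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def load_index(index_path: Path) -> dict:
    """Load the hypothetical hash-to-tags index (a JSON object keyed by MD5)."""
    return json.loads(index_path.read_text(encoding="utf-8"))


def tags_for(path: Path, index: dict) -> dict:
    """Look up a local file's tags by its hash; unknown files give an empty dict."""
    return index.get(md5_of_file(path), {})
```

A gallery app could call `tags_for(Path("some_picture.jpg"), index)` and filter on whatever comes back, for example an `"artist"` list. The image files themselves are never modified, so only the index has to be re-downloaded when tags change.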

edit: I learnt the term metadata today. yay.

Additional idea: convert PNG files (without replacing the source file, since the converted file can be stored alongside the original) so that they can be tagged like JPEG files.

Please discuss this. Remember, it's just a random thought ;).

Updated by savageorange

PNGs allow including metadata inside them too, it's just that not many programs support reading/writing that information. However, that's a nice idea

Updated by anonymous

What about md5sums?
Modifying file data in any way will alter the md5sum. This would mean that if you download the file at one time, and then someone modifies the tags of that post, searching for your image on e621 by md5sum will not work.

It also means if you download the same image twice, it probably won't register as a duplicate to a duplicate finder.

For these reasons, I favor the use of tagging solutions that do not modify the files being tagged, like TMSU.
We had a thread about different ways of tagging images a while ago..

EDIT: https://e621.net/forum/show/158390

Updated by anonymous

Probably not too many people share my thoughts, but I don't really like "comments" in metadata, because you end up with different file versions of exactly the same image. What that would mean here on e6 is that the same exact image could be uploaded an unlimited number of times, because the md5 checksum would be different for each copy (unless that is handled in some custom way). MP3s circulating in audio file databases often get information added automatically, and you may end up with 10 files with the exact same audio but different metadata. I just don't like duplicates...

That aside, the JPEG standard has a comment marker, COM, which can be used for text comments. PNG has three text chunk types: iTXt (UTF-8) and tEXt/zTXt (ISO-8859-1). GIF has XMP support. MKV and WebM have COMMENT (UTF-8). SWF could store such information as well, IIRC. However, the support for these in gallery software probably varies greatly.
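To make the PNG case concrete, here is a rough sketch that writes and reads a tEXt chunk using only the standard library. The helper names are invented for this example, and `tiny_png()` just builds a 1x1 test image so nothing has to be loaded from disk; real gallery software would more likely use an imaging library.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    offset = len(PNG_SIGNATURE)
    while offset < len(png):
        (length,) = struct.unpack(">I", png[offset:offset + 4])
        chunk_type = png[offset + 4:offset + 8]
        data = png[offset + 8:offset + 8 + length]
        yield chunk_type, data
        offset += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted right after IHDR."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        out.append(make_chunk(chunk_type, data))
        if chunk_type == b"IHDR":
            out.append(make_chunk(b"tEXt", payload))
    return b"".join(out)


def read_text(png: bytes) -> dict:
    """Collect every tEXt keyword/value pair found in the PNG."""
    found = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found


def tiny_png() -> bytes:
    """Build a valid 1x1 greyscale PNG so the example needs no input file."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x00")  # filter byte + one grey pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", pixels) + make_chunk(b"IEND", b""))


original = tiny_png()
tagged = add_text(original, "Author", "example_artist")
print(read_text(tagged))
print(hashlib.md5(original).hexdigest(), hashlib.md5(tagged).hexdigest())
```

The pixel data is untouched, yet the two digests printed at the end differ, which is exactly the duplicate problem described above.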

Updated by anonymous

But wouldn't the md5 problems be avoided if the server kept an md5 calculated from the picture with all tags stripped? If pics were going to carry e621 metadata, then they would have to be stripped of tags after uploading anyway; then the md5 can be calculated, compared with the others, and stored somewhere. That way the direct link would stay constant even when tags are changed.

Updated by anonymous

Granberia said:
But wouldn't the md5 problems be avoided if the server kept an md5 calculated from the picture with all tags stripped? If pics were going to carry e621 metadata, then they would have to be stripped of tags after uploading anyway; then the md5 can be calculated, compared with the others, and stored somewhere. That way the direct link would stay constant even when tags are changed.

That means that your computer has to treat files from e621 specially somehow (strip the stored tags before md5summing) when calculating md5sums. Which is possible, but it seems like it just displaces the problem onto the user...

(it also has no effect on the 'many duplicates' problem, unless you make the duplicate comparing program also strip metadata for only e621 files.)

Personally, I think this problem cannot be properly solved by any kind of in-file tagging -- as in, it's not mathematically possible. It needs a) filesystem-level support and b) consistent preservation of that filesystem metadata, if you want to attach tags without mutating files.

There are a few filesystems that support storing metadata for files. HFS and Ext4 both do AFAIK. This allows the file's checksum to remain independent of its tags.

However, support for preserving this information, especially between platforms (you have a file X with metadata Y on Mac -> you upload it -> someone else downloads it on Windows -> the downloaded file has metadata Y), is extremely spotty.

That is the long explanation of why I think database-backed designs like TMSU and Lightroom are more practical than metadata stored in the file or next to it (in the filesystem) will be, for the foreseeable future.

Updated by anonymous

savageorange said:
That means that your computer has to treat files from e621 specially somehow (strip the stored tags before md5summing) when calculating md5sums. Which is possible, but it seems like it just displaces the problem onto the user...

(it also has no effect on the 'many duplicates' problem)

Nah, I was thinking about this:
1. User uploads a jpg pic to the server. Let's call it picture1.
2. Server removes all tags from the file - now it's picture2.
3. Server compares the md5 of picture2 with every other pic and checks for dups.
4. Server stores the md5 of picture2 and names the direct link after this md5.
5. After a tag change picture2 becomes picture3, but the stored md5 and direct link don't change.
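The five steps can be sketched roughly as below. This is only a toy model under loud assumptions: `PostStore` is an invented class, storage is an in-memory dict, and `toy_strip` / `toy_embed` stand in for real JPEG metadata handling by appending tags after a made-up marker.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

MARKER = b"\n#tags:"  # made-up separator, NOT a real JPEG structure


def toy_embed(data: bytes, tags: List[str]) -> bytes:
    """Stand-in for writing tags into a file: append them after MARKER."""
    return data + MARKER + ",".join(tags).encode("utf-8")


def toy_strip(data: bytes) -> bytes:
    """Stand-in for removing all metadata: drop everything from MARKER on."""
    return data.split(MARKER, 1)[0]


class PostStore:
    """Toy model of the scheme: strip, hash, deduplicate, keep a stable key."""

    def __init__(self, strip: Callable[[bytes], bytes],
                 embed: Callable[[bytes, List[str]], bytes]) -> None:
        self._strip = strip
        self._embed = embed
        self._files: Dict[str, bytes] = {}     # stripped md5 -> stripped bytes
        self._tags: Dict[str, List[str]] = {}  # stripped md5 -> current tags

    def upload(self, uploaded: bytes, tags: List[str]) -> Tuple[str, bool]:
        """Steps 1-4: strip picture1 into picture2, hash it, check for dups."""
        stripped = self._strip(uploaded)
        key = hashlib.md5(stripped).hexdigest()
        if key in self._files:
            return key, True  # duplicate of an existing post
        self._files[key] = stripped
        self._tags[key] = list(tags)
        return key, False

    def retag(self, key: str, tags: List[str]) -> None:
        """Step 5: tags change, but the stored md5 (the key) does not."""
        self._tags[key] = list(tags)

    def download(self, key: str) -> bytes:
        """Serve picture3: the stripped file with the current tags embedded."""
        return self._embed(self._files[key], self._tags[key])
```

In this model, uploading the same pixels twice is caught even if one copy arrives with leftover tags, and the key survives `retag`. The bytes returned by `download` do not hash to the key, though.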

Currently, user metadata on pics is preserved on e621 (a few pictures I downloaded had some stupid tags in them). After this change only the pure md5 would be kept for comparison, so maybe it would resolve some duplicate problems.

I don't know exactly how metadata works, so correct me if I said something stupid.

Updated by anonymous

Granberia said:
Nah, I was thinking about this:
1. User uploads a jpg pic to the server. Let's call it picture1.
2. Server removes all tags from the file - now it's picture2.
3. Server compares the md5 of picture2 with every other pic and checks for dups.

So, here, if I understand you correctly, you solve the problem of e621 detecting duplicate uploads, that would otherwise be caused by your step 5, successfully.

4. Server stores the md5 of picture2 and names the direct link after this md5.
5. After a tag change picture2 becomes picture3, but the stored md5 and direct link don't change.

Then the stored md5 is wrong -- it is the md5 of picture2, but the user gets picture3, which has a different md5. In the future when they search the md5 of their picture3, they will get no results.

That's how metadata is stored -- typically as part of the file.

Certainly, with current internet protocols, if you want metadata, it has to be included in the file's data (or else you have to add it later)

Updated by anonymous

savageorange said:
So, here, if I understand you correctly, you solve the problem of e621 detecting duplicate uploads successfully.
Note that this has no impact on local duplication problems.

Then the stored md5 is wrong -- it is the md5 of picture2, but the user gets picture3, which has a different md5.

Dunno. Since you didn't specify what the user asking for a file gets, it's unclear. If they get a file with tag information included, then the file they get will change each time a person modifies the post tags.

User gets the newest one.
In my example the user gets picture3, with the file name set to the md5 of picture2, which is stored on the server.

If I understand correctly, OP wants a way to have downloaded images already tagged with e621 tags. That way the image is tagged, it has a constant md5 on the server (at least as far as tag changes are concerned), and checking for dups on the server works, if my assumption (described later) about md5 is correct.

For the purposes of this thread, no detailed understanding of metadata is needed; All you really need to know is:

  • metadata is usually stored as part of the file's data.
  • thus, all changes to metadata stored in this way inherently change the file data
  • Any changes to the file data inherently change its checksum.

This is the reason I say that this particular solution is mathematically impossible. You cannot modify the file while preserving the md5sum.

My reasoning depends on the assumption that if you have an original file with a certain checksum, and you first add tags and then remove those tags, you get a file with the same checksum. So basically, during md5 comparison I want to revert each image in the database to its state without any tags, compare them in that state, and then turn them back to the state with e621 tags. But instead of actually reverting files, I just want to keep the md5 they have when stripped of all tags (and if the filename counts towards the md5, then it can also be set to something constant). If the assumption I wrote earlier is correct, then this results in comparing just the images without any metadata, because the metadata of all files was the same when the md5 was recorded. If the assumption is not correct, then yeah - it's not possible.

EDIT:
I've read your edited posts, and realized that I completely forgot about using md5: to search for a file you have on your PC. Yeah, this will be broken in my suggestion. Still, http://iqdb.harry.lu/ should probably keep working.

Updated by anonymous

Granberia said:
User gets the newest one.
In my example the user gets picture3, with the file name set to the md5 of picture2, which is stored on the server.

Hm, that's one way of sidestepping the md5 incorrectness, I suppose. It depends on the assumption that they will not change the filename.

It's really no different in that case from sticking the post id in the filename.

If I understand correctly, OP wants a way to have downloaded images already tagged with e621 tags. That way the image is tagged, it has a constant md5 on the server (at least as far as tag changes are concerned), and checking for dups on the server works, if my assumption (described later) about md5 is correct.

E6 server would have capacity to compare it correctly under your scheme, yes.

My reasoning depends on the assumption that if you have an original file with a certain checksum, and you first add tags and then remove those tags, you get a file with the same checksum.

This may or may not be true, but code can certainly be written in a way that should ensure it (transform every input file so that all reorderable sections are ordered in one canonical way, and write any output files in conformance with that canon).

So basically, during md5 comparison I want to revert each image in the database to its state without any tags, compare them in that state, and then turn them back to the state with e621 tags. But instead of actually reverting files, I just want to keep the md5 they have when stripped of all tags (and if the filename counts towards the md5, then it can also be set to something constant).

FWIW, only file data affects the md5sum (or other checksums / hashes like sha1, sha256 ..). That is, the literal bytes comprising the file (which do not include the filename, modification time, etc., but do include file-type-specific stuff like the EXIF tags discussed here).
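A quick way to check this for yourself, as a small sketch (the file names and contents below are arbitrary placeholders, not a real image):

```python
import hashlib
import os
import tempfile
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Hex MD5 of the file's bytes; name and timestamps are never read."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def demo() -> tuple:
    """Return the md5 before, after rename + mtime change, and after a byte edit."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "abc123.jpg"
        first.write_bytes(b"\xff\xd8 example image bytes \xff\xd9")
        before = md5_of_file(first)

        # Change the modification time and the file name: the bytes stay the same.
        os.utime(first, (0, 0))
        renamed = Path(tmp) / "renamed.jpg"
        first.rename(renamed)
        after_rename = md5_of_file(renamed)

        # Append a single byte: now the file data itself has changed.
        with renamed.open("ab") as handle:
            handle.write(b"x")
        after_edit = md5_of_file(renamed)
    return before, after_rename, after_edit


before, after_rename, after_edit = demo()
print(before == after_rename)  # name and mtime do not matter
print(before == after_edit)    # any change to the bytes does
```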

If the assumption I wrote earlier is correct, then this results in comparing just the images without any metadata, because the metadata of all files was the same when the md5 was recorded. If the assumption is not correct, then yeah - it's not possible.

Your assumption is correct, I think.
It provides a reasonable guarantee that e6 will detect duplicates properly.

It does break local duplicate detection (since you can't rely at all on the user not renaming the file, and each modification of the image tags on e621 creates a different file download)

It also breaks the semantics of md5sums.
If I have a file I got from e6, on my disk, that I have not changed in any way, and that post exists on e6 still...
then I should be able to take the md5sum of that file -- the -actual- md5sum of that file, not the md5 of the stripped version you are proposing to use as the filename -- search for it on e621, and get a result.
If I cannot do that, then what is being stored by e621 is actually not the md5sum, it's just a serial number or 'access code'.

EDIT: Yes, since iqdb.harry.lu searches based on image pixels not checksum, it would still work fine AFAICS.

I see you acknowledged my last point in your edit.

Well, this is a real sticking point for me. If we don't have md5sums, then we don't have them, okay, whatever, sure. But if we do have them, we need to make sure md5: can work in a sensible way (which, AFAICS, means that tag, source, etc. edits to a post on e6 must not modify the download that a user gets in any way). Otherwise we create confusion about e621, and confusion about md5sums.

Updated by anonymous

savageorange said:
It also breaks the semantics of md5sums.
If I have a file I got from e6, on my disk, that I have not changed in any way, and that post exists on e6 still...
then I should be able to take the md5sum of that file -- the -actual- md5sum of that file, not the md5 of the stripped version you are proposing to use as the filename -- search for it on e621, and get a result.
If I cannot do that, then what is being stored by e621 is actually not the md5sum, it's just a serial number or 'access code'.

Then how about adding an option in the user settings to have the server strip tags from the file before downloading?
If the option is checked, and my assumption is correct, then you get picture2, because picture2 + tags - tags = picture2. You have the file whose md5 is stored on the server, but you don't have tags.
If the option is not checked, you get the pic with tags.
Though I don't know whether it can be implemented efficiently (storing 2 different files for each post sounds bad).

Updated by anonymous

Granberia said:
Then how about adding an option in the user settings to have the server strip tags from the file before downloading?
If the option is checked, and my assumption is correct, then you get picture2, because picture2 + tags - tags = picture2. You have the file whose md5 is stored on the server, but you don't have tags.
If the option is not checked, you get the pic with tags.
Though I don't know whether it can be implemented efficiently (storing 2 different files for each post sounds bad).

I could get behind this, if there was a solution that worked for all filetypes. It lets the user opt in rather than forcing a scheme which most people will not care about.

Storage wise you'd store only the stripped file, and at the time a user requests a download, you would read the file, rewrite it to include tag info, and send it down the pipe. If the download needs to be resumable, you'd probably have to write the file temporarily into a cache area (where it would expire after, say, 3 hours) and the server code would serve that cached file.
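A rough sketch of that serve-time flow, with several assumptions: the cache lives in memory instead of a cache area on disk, `TaggedDownloadCache` is an invented name, and `embed` is whatever hypothetical function writes tags into a given file type.

```python
import time
from typing import Callable, Dict, List, Tuple

CACHE_SECONDS = 3 * 60 * 60  # the "say, 3 hours" expiry


class TaggedDownloadCache:
    """Build tagged copies on demand and keep each one until it expires."""

    def __init__(self, embed: Callable[[bytes, List[str]], bytes],
                 clock: Callable[[], float] = time.monotonic,
                 ttl: float = CACHE_SECONDS) -> None:
        self._embed = embed
        self._clock = clock  # injectable so expiry can be tested
        self._ttl = ttl
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, post_key: str, stripped: bytes, tags: List[str]) -> bytes:
        """Return the cached tagged file, rebuilding it only after expiry."""
        now = self._clock()
        entry = self._entries.get(post_key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        tagged = self._embed(stripped, tags)
        self._entries[post_key] = (now, tagged)
        return tagged
```

The cache is keyed by post rather than by tag list on purpose: a resumed download then keeps receiving the same bytes even if someone edits the tags halfway through, and the new tags only show up once the cached copy has expired.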

Also, I just thought, if we are optionally adulterating images with tags, we have the ability to include special tags like 'md5:' or 'id:' to specify explicitly how to search the original post. Though this still breaks semantics of md5s, it does it explicitly where the user can see it, so it's less bad.

Overall though, is it naive of me to think that solutions like TMSU or Lightroom are best because they avoid this complexity?

Updated by anonymous

savageorange said:
Overall though, is it naive of me to think that solutions like TMSU or Lightroom are best because they avoid this complexity?

OP mentioned using a phone. Are any of these solutions free and available on all mobile platforms?
I was thinking about using TMSU myself, but if I understand correctly there's no virtual filesystem support for Windows right now, which makes it rather useless for browsing images. I use Windows Live Gallery, but I suffer more and more from the lack of an option to exclude tags. It turns out that sometimes I just don't want to see mlp pics.

Updated by anonymous

Granberia said:
OP mentioned using a phone. Are any of these solutions free and available on all mobile platforms?

Probably not. Tagging isn't really a friendly idea for phones IMO.

I was thinking about using TMSU myself, but if I understand correctly there's no virtual filesystem support for Windows right now, which makes it rather useless for browsing images.

Oh?
I never use the VFS, personally -- I just pipe the output of 'tmsu files QUERY TERMS HERE' into my favorite image viewer. (Why yes, I have a terminal open all the time, why do you ask ;)

But yeah.. (checks git log) .. VFS is currently available on Linux and Mac, but not Windows.

I have a tmsu:// protocol implementation, eg clicking on a link to "tmsu://q/rabbit" opens the result of the TMSU query "rabbit" in an image viewer (I made this because I wanted to link to queries in Zim Wiki)

It doesn't require VFS support, but some work is probably needed to get it running on Windows -- mainly, I'm using notify-send to pop up system notifications all over the place, which is a Linux thing. (I also don't particularly know how MIME associations are set up on Windows, which is what causes the link tmsu://q/rabbit to be recognized by its 'tmsu://' prefix and sent to tmsu-schemahandler for interpretation.)
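For anyone curious what such a handler boils down to, here is a hedged sketch. It is not the actual tmsu-schemahandler: the function names are invented, and the way multiple terms are encoded in the URL (percent-encoded spaces) is an assumption. The only thing taken from above is the 'tmsu files QUERY' command and the tmsu://q/... link shape.

```python
import subprocess
from typing import List
from urllib.parse import unquote, urlsplit


def tmsu_command(url: str) -> List[str]:
    """Translate a tmsu://q/<query> link into a 'tmsu files ...' argument list."""
    parts = urlsplit(url)
    if parts.scheme != "tmsu" or parts.netloc != "q":
        raise ValueError(f"not a tmsu query URL: {url!r}")
    # Assumption: several query terms are separated by percent-encoded spaces.
    terms = unquote(parts.path.lstrip("/")).split()
    if not terms:
        raise ValueError("empty query")
    return ["tmsu", "files", *terms]


def matching_files(url: str) -> List[str]:
    """Run the query (needs tmsu installed) and return one path per output line."""
    result = subprocess.run(tmsu_command(url), capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()
```

The list from `matching_files("tmsu://q/rabbit")` could then be handed to an image viewer. Registering the `tmsu://` scheme with the desktop so that links reach this code is the platform-specific part and is not covered here.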

If that sounds interesting to you I'm happy to help getting it going.

I use Windows Live Gallery, but I suffer more and more from the lack of an option to exclude tags. It turns out that sometimes I just don't want to see mlp pics.

Yeah, from everything people say about live gallery, it seems like it's rather inconsistent and hacky in the way it deals with tagging.

Updated by anonymous
