Topic: Twitter sources: Consistent mismatches between direct image links and uploaded images

Since September 14th, I have been working through Twitter-sourced uploads using the search parameters source::large order:score to find and replace inferior posts with the "orig" versions. During that time, I have noticed that at least 80% of the highest-rated uploads were not identical with their direct link sources. My tag history lists all instances of changing these sources to a matching URL with a corresponding reason.

Boolean image comparison using Idem's sourcing scripts found that the uploaded images were visually identical to the "orig" versions and not the "large" versions, despite the source fields claiming to use the large versions. I also compared the byte sizes of these files, which confirmed that the versions uploaded to e621 were indeed identical to the orig Twitter versions despite their sources claiming the contrary. Additionally, checking the source history of the posts confirmed that the direct link URLs had not been modified since the images were uploaded.

That leaves the question: Why were source mismatches so frequent?

Did a statistically significant portion of users somehow manage to acquire the "orig" Twitter versions and then source their posts as "large"? Or did Twitter - on purpose or by mistake - shift the previously large versions of images to the orig URL and compress or modify the image further to generate a new large version?

Given the latter case, I worry that many true "orig" versions of art from Twitter may have been lost forever. Moreover, I fear that there may have been a glitch that swapped image files between the large and orig URLs. The orig versions of images in the same resolution have consistently had a slightly smaller file size, lending credence to the concern, though that may also be the result of extra compression generating additional bytes. In most of these cases, my eyes cannot discern the visual differences enough to determine which Twitter version between the two is technically "superior" beyond the now-questionable authority of the direct link source.

Has anybody else experienced these strange source mix-ups? If so, I would like to determine what the root cause of these mismatches is and whether it has any relevance to our archiving efforts. Any thoughts, information, or credible articles for reference would be appreciated.

Updated by Mairo

The answer you seek is most likely the simplest answer: People already did what it is you're doing, but didn't bother to update the source.

Updated by anonymous

Did you only check that the MD5 and/or visual quality of large uploads matched the orig source?

Because I have no fucking idea how twitter sometimes operates, but there are cases where large and orig are completely identical files. In these cases I do simply update the source to be orig so that it doesn't show up in sample searches.

I have also seen users who download the files to local memory and then manually add the sources, so they might just right click and copy the image source on the page, while the download is orig.

Changing the filetype will be bad news, as you will get stuff like JPG files saved as PNG files, and in a couple of cases the large sample's MD5 has changed over time while orig has stayed the same. That's why in sites and sources it's written to always get the orig version, to avoid all of this headache. I don't think I have seen a case where the orig file has changed its visual fidelity or hash over time though, so from that standpoint I would not be too worried.

Also, like you said yourself, Twitter is pretty compressed no matter what, unless the file is an actual PNG, which is even rarer these days. It is not a good source at all, but it's still somehow the best of the possible choices, or the only one. When there's enough compression, determining which version is the more compressed one gets really hard and requires some knowledge of what to look for, and the increase in fidelity is so small that most do not bother working with these.

Updated by anonymous

Strongbird said:
Or did Twitter shift the previously large versions of images to the orig URL and compress or modify the image further to generate a new large version?

I always assumed that for small enough images, :large was identical to :orig. Twitter recently changed the image URLs, so you may be right that many originals are now inaccessible. It's Tumblr all over again.

Updated by anonymous

leomole said:
I always assumed that for small enough images, :large was identical to :orig. Twitter recently changed the image URLs, so you may be right that many originals are now inaccessible. It's Tumblr all over again.

Large is sometimes the same as orig, but not always, and there doesn't appear to be any consistent rule for when this is or isn't the case.
Just. Always. Get. Orig.

Also, the new format adds stuff like more samples (4096x4096, 900x900) and formats (webp), but other than that it operates pretty much identically to the legacy format (e.g. .jpg:large becomes ?format=jpg&name=large, .jpg:orig becomes ?format=jpg&name=orig, etc.).
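For anyone fixing sources in bulk, the legacy-to-modern mapping described above can be sketched in a few lines of Python. This is only a minimal sketch: the function names and the example URL (including its host) are made up for illustration, and it assumes the legacy URL always ends in an extension followed by a colon and a size name.

```python
import re

# Legacy style: <base>.<ext>:<size>, for example ".jpg:large".
_LEGACY = re.compile(r"^(?P<base>.+)\.(?P<ext>[A-Za-z0-9]+):(?P<size>[A-Za-z0-9]+)$")


def legacy_to_modern(url, size=None):
    """Rewrite a legacy-style image URL into the ?format=...&name=... style.

    If size is given, it replaces the size found in the URL.
    """
    match = _LEGACY.match(url)
    if match is None:
        raise ValueError("not a legacy-style image URL: " + url)
    return "{}?format={}&name={}".format(
        match["base"], match["ext"], size or match["size"]
    )


def to_orig(url):
    """Same rewrite, but always point at the orig version."""
    return legacy_to_modern(url, size="orig")


# Hypothetical example URL, used only to show the rewrite.
example = "https://pbs.twimg.com/media/EXAMPLE.jpg:large"
print(legacy_to_modern(example))
print(to_orig(example))
```

This only rewrites the URL text. It says nothing about whether the file behind each URL is the same, which still has to be checked per post.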

One thing I did notice with post #1917518 is that the MD5 of legacy large, legacy orig and modern orig are all the same, 4DED6E0EE8390F98AFA0360177FAADC7, however modern large is different, 3BBAB950ECD9B17D35728A9553C3E286, and also has a visual downgrade. So it could be that with the modern format, the large sample is never good, when earlier it was hit or miss to begin with.
https://puu.sh/ErcTy/b875697dc3.png

Updated by anonymous

Anonomn said:
The answer you seek is most likely the simplest answer: People already did what it is you're doing, but didn't bother to update the source.

I verified manually that none of the posts in question were reuploads prior to creating this thread, so the simplest answer is ruled out.

Mairo said:
Did you only check that the MD5 and/or visual quality of large uploads matched the orig source?

I manually looked over each image, used Boolean image comparison from Idem's sourcing scripts, and directly compared byte "Size" and "Size on disk" fields. I did *not* cross-reference MD5 hashes, although I could and should have as an additional measure.
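For reference, the MD5 cross-check mentioned above could look something like the following minimal sketch. It assumes both files are already saved locally; the function names and the file names in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return (size in bytes, MD5 hex digest) for a local file."""
    data = Path(path).read_bytes()
    return len(data), hashlib.md5(data).hexdigest()


def same_file(path_a, path_b):
    """True when both files have the same byte size and the same MD5."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)


# Usage (hypothetical file names):
#   same_file("e621_upload.jpg", "twitter_orig.jpg")
#   same_file("e621_upload.jpg", "twitter_large.jpg")
```

Matching byte sizes alone only show that two files are the same length, so the hash is the stronger check of the two.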

Using Idem's tool, I visually cross-checked the following permutations (using old format since I didn't receive the Twitter update yet):
e621 upload - Twitter :large
e621 upload - Twitter :orig
Twitter :large - Twitter :orig

In 80% or more of these cases, the "e621 upload" and "orig" versions matched, while the "large" version matched neither visually nor in byte size, despite the source having listed "large" since upload.

Mairo said:
I have also seen users who download the files to local memory and then manually add the sources, so they might just right click and copy image source on the page, while download is orig.

That's a possibility, but I am skeptical that a significant majority are taking that approach. Most people I know right-click and save the image to storage directly (HDD or SSD).

Mairo said:
One thing I did notice with post #1917518 is that the MD5 of legacy large, legacy orig and modern orig are all the same, 4DED6E0EE8390F98AFA0360177FAADC7, however modern large is different, 3BBAB950ECD9B17D35728A9553C3E286, and also has a visual downgrade. So it could be that with the modern format, the large sample is never good, when earlier it was hit or miss to begin with.
https://puu.sh/ErcTy/b875697dc3.png

Thank you for the additional sleuthing and comparison with labels. However, using image_comparison, Idem's tools, and checksum testing (on old format Twitter), I encountered results that don't match the MD5 differences you listed for post #1917518.

Checksums:

4ded6e0ee8390f98afa0360177faadc7 - e621 upload
3bbab950ecd9b17d35728a9553c3e286 - legacy large
4ded6e0ee8390f98afa0360177faadc7 - legacy orig
3bbab950ecd9b17d35728a9553c3e286 - new large
4ded6e0ee8390f98afa0360177faadc7 - new orig

Visual comparison:
e621 upload v. legacy large
e621 upload v. legacy orig
e621 upload v. new large
e621 upload v. new orig
legacy large v. new large
legacy orig v. new orig

File size comparison:
e621 upload - legacy large - legacy orig
new large - new orig

By these metrics:
- The e621 upload, legacy orig, and new orig versions have identical checksums
- The legacy large and new large versions have identical checksums
- The e621 upload is visually identical to the legacy and new "orig" versions
- The e621 upload shows the exact same visual differences against both the legacy and new "large" versions
- The legacy and new "large" versions are identical visually and in byte size
- The legacy and new "orig" versions are identical visually and in byte size
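The checksum part of this comparison can be reproduced with a short Python sketch that groups the labels by hash. The checksum values are the ones listed above; the dictionary layout and function name are just assumptions for illustration.

```python
from collections import defaultdict

# Checksums as listed above for post #1917518.
checksums = {
    "e621 upload": "4ded6e0ee8390f98afa0360177faadc7",
    "legacy large": "3bbab950ecd9b17d35728a9553c3e286",
    "legacy orig": "4ded6e0ee8390f98afa0360177faadc7",
    "new large": "3bbab950ecd9b17d35728a9553c3e286",
    "new orig": "4ded6e0ee8390f98afa0360177faadc7",
}


def group_by_checksum(labelled):
    """Map each checksum to the list of labels that share it."""
    groups = defaultdict(list)
    for label, digest in labelled.items():
        # Lowercase so upper- and lowercase hex digests compare equal.
        groups[digest.lower()].append(label)
    return dict(groups)


groups = group_by_checksum(checksums)
for digest, labels in groups.items():
    print(digest, "-", ", ".join(labels))
```

With the values above, this yields two groups: the upload together with both orig versions, and the two large versions on their own.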

Why the old format's large version is identical to the orig versions for you, I do not know. However, it's clear that there's something going awry here or being changed based on region - possibly a caching issue either locally or with the nearest Twitter server.

Important edit

A programmer friend suggested that people may have been acquiring the "orig" versions of images through "large" URLs due to caching issues.

If so, then this weird behavior boils down to a server fault on Twitter's end, rather than a concrete bug or later extra compression of existing image uploads. The most likely reason these uploads were marked with "large" direct links is that users downloaded the "orig" version's data from "large" direct links without realizing it. It'd be amusing if a flaw on Twitter's end helped those users avoid a BVAS, but it's also a sensible explanation.

Updated by anonymous

Strongbird said:

Important edit

A programmer friend suggested that people may have been acquiring the "orig" versions of images through "large" URLs due to caching issues.

If so, then this weird behavior boils down to a server fault on Twitter's end, rather than a concrete bug or later extra compression of existing image uploads. The most likely reason these uploads were marked with "large" direct links is that users downloaded the "orig" version's data from "large" direct links without realizing it. It'd be amusing if a flaw on Twitter's end helped those users avoid a BVAS, but it's also a sensible explanation.

Considering that the behavior has seemed fully random, this would make the most sense.
However, there is still the matter of going through all these large uploads, checking that they actually have the same file that orig has, and changing the source to match, which is still more work, so making sure users upload orig is still the most crucial point of all.

Updated by anonymous
