Since September 14th, I have been working through Twitter-sourced uploads using the search parameters source::large order:score to find and replace inferior posts with the "orig" versions. During that time, I have noticed that at least 80% of the highest-rated uploads were not identical to their direct link sources. My tag history lists every instance where I changed one of these sources to a matching URL, along with a corresponding reason.
Boolean image comparison using Idem's sourcing scripts found that the uploaded images were visually identical to the "orig" versions and not the "large" versions, even though their source fields point to the large versions. I also compared the byte sizes of these files, which confirmed that the versions uploaded to e621 were indeed identical to the orig Twitter versions despite their sourcing claiming the contrary. Additionally, checking the source history of these posts confirmed that the direct link URLs had not been modified since the images were uploaded.
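For anyone who wants to repeat the byte-level part of this check, here is a minimal sketch in Python. It is not Idem's script, and the function name classify_upload and the sample byte strings are hypothetical stand-ins for the downloaded files. It only tests for byte-identical files, which is stricter than the visual comparison described above.

```python
import hashlib
from typing import Dict, List


def classify_upload(uploaded: bytes, candidates: Dict[str, bytes]) -> List[str]:
    """Return the names of the candidate versions that are byte-identical to the upload."""
    upload_digest = hashlib.sha256(uploaded).digest()
    matches = []
    for name, data in candidates.items():
        # Cheap size check first; only hash when the byte sizes agree.
        if len(data) == len(uploaded) and hashlib.sha256(data).digest() == upload_digest:
            matches.append(name)
    return matches


# Stand-in byte strings; in practice these would be the contents of the
# e621 file and the two Twitter downloads (assumption for illustration).
upload = b"example image bytes"
versions = {
    "orig": b"example image bytes",
    "large": b"example image bytes, recompressed",
}
print(classify_upload(upload, versions))  # prints ['orig']
```

An empty result would mean the upload matches neither version byte for byte, in which case a visual comparison is still needed.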
That leaves the question: Why were source mismatches so frequent?
Did a statistically significant portion of users somehow manage to acquire the "orig" Twitter versions and then source their posts as "large"? Or did Twitter - on purpose or by mistake - shift the previously large versions of images to the orig URL and compress or modify the image further to generate a new large version?
If the latter is the case, I worry that many true "orig" versions of art from Twitter may have been lost forever. Moreover, I fear that there may have been a glitch that swapped image files between the large and orig URLs. At the same resolution, the orig versions have consistently had a slightly smaller file size than the large versions, which lends credence to that concern, though the difference may instead be the result of extra compression adding bytes to the large versions. In most of these cases, I cannot see enough of a visual difference to determine which of the two Twitter versions is technically "superior", beyond the now-questionable authority of the direct link source.
Has anybody else experienced these strange source mix-ups? If so, I would like to determine what the root cause of these mismatches is and whether they have any relevance to our archiving efforts. Any thoughts, information, or credible articles for reference would be appreciated.
Updated by Mairo