Topic: [Feature] Include post edits in db_export

Posted under Site Bug Reports & Feature Requests

I imagine there'll be very little interest in this request because of its niche usage, but I want to be able to analyse tag edit data for various purposes such as making graphs and possibly even identifying tag vandalism.

The current method of using post_versions.json seems to return at most 75 results per request, meaning I'd have to make 2113 API requests to get my full tag edit history (it actually errors on page 134 and above), which is horribly inefficient and time-consuming.

A database dump of post edits at https://e621.net/db_export/ would allow me to process data from all tag edits rather than having to spend hours requesting them via the API.
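For context, this is roughly the loop I'm stuck with right now. A minimal sketch in Python, assuming the documented limit/page parameters on post_versions.json and a plain JSON array response; the username and output file are placeholders:

```python
import json
import time

import requests

# Placeholders: substitute your own username and a descriptive User-Agent.
USERNAME = "example_user"
HEADERS = {"User-Agent": "tag-edit-analysis/0.1 (by example_user)"}

def fetch_all_versions(username):
    """Page through post_versions.json 75 results at a time."""
    versions = []
    page = 1
    while True:
        resp = requests.get(
            "https://e621.net/post_versions.json",
            params={"search[updater_name]": username, "limit": 75, "page": page},
            headers=HEADERS,
        )
        resp.raise_for_status()  # pages past a certain point currently error out
        batch = resp.json()
        if not batch:
            break
        versions.extend(batch)
        page += 1
        time.sleep(1)  # stay well under the API rate limit
    return versions

if __name__ == "__main__":
    edits = fetch_all_versions(USERNAME)
    with open("my_post_versions.json", "w") as f:
        json.dump(edits, f)
```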

I will have to look into this. Current estimates suggest that the compressed size of the table dump is probably in excess of 8GB, as the raw table is more than 17GB in size. The folder that the current dumps live in is size limited, and I will have to investigate whether it can hold several days' worth of this table's backups.

In addition, this table takes a very long time to dump and compress, and I'm not sure it's worth it for such a niche use case. A one-off dump of this might be possible.

kiranoot said:
I will have to look into this. Current estimates suggest that the compressed size of the table dump is probably in excess of 8GB, as the raw table is more than 17GB in size. The folder that the current dumps live in is size limited, and I will have to investigate whether it can hold several days' worth of this table's backups.

In addition, this table takes a very long time to dump and compress, and I'm not sure it's worth it for such a niche use case. A one-off dump of this might be possible.

Thanks for the quick response. It definitely does seem like overkill to do a daily dump for such a niche use case; perhaps weekly or monthly would be better. For now, though, I think a one-off dump would be good enough.

Alternatively, could the post_versions.json API be allowed to return more than 75 results? Most users only have a few thousand edits, so it shouldn't take too many requests with an increased limit.


kiranoot said:
I will have to look into this. Current estimates suggest that the compressed size of the table dump is probably in excess of 8GB, as the raw table is more than 17GB in size. The folder that the current dumps live in is size limited, and I will have to investigate whether it can hold several days' worth of this table's backups.

In addition, this table takes a very long time to dump and compress, and I'm not sure it's worth it for such a niche use case. A one-off dump of this might be possible.

How does it manage to be more than 10 times larger than the posts themselves? Are you storing every tag in every edit, even unchanged ones, or something?

wat8548 said:
How does it manage to be more than 10 times larger than the posts themselves? Are you storing every tag in every edit, even unchanged ones, or something?

More or less. The system isn't as "smart" as people probably think it is. In order to avoid fetching posts and trying to resolve diffs, it stores a copy of the final tags and computes differences based on previous versions. Newer versions also materialize a list of diffs alongside the final tags in order to save some time. Storage concerns for this table aren't high, and it doesn't get used directly for search, which makes it pretty easy to deal with in the long run.
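Conceptually, turning those snapshots into a diff is just a set difference between consecutive versions of the same post. A rough sketch, assuming each version carries the post's full whitespace-separated tag string as described:

```python
def tag_diff(previous_tags: str, current_tags: str):
    """Compute (added, removed) tags between two whitespace-separated tag snapshots."""
    prev = set(previous_tags.split())
    curr = set(current_tags.split())
    return sorted(curr - prev), sorted(prev - curr)

# Example: this edit adds "canine" and removes "feline".
added, removed = tag_diff("feline mammal sitting", "canine mammal sitting")
print(added)    # ['canine']
print(removed)  # ['feline']
```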

faucet said:
Alternatively, could the post_versions.json API be allowed to return more than 75 results?

To anyone coming across this later: most if not all routes can return up to 320 results in one request.
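A rough sketch of using that higher limit with post_versions.json, assuming it honours the same limit parameter and the before-id (page=b&lt;id&gt;) cursor that other endpoints use; the username and User-Agent are placeholders:

```python
import requests

HEADERS = {"User-Agent": "tag-edit-analysis/0.1 (by example_user)"}

def fetch_versions(username):
    """Fetch a user's post versions 320 at a time, paging backwards by id."""
    versions, before_id = [], None
    while True:
        params = {"search[updater_name]": username, "limit": 320}
        if before_id is not None:
            params["page"] = f"b{before_id}"  # cursor: only results with id < before_id
        batch = requests.get(
            "https://e621.net/post_versions.json", params=params, headers=HEADERS
        ).json()
        if not batch:
            break
        versions.extend(batch)
        before_id = min(v["id"] for v in batch)
    return versions
```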
