Topic: How should I go about updating the metadata of lots of content?

Posted under e621 Tools and Applications

Hi,

I have built a tool that archives lots of content from e621 to a local server using the API. Getting this content within the API rate limits is easy, as I can get the required metadata (tags, parent/children, description, hashes, file types, etc.) from a single call to https://e621.net/posts.json
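For context, the fetch loop is roughly the sketch below (Python with the requests library; the User-Agent string and search tags are placeholders, not what my tool actually uses):

```python
# Rough sketch of the archiving loop: page through a search on posts.json,
# yielding the metadata for each post. Names here are placeholders.
import time
import requests

BASE_URL = "https://e621.net/posts.json"
HEADERS = {"User-Agent": "my-archiver/1.0 (by user_873940 on e621)"}  # placeholder UA

def fetch_search(tags, limit=320):
    """Yield post metadata for a search, paging backwards by post id."""
    before_id = None
    while True:
        params = {"tags": tags, "limit": limit}
        if before_id is not None:
            params["page"] = f"b{before_id}"  # "posts with id below before_id"
        resp = requests.get(BASE_URL, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        posts = resp.json()["posts"]
        if not posts:
            break
        for post in posts:
            yield post  # tags, relationships, description, file hashes/type, etc.
        before_id = min(post["id"] for post in posts)
        time.sleep(1.0)  # stay comfortably within the rate limit
```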

This has been really successful; however, I now have over 80K records, and I know that a significant amount of this data will have changed (deletions from e621, tag changes, description changes, parents and children added/removed).

What would be the most efficient way to update this data while staying within the rate limits? Making over 80K API calls at 1 per second to keep within the limits is obviously not realistic. It is not possible for me to retrieve all the data in one big search (I don't think so, anyway), as the data came from multiple different searches. I still have a record of what these searches were, so I could theoretically just run the full searches again and update the data, but I want to start adding individual records outside of searches too, so this method will not work long term.

Thanks!

I was going to suggest querying the post changes list, perhaps filtering by "final tags" or whatever your original queries were, but a) I'm not sure if that would be any more efficient in practice and b) it doesn't appear to have an API endpoint anyway.

Not using the API would be the fastest way to fully update the metadata on your collection.

https://e621.net/db_export/

I have not finalized these, so the format, fields, and how frequently they are produced, as well as the history length, are not guaranteed at this time.

You will only need the latest version of the posts file to complete your task.

Pup

Privileged

wat8548 said:
I was going to suggest querying the post changes list, perhaps filtering by "final tags" or whatever your original queries were, but a) I'm not sure if that would be any more efficient in practice and b) it doesn't appear to have an API endpoint anyway.

You can do something similar with order:change. For a program to go through all of them, you could do order:change_asc change:>X, with X being the highest change value from the last page you checked.
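Something like this (Python sketch; I'm assuming the change value shows up as change_seq in the post JSON, so double-check against a real response):

```python
# Sketch of walking every post in ascending change order, resuming from the
# highest change value seen so far. Assumes the JSON exposes it as "change_seq".
import time
import requests

HEADERS = {"User-Agent": "my-archiver/1.0 (by user_873940 on e621)"}  # placeholder UA

def walk_changes(last_change):
    """Yield posts whose change value is greater than last_change, oldest change first."""
    while True:
        params = {"tags": f"order:change_asc change:>{last_change}", "limit": 320}
        resp = requests.get("https://e621.net/posts.json",
                            params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        posts = resp.json()["posts"]
        if not posts:
            break
        for post in posts:
            yield post
        last_change = max(post["change_seq"] for post in posts)  # highest value on this page
        time.sleep(1.0)
```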

user_873940 said:
What would be the most efficient way to update this data while staying within the rate limits? Making over 80K API calls at 1 per second to keep within the limits is obviously not realistic.

It's not efficient, but an SQLite DB with the metadata of every post takes up about 2GB, so you could keep a full copy up to date using the change value.

Alternatively, and a lot more efficiently: if you voted up every one of those 80k posts (using the API to make it faster), you could then use votedup:me id:<X to page through those posts and keep them updated. At 320 posts per page, 80k posts would take 250 API calls, or about 5 minutes to update everything. For extra efficiency you could also use the change value to only update the posts that have changed since you last checked, something like votedup:me order:change_asc change:>X.

(Edit: Kira's method is better; I started typing my reply before they posted theirs.)
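If anyone wants to try the votedup:me route anyway, the bulk upvote step could look something like the sketch below. This is only a sketch: I believe the vote endpoint is POST /posts/<id>/votes.json with a score parameter and API-key basic auth, but check the API docs before relying on it. The update loop afterwards is the same change_seq walk as above, with votedup:me added to the tags.

```python
# Hedged sketch of bulk-upvoting an archived collection so that votedup:me
# searches can later select it. Endpoint, parameters, and credentials are assumptions.
import time
import requests

HEADERS = {"User-Agent": "my-archiver/1.0 (by user_873940 on e621)"}  # placeholder UA
AUTH = ("your_username", "your_api_key")                              # placeholder credentials

def upvote_all(post_ids):
    """Upvote every archived post; votedup:me searches will then cover the whole set."""
    for post_id in post_ids:
        resp = requests.post(f"https://e621.net/posts/{post_id}/votes.json",
                             data={"score": 1, "no_unvote": "true"},
                             headers=HEADERS, auth=AUTH, timeout=30)
        resp.raise_for_status()
        time.sleep(1.0)  # one request per second; slow for 80k posts, but it's a one-off
```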


Pup

Privileged

kiranoot said:
Not using the API would be the fastest way to fully update the metadata on your collection.

https://e621.net/db_export/

I have not finalized these, so the format, fields, and how frequently they are produced, as well as the history length, are not guaranteed at this time.

You will only need the latest version of the posts file to complete your task.

O_O Nice! I didn't realise that was a thing.

kiranoot said:
Not using the API would be the fastest way to fully update the metadata on your collection.

https://e621.net/db_export/

I have not finalized these, so the format, fields, and how frequently they are produced, as well as the history length, are not guaranteed at this time.

You will only need the latest version of the posts file to complete your task.

Absolute legend. This solved so many problems! Thank you so much!

Hi there,

I just got around to looking at this in more detail and have a few more questions.

Does the latest copy of the posts CSV contain everything from forever? This would allow me to make historical searches too.

What would the limit on downloading this be? My idea would be to download it every day for the latest records, run checks against it and my own database automatically, and then use the API if I need to query anything newer than the database copy.

Thanks!

user_873940 said:
Does the latest copy of the posts CSV contain everything from forever? This would allow me to make historical searches too.

It contains the information for every post currently on the site, which looks like this: https://i.imgur.com/RMYG8Lp.png and https://i.imgur.com/xlSTS90.png

user_873940 said:
What would the limit on downloading this be? My idea would be to download it every day for the latest records, run checks against it and my own database automatically, and then use the API if I need to query anything newer than the database copy.

It only gets updated once a day, and downloading it once a day from https://e621.net/db_export/ won't incur any rate limits.
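If it helps, the daily sync could look roughly like this (Python sketch; the posts-<date>.csv.gz naming and the columns are based on the current directory listing, and as Kira said above, none of that is guaranteed to stay the same):

```python
# Hedged sketch of the daily sync: fetch the newest posts export from db_export,
# then walk its rows and compare them against the local database. File naming and
# columns are assumptions based on the current export and may change.
import csv
import gzip
import io
from datetime import date, timedelta

import requests

HEADERS = {"User-Agent": "my-archiver/1.0 (by user_873940 on e621)"}  # placeholder UA
csv.field_size_limit(1 << 24)  # descriptions and tag strings can be long

def download_latest_posts_export():
    """Try today's export first, then fall back a couple of days if it isn't up yet."""
    for offset in range(3):
        day = date.today() - timedelta(days=offset)
        url = f"https://e621.net/db_export/posts-{day.isoformat()}.csv.gz"
        resp = requests.get(url, headers=HEADERS, timeout=300)
        if resp.status_code == 200:
            # Fine for a sketch; stream the download/decompression to save memory.
            return gzip.decompress(resp.content)
    raise RuntimeError("no recent posts export found")

def iter_export_rows(raw_csv: bytes):
    """Yield each post row as a dict keyed by the CSV header row."""
    reader = csv.DictReader(io.StringIO(raw_csv.decode("utf-8")))
    for row in reader:
        yield row  # e.g. compare row["id"], tags, rating, etc. against the local copy
```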


lightsbane said:
It contains the information for every post currently on the site, which looks like this: https://i.imgur.com/RMYG8Lp.png and https://i.imgur.com/xlSTS90.png

It only gets updated once a day, and downloading it once a day from https://e621.net/db_export/ won't incur any rate limits.

Thanks for the info! The "every post" part was my main question, and although I know no rate limits will be incurred, my main reason for asking was more of a "will they be later" sort of thing, since I would like to base part of my application on the idea that the database will be updated daily, and use the API as a live update should that be desired.

user_873940 said:
Thanks for the info! The "every post" part was my main question, and although I know no rate limits will be incurred, my main reason for asking was more of a "will they be later" sort of thing, since I would like to base part of my application on the idea that the database will be updated daily, and use the API as a live update should that be desired.

It seems improbable, as it's just a static file (although for other static files it's been asked to keep requests below 5 per second; I'm not sure how strictly that's enforced). As far as the API goes, there is an enforced rate limit of one request per 500 ms.
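On the API side, a small client-side throttle is enough to stay on the safe side of that. Standard-library sketch; the 1-second interval is just the conservative choice:

```python
# Minimal client-side throttle: enforce a minimum interval between API requests
# (500 ms is the hard limit; 1 s is the conservative choice for sustained use).
import time

class Throttle:
    def __init__(self, interval=1.0):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        """Sleep just long enough so calls are at least `interval` seconds apart."""
        now = time.monotonic()
        remaining = self.interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: throttle = Throttle(1.0); call throttle.wait() before each request.
```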


lightsbane said:
It seems improbable, as it's just a static file (although for other static files it's been asked to keep requests below 5 per second; I'm not sure how strictly that's enforced). As far as the API goes, there is an enforced rate limit of one request per 500 ms.

Huh, has that changed? It was 1 second before.

Edit: 1 second is recommended over prolonged periods, which is what I'm doing. That makes sense.
