Topic: e621 API Mirror (Alternative API)

Posted under e621 Tools and Applications

If there were an alternative API, one with no rate limit and much faster responses, with the caveat that it could be up to a day behind the regular e621 API (as most mirrors are), would you use it?

Please respond with why or why not, or what it would need to be able to do to make up for the delay in data.

NOTE: This would only apply to public data, but would include user favorites for those who have chosen to make theirs public.


Having a rate limit would still be a good thing for you, even if it's lower than that of e621.

I presume that you intend to get data from the db_export, which is not a terrible idea.
Being able to bypass the limitations of the native API comes in very handy.

I made an API for tags using that data in order to power my Advanced Tag Search
https://esix.bitwolfy.com/tags/
It lets you fetch alias and implication data in a single request, and use regular expressions in searches (e.g. finding tags that start with the word 'wolf').
But the performance is often... inconsistent, I am afraid. Especially with regular expressions.

Having an alternative API for the posts would also be useful, but that is not something I have personally gotten to yet.
For example, being able to get an (approximate) number of posts in an arbitrary search is something that I would really like to see. Or being able to easily get image URLs from a list of post IDs – that would be fantastic.

bitwolfy said:
arbitrary search

Yes, I am getting most of the data from db_export. However, I've worked on rebuilding the data into a format that avoids quasilinear operations, which is why I said there's "no rate limit": I can handle a significant number of requests without worrying about overloading my 'db'. There is a failsafe queue function in the event of an overload, and another to prevent "spam" requests. I'll go into those in detail upon release, as they may change during development.

Thankfully, performance has been quite consistent in initial testing, and I will likely hold a closed beta to verify that it stays consistent under load.

As far as "being able to get an (approximate) number of posts in an arbitrary search" goes, would you mind providing some examples of what you would ideally like this to look like?


lightsbane said:
As far as "being able to get an (approximate) number of posts in an arbitrary search" goes, would you mind providing some examples of what you would ideally like this to look like?

Nothing complicated. Simply being able to send a request to something like /count.json?tags=male+solo+horse, and get the number of posts with those tags in return.

I have two specific use cases for this.

1. My userscript / extension overhauls the /posts page output.
Among other things, it displays an approximate number of posts in a specific search. For example, searching for elder_scrolls solo male shows that there are approximately 800 results: https://i.imgur.com/RztblzI.png
That number is obtained by multiplying the number of pages (from the pagination) by the user's preferred number of posts per page. However, the result is not exact unless the user is on the last page of the search. I would prefer to output an actual post count, even if it is a little bit off.

2. I recently made (but have not yet released) a utility that lets users download all posts from an arbitrary search. I would like to show an approximate number of posts that will be downloaded beforehand.
Unfortunately, unlike with the first example, this utility is hosted on an external website. It is not possible to get the total number of result pages through the API without iterating through all of them first.
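As a rough illustration of why the pages-times-page-size estimate in the first use case is only approximate, the sketch below computes the range of post counts that a given page count is consistent with. The function name and the sample numbers are hypothetical, and it assumes every page except the last is full.

```python
def post_count_bounds(total_pages: int, posts_per_page: int) -> tuple[int, int]:
    """Return (lowest, highest) post count consistent with a page count.

    Assumes every page except the last one is completely full.
    """
    if total_pages <= 0:
        return (0, 0)
    # The last page holds at least one post...
    lowest = (total_pages - 1) * posts_per_page + 1
    # ...and at most a full page. This upper bound is the pages * per-page estimate.
    highest = total_pages * posts_per_page
    return (lowest, highest)


low, high = post_count_bounds(total_pages=123, posts_per_page=100)
print(low, high)
```

The estimate can therefore overshoot by almost a full page, which is why a real count endpoint would be more useful than the pagination trick.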

bitwolfy said:
Nothing complicated. Simply being able to send a request to something like /count.json?tags=male+solo+horse, and get the number of posts with those tags in return.

I'll definitely work on a solution for this. At the moment, running that query ('male+solo+horse') takes ~18 seconds and results in 61410. However, I have a rough idea of how to decrease the processing time.
The second one ('elder_scrolls+solo+male') took ~19 seconds and resulted in 4268, in case you were wondering.

As far as "Or being able to easily get image URLs from a list of post IDs" goes, I've already implemented and tested this, although there may be a hard limit on how many IDs you can specify in a single query (around 100, though this may change after load testing).

lightsbane said:
I'll definitely work on a solution for this. At the moment, running that query ('male+solo+horse') takes ~18 seconds and results in 61410. However, I have a rough idea of how to decrease the processing time.
The second one ('elder_scrolls+solo+male') took ~19 seconds and resulted in 4268, in case you were wondering.

That's strange.
Even with deleted posts (which I'm guessing your API counts by default), elder_scrolls solo male status:any has exactly 844 posts.
Meanwhile, male solo horse status:any has 12292 posts. You can check that for yourself – the latter has 123 pages of 100 posts, for example.

Are you sure that you don't have an error in there somewhere?

lightsbane said:
As far as "Or being able to easily get image URLs from a list of post IDs" goes, I've already implemented and tested this, although there may be a hard limit on how many IDs you can specify in a single query (around 100, though this may change after load testing).

Well, e621 API's limit is 100 as well ¯\_(ツ)_/¯

bitwolfy said:
That's strange.
Even with deleted posts (which I'm guessing your API counts by default), elder_scrolls solo male status:any has exactly 844 posts.
Meanwhile, male solo horse status:any has 12292 posts. You can check that for yourself – the latter has 123 pages of 100 posts, for example.

Are you sure that you don't have an error in there somewhere?

I'm glad you brought this up. The query is just matching values in tag_string. My regex statement didn't account for the tag needing to be surrounded by whitespace (I blame that oversight on vodka).
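A minimal sketch of the kind of bug described here, assuming tag_string is a single space-separated string of tags. The function names and sample data are made up for illustration, and this is not the actual query used by the mirror.

```python
import re


def has_tag_substring(tag_string: str, tag: str) -> bool:
    """Buggy version: matches the tag anywhere, even inside another tag."""
    return re.search(re.escape(tag), tag_string) is not None


def has_tag(tag_string: str, tag: str) -> bool:
    """Whole-tag version: the match must be bounded by whitespace or string edges."""
    # (?<!\S) = not preceded by a non-space character,
    # (?!\S)  = not followed by a non-space character.
    pattern = r"(?<!\S)" + re.escape(tag) + r"(?!\S)"
    return re.search(pattern, tag_string) is not None


sample = "female seahorse solo_focus"
print(has_tag_substring(sample, "horse"))  # True: "horse" is found inside "seahorse"
print(has_tag(sample, "horse"))            # False: "horse" is not a tag of its own
print(has_tag(sample, "solo"))             # False: "solo_focus" is a different tag
```

Substring matches like these inflate counts, which would be consistent with the overly large numbers reported earlier in the thread.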

bitwolfy said:
Well, e621 API's limit is 100 as well ¯\_(ツ)_/¯

Some preliminary tests show I could raise the limit to as much as 500. However, I still need to do more load testing. It's a good sign though.

Depending on how you have this set up, you should look into a GIN index in PostgreSQL with a custom parser that doesn't modify words, or a keyword full-text index in MySQL. Whatever you use, an inverted index will make your searches lightning fast. We're using Elasticsearch on our end, but due to licensing changes I cannot suggest using it right now. If your searches take longer than a few milliseconds, they are going to quickly kill your server under any load.
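To illustrate the general idea of an inverted index, here is a toy in-memory sketch. It is not how PostgreSQL GIN indexes, MySQL full-text indexes, or Elasticsearch are implemented or configured, and the function names and sample posts are hypothetical. Each tag maps to the set of post IDs that carry it, so a multi-tag count becomes a set intersection instead of a scan over every tag_string.

```python
from collections import defaultdict


def build_index(posts: dict[int, str]) -> dict[str, set[int]]:
    """Map each tag to the set of post IDs whose tag string contains it."""
    index: dict[str, set[int]] = defaultdict(set)
    for post_id, tag_string in posts.items():
        for tag in tag_string.split():
            index[tag].add(post_id)
    return index


def count_posts(index: dict[str, set[int]], query_tags: list[str]) -> int:
    """Count posts that carry every tag in query_tags."""
    id_sets = [index.get(tag, set()) for tag in query_tags]
    if not id_sets:
        return 0
    # Start from the rarest tag so the intermediate result stays small.
    id_sets.sort(key=len)
    result = set(id_sets[0])
    for ids in id_sets[1:]:
        result &= ids
        if not result:
            break
    return len(result)


posts = {
    1: "male solo horse",
    2: "female seahorse solo",
    3: "male horse duo",
    4: "male solo horse outside",
}
index = build_index(posts)
print(count_posts(index, ["male", "solo", "horse"]))  # 2 (posts 1 and 4)
```

Because whole tags are the index keys, a lookup for horse never touches seahorse, so this layout also avoids substring false positives.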

kiranoot said:
Depending on how you have this set up, you should look into a GIN index in PostgreSQL with a custom parser that doesn't modify words, or a keyword full-text index in MySQL. Whatever you use, an inverted index will make your searches lightning fast. We're using Elasticsearch on our end, but due to licensing changes I cannot suggest using it right now. If your searches take longer than a few milliseconds, they are going to quickly kill your server under any load.

I'm now using Percona, although I was initially using plain MySQL. That change alone has already decreased the query times I previously mentioned from 21 seconds to about 3. Fetching records or groups of records (by uploader ID, for example) takes practically no time at all, so I'm sure the inverted index will bring that time down even further.

Update:

Note: I am open to suggestions/recommendations/constructive criticisms.

After about a month, this is roughly where I am. I may have left some things out, but here is the TLDR:
I have a working Alpha. I have built and deployed the endpoint. Hopefully, in a few weeks, I'll have a beta for people to try/give feedback on. There is still quite a lot of functionality I've yet to add, and I'll periodically post updates here.
Also, most of my time has been spent on methods to keep my DB updated (this is expounded upon later in this post) vs working on the actual functionality. However, once that's complete, things should move much quicker.
TLDR over.


General:

- Automated updating from /db-export/. I also have a partially working update pipeline that could keep my mirror only about an hour behind e621 when it comes to posts (and only posts). I've tested this with some success; however, there are complications with data consistency on my end that may make it not worth it (e.g. the content of a***.***/posts/13.json may not match, and may be more up to date than, the results of a***.***/api/post?id=13&all=true), among other things. I will continue to work on it for now.
- I've taken the advice from KiraNoot and now my queries are, as they so eloquently put it, "lightning-fast".
- Currently working on organizing the tags into their specific groups. (thanks to Earlopain for how to do that)
- I've been doing my best to exactly mimic the e621 API where I can so that this could be "plug and play" so to speak for certain functionalities.
- I'm settling on a rate limit of one request per 100 ms, and perhaps a paid tier with image caching/proxy downloads and faster overall speeds. (Caching would mean faster batch downloads, but scalability is a glaring issue there for many reasons, so it may not be feasible.)
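
For illustration only, a limit of one request per 100 ms could be enforced with a minimum-interval check like the sketch below. The class name, the per-client keying, and the injectable clock are assumptions made for the example, not details of the actual implementation.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Allow at most one request per `min_interval` seconds for each client."""

    def __init__(self, min_interval: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_allowed: dict[str, float] = {}

    def allow(self, client_id: str) -> bool:
        """Return True and record the time if the client may proceed, else False."""
        now = self.clock()
        last = self.last_allowed.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since this client's last accepted request
        self.last_allowed[client_id] = now
        return True


limiter = MinIntervalLimiter(min_interval=0.1)
print(limiter.allow("client-a"))  # True: the first request is always accepted
```

A rejected request does not reset the timer here, so a client that keeps hammering the endpoint is still served once every 100 ms rather than being locked out.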

Current state:
Functionality in e621's API and also in this API:
- Get /posts/<post_id>.json
- Get /posts/<post_id>/votes.json
- Get /pools/<pool_id>.json
- Get /wiki_pages/<wiki_id>.json
- Get posts by uploader_id (Paginated)

Functionality in e621's API but improved to some degree:
- Get a list of image URLs for up to 200 posts (with the ability to choose between previews and full images). The call for this is just: a***.***/api/images?batch=1&id=0001%0002%0003%...%0200 which returns a JSON array of the post IDs and their URLs.
- Get posts from a date range (by updated_at or uploaded_at) (Paginated).
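
On the client side, a per-request cap like the 200-post limit above means a longer ID list has to be split into several requests. A minimal sketch of that splitting follows; the helper name is hypothetical, and the request itself is left out because the host and ID separator are not fully specified here.

```python
from typing import Iterator, Sequence


def batches(post_ids: Sequence[int], batch_size: int = 200) -> Iterator[list[int]]:
    """Yield consecutive slices of post_ids, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(post_ids), batch_size):
        yield list(post_ids[start:start + batch_size])


ids = list(range(1, 451))  # 450 made-up post IDs
chunk_sizes = [len(chunk) for chunk in batches(ids)]
print(chunk_sizes)  # [200, 200, 50]
```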

Functionality not in the e621 API (to my knowledge) but currently working to some degree in my alternative API:
- Get the approximate number of posts matching a tag query (this is not fully working yet).

Potential future functionality
- If this doesn't get added to the API, I may consider implementing it at some point. https://e621.net/forum_topics/30129
- https://e621.net/forum_topics/29486


lightsbane said:
Get posts from a date range (by updated_at or uploaded_at) (Paginated).

You can search with the date syntax, see the search cheatsheet for more. Only for uploaded date though.

lightsbane said:
- Get posts by uploader_id (Paginated). (this one may be in the e621 API if it is tell me and I'll move it under 'Functionality in e621's API and also in this API')

This is possible by name and id when using user and user_id.
