E621 Advanced Search
This project has been a long time in the making. I made it to (in my opinion) significantly enhance e621's search syntax.
E621 Advanced Search is primarily a tampermonkey userscript to make searching on e621 better.
Download guide:
- Removed. The script no longer functions and I have no intention of fixing it, but the syntax is still used on my site, so I will be keeping this post up.
Other Links
RE621
The userscript is untested with RE621. It might work, it might not. Or it might break some things. Use them together at your own risk.
Note: Due to the way pagination works, I've switched to an infinite scrolling post gallery, rather than actual pages. This only happens when you search with advanced search.
E621 Advanced Search adds many things e6's base searching lacks, but it comes with trade-offs (see Caveats). The main enhancement is the ability to use tag groups with parentheses, letting you combine AND and OR logic across separate groups.
For example: duo ( male ~ female ) ( ambiguous_gender ~ gynomorph ) will match posts containing a male or female and an ambiguous_gender or gynomorph. Since it also searches for duo, this will not match posts that have both a male and female, or both an ambiguous_gender and gynomorph. Of course this relies on tags being correct, so please tag posts!
E621 Advanced Search also supports groups inside of groups, but this is much more complicated and only for the power users. You can use ~ between groups, as well.
Like so: duo ( male female ) ~ ( ambiguous_gender gynomorph ) will only match posts with a male and a female, or an ambiguous_gender and a gynomorph. Since duo was used, we can assume that posts containing a male will not also contain an ambiguous_gender, and vice versa (and likewise for the other two tags).
Like e6, you can also use - to negate the next tag or group: -female means the post won't have a female, and -( female male ) means the post won't have both a female and a male, but could have either of them by itself.
Here's an in-depth example with many different combinations:
Example
Notes:
- Group parentheses should always be preceded and followed by a space, unless the parenthesis is the first or last character in the query, or the preceding character is a - (see below)
- Don't do: (cat dog)
- Do: ( cat dog )
- ~ should always go between two tags, separated by spaces; it should never be directly attached to a tag
- Don't do: ~cat ~dog
- Do: cat ~ dog
- - SHOULD directly precede a tag
- Don't do: - cat
- Do: -cat or -( cat dog )
- There is a max query complexity that you probably won't hit, but if you do, just know I'm not changing it
- Wildcards are the most complex due to the way they work
- I'm not sure what the max number of tags is; it's not something I've hard-coded, but you will eventually hit the above complexity cap
In case you're wondering how powerful this syntax is, I offer these queries (they're weird, what can I say):
Query 1
Expanded: female ( presenting ~ seductive ) ( ( rear_view ( panties_down ~ panties_aside ) ) ~ ( upskirt ) )
Simplified: female presenting ~ seductive ( ( rear_view ( panties_down ~ panties_aside ) ) ~ upskirt )
I personally prefer expanding my queries, but it doesn't matter; they do the same thing.
This query will match all posts containing a female who is either presenting or seductive, and:
- a rear view with panties down or aside;
- OR upskirt
This kind of query is impossible with the built-in search, and showcases the power of tag grouping combined with OR logic.
Query 2
Expanded: animated ( ( irrumatio ~ face_mounting ) ~ ( oral pov ) )
Simplified: animated irrumatio ~ face_mounting ~ ( oral pov )
This query returns all animated posts that have:
- irrumatio or face mounting;
- OR pov oral
This query is good at showing the power of grouping combined with the default AND logic.
Additions
Any additions that advanced search has over the normal search functionality will be listed here.
- It is now possible to both search and order by non-implicated tag count. This means you can search for low tag posts while ignoring implications. Mainly useful for the visual tagger
- You can use tltagcountgen, tltagcountart, tltagcountcopy, tltagcountchar, tltagcountspec, tltagcountinv, tltagcountmeta, and tltagcountlore to search the specific tag categories.
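For example (my own illustration, assuming these accept the same range syntax as the other count metatags): tltagcountgen:..5 should match posts with 5 or fewer non-implicated general tags.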
Caveats
When using e621 advanced search, you'll notice a few differences:
- Blacklisting by uploader username no longer works
- I don't save user data anywhere, and it's impossible to efficiently grab it on the client, so this feature will not work
- I recommend blacklisting by uploader id instead anyway, as it is better
- Searching by any username-related field no longer works; this goes for approver (approver:), uploader (user:), favorites (fav:), etc.
- Posts will show as not being favorited, even if they are
- There's no good way to get all of a user's favorites without it taking forever, so it's generally just not worth doing
- Use the id versions if they exist. You can add the ! if you want, but it will be ignored and all versions will be treated as if they were ids
- Searching for votes (voted:)/commented on by (commenter:)/noted updated by (noteupdater:)/deleted by (deletedby:) of a specific user is no longer possible
- Searching by description (description:)/note (note:)/delete reason (delreason:) is no longer possible
- Searching for posts pending replacements (pending_replacements) is no longer possible
- Searching for posts in any pool (inpool:)/set (inset:) is no longer possible
- You can, however, search by a specific pool (pool:)/set (set:) using ids, as in pool:1 or set:1; the posts in these are fetched on the client and OR'd into the search query ( id:1 ~ id:2 ... )
- They aren't in the database export
- inpool is technically possible, but I have yet to find a use case where you'd be looking for posts in any pool rather than a specific one
- Ordering by comment_bumped is no longer possible
- e621 doesn't return comment bump in the api, and it's not present in the database export
- Ordering by changed is no longer possible
- change_seq is not available in the database export, and fetching all posts to get it is expensive
- You can order by updated instead, but this will include any post update
- Order tags are always top level regardless of where they appear in the query
- This does mean you can order by multiple things, however. The default order is by id, and there is no limit on the number of order tags you can use
- If order:random is present, all other order tags are ignored
- If order:rank is present, the rank score will always be the first order factor
- File sizes are now exact, use the range operator (..) to define a range if necessary
- -status:deleted is assumed in every search unless specifically present somewhere in the group or its parent groups. This does not look at child groups, so if you want deleted posts inside of a group, put it at the topmost group where it should apply.
- All sort values are assumed descending when using order:; use _asc to sort ascending
- Sometimes deleted posts are missed and will still show in results; these should last no longer than a few minutes
- Currently all date related tags must be in ISO format
- Eventually I will make a parser to do what e621 does with its "yester"- and "ago"-like syntaxes, but at a later date
These caveats are why, when using this script, you have the option to also search normally. On the home page and posts page, there is now an additional button labeled "Search normal"; when clicked, it executes a normal search without going through the extension.
With all that out of the way, let me know what you think! If you find any issues, report them here or on the GitHub; mainly if some tag doesn't work, if a post you expect doesn't show up, or if a post shows up that shouldn't.
Any feedback is welcome!
Other Projects
yiff.today - Online e621 slideshow
https://yiff.today, an online slideshow viewer for e621 with some pretty nice features, if I do say so myself. Uses this search api and syntax.
Check out the post here: https://e621.net/forum_topics/40665
Visual Tagger
https://yiff.today/visualtagger, lets you tag posts with a similar UI to the slideshow viewer, but also lets you view implications right next to the image itself.
Has plenty of useful features, like:
- Redoing changes from a previous post, mainly for quickly tagging alternate images
- A quick identification of low post count tags
- You can see what tags are implicated when you add a tag, and all of them will show in the menu
- And more, read the post for a more complete feature list
Check out the post here: https://e621.net/forum_topics/40904
Technical doohickies
The code is open source here, and you can fully run it locally, but maybe you just want to know what's going on under the hood without actually reading the code.
This is a mixed level overview of how this works, including how I get posts, update them, etc.
The beginning
Database exports
We start off with the database exports. When first starting the system, you won't have any data, and fetching all of it from the api would take hours (latest post id: 4350449, at a max of 320 results per page that's 4350449/320 ≈ 13,596 api requests; at a rate limit of 1 request per second that's 13,596 seconds ≈ 3.8 hours, not including posts purged from the site, so the actual time will be a bit less). By the time you're done processing it, all of the data will be massively outdated and you'll be playing catchup, so instead we just process the database exports.
The exports this system processes are the post exports, tag exports, and tag alias exports. All exports are processed in batches of 10,000 to speed things up.
Tag exports
The tag export is processed first, because the post export relies on having tags inside the database already to resolve the tag string to tokens. This takes a few seconds at most as it's a small file, and only 3 fields are saved: the id, the tag name itself, and its category.
As of November 2nd, post count and updated at are also saved.
Post exports
The post export takes the longest to process. Not only is it the biggest, but it's also the most complex.
To save storage, post tags are saved as their id, rather than their actual words. Numbers are faster to query: instead of searching an entire string for a substring, we search an array for a number.
However, to save the category of each tag without having to rebuild it, post tags are saved twice: once in a flattened array, and once in a 2d array where the index is the id of the tag's category. While this is slightly space-inefficient, it's worth the time saved when rebuilding the tags back into their original api object, which is a dictionary of tag category name to the tags in that category.
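As a rough sketch, a stored post looks something like this (the field names are my own illustration, not the actual schema):

```ts
// Hypothetical document shape; the real field names may differ.
interface StoredPost {
  id: number;
  // Every tag id on the post, flattened, for fast "has tag" queries.
  flattenedTags: number[];
  // The same ids grouped by category: categorizedTags[categoryId] -> tag ids.
  // This lets the api's { general: [...], artist: [...] } object be rebuilt
  // without looking each tag's category back up.
  categorizedTags: number[][];
}
```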
The main time expense is converting tags to their respective token for saving, as each tag has to query the database for its id. I speed this up by maintaining a tag cache of tag name to tag id while processing the post export. This could be sped up further by batch-requesting the tag ids from the names rather than getting each one separately; however, I didn't want to work that out when I made it, so for now every tag that isn't in the cache is requested by itself.
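A minimal sketch of that cache, assuming a hypothetical lookupTagIdInDatabase stand-in for the real database call:

```ts
// A hypothetical stand-in for the real database lookup.
declare function lookupTagIdInDatabase(name: string): Promise<number>;

const tagIdCache = new Map<string, number>();

async function getTagId(name: string): Promise<number> {
  const cached = tagIdCache.get(name);
  if (cached !== undefined) return cached;
  // One database round trip per uncached tag; batching would be faster.
  const id = await lookupTagIdInDatabase(name);
  tagIdCache.set(name, id);
  return id;
}
```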
Since parent/child relationships can be out of order, I also have to maintain a hanging relationships database. After the post export is processed, every post has its children updated, if it has any. This can take a while, as it has to find every post that has a parent, check that the parent post exists and doesn't already have the child id, and only then insert it. If the parent post doesn't exist, the relationship is added to the hanging relationships instead, as the post may appear in the future. Hanging relationships allow parent posts to almost always have their children correct when they are added.
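A sketch of that child-linking pass; every helper here is hypothetical, but the flow matches the description above:

```ts
// Hypothetical database helpers.
declare function getPost(id: number): Promise<{ children: number[] } | null>;
declare function savePost(post: { children: number[] }): Promise<void>;
declare function saveHangingRelationship(parentId: number, childId: number): Promise<void>;

async function linkChild(parentId: number, childId: number): Promise<void> {
  const parent = await getPost(parentId);
  if (parent === null) {
    // Parent isn't imported yet; remember the link for when it appears.
    await saveHangingRelationship(parentId, childId);
  } else if (!parent.children.includes(childId)) {
    parent.children.push(childId);
    await savePost(parent);
  }
}
```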
This export takes 20-30 minutes to process fully with the current 4 million posts.
Tag alias exports
This simple index holds each tag alias' id, antecedent name, and consequent tag id. When a query arrives, all tags are checked for aliases when getting each tag's id.
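Roughly, alias resolution during tag id lookup looks like this (the helpers are hypothetical):

```ts
// Hypothetical helpers for the alias index and tag lookup.
declare function findAliasByAntecedent(name: string): Promise<{ consequentTagId: number } | null>;
declare function getTagId(name: string): Promise<number>;

async function resolveTag(name: string): Promise<number> {
  const alias = await findAliasByAntecedent(name);
  // An aliased tag resolves straight to its consequent tag's id.
  return alias !== null ? alias.consequentTagId : getTagId(name);
}
```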
This export usually takes a few seconds to process.
The update process
To ensure the database is up to date, every 5 seconds updates are processed. However, this only happens after the previous update is finished. Updates take about 50 seconds to process, resulting in a total of about 55 seconds between updates.
This time can vary a lot depending on how many updates happened since the previous update, and how many new tags need to be fetched (1 req/sec). A good range is anywhere between 30 seconds and 1.5 minutes.
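The scheduling itself is simple; a minimal sketch, assuming a runUpdateCycle that does all the work described below:

```ts
// Hypothetical; one full update cycle, ~50s on average.
declare function runUpdateCycle(): Promise<void>;

// A cycle never overlaps the previous one; the next starts 5 seconds
// after the last finished, giving ~55 seconds between updates.
async function updateLoop(): Promise<void> {
  while (true) {
    await runUpdateCycle();
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}
```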
Adding new posts
The first thing the update does is check for new posts after the latest post in the database. This usually doesn't take too long as there aren't that many posts added per minute. Unknown tags are automatically fetched from the e621 api if they don't exist.
Checking for updates
There are usually anywhere between 100 and 2500 updates since the last update cycle. To ensure posts are up to date, I check whether any commonly changed values differ from what's in the database, or whether the updated at value is different (it usually is, but since this requires date parsing, I check it last, right before checking tags).
To ensure we have all updates, I always progress through at least 10 pages of updates. With the 1 request/second rate limit, this takes at minimum 10 seconds. If there are still updates on the current page, the requester will continue past page 10 until it hits a page with no updates.
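A sketch of that loop (the helper names are my own):

```ts
// Hypothetical helpers for fetching and applying a page of updated posts.
declare function fetchUpdatedPostsPage(page: number): Promise<object[]>;
declare function applyUpdates(posts: object[]): Promise<void>;

async function fetchUpdates(): Promise<void> {
  for (let page = 1; ; page++) {
    const posts = await fetchUpdatedPostsPage(page);
    await applyUpdates(posts);
    // Always read at least 10 pages, then stop at the first empty one.
    if (page >= 10 && posts.length === 0) break;
    // Respect the 1 request/second api rate limit.
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
}
```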
Checking for misses
Checking for misses is nearly identical to the above, but orders by change sequence instead of updated at. Rarely does anything get missed, but it's better safe than sorry.
Adding new tag aliases
New tag aliases are also added in the exact same way new posts are added.
The API
Well there had to be some way to get the data from the database via http request. To do this, a very simple api is used. The main api is hosted at https://search.yiff.today.
Converting search text into database query
This was the hard part. Properly resolving all tag groups, negations, ORs, everything, was difficult; especially because I had never used this kind of database before (Elasticsearch, which I found out after adding it is actually what e621 uses, so I guess our searches led to the same result lol).
The first step in converting search text to a database query is tokenizing the text and processing each token by itself. This breaks up the daunting task of processing the entire query into processing a single token at a time.
One of advanced search's biggest features, tag grouping, comes from this one process. When tokenized, we consume each token one by one and decide what it does to the overall query. For example, if a token is (, we know that it opens a new group and all tags going forward are inside that group until the closing ). However, since advanced search supports groups inside of groups, we have to ensure that any other ('s and )'s don't affect this group, but rather create another group inside of the existing one. Groups are inserted as a placeholder value like __0 where 0 is the group's index within this group.
Most tokens are just inserted as-is, but metatags have to be further processed, as they are queries by themselves. All tokens are checked for metatags, and if any are found, I parse the metatag into the equivalent database query and insert a placeholder to preserve execution order; metatag placeholders look like --0, where 0 is the metatag's database query index within the group.
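A condensed sketch of the group parsing (negation and metatag handling omitted for brevity):

```ts
interface Group {
  tokens: string[]; // tags plus "__N" group placeholders
  groups: Group[];  // nested groups, referenced by index
}

function parseGroups(tokens: string[]): Group {
  const root: Group = { tokens: [], groups: [] };
  const stack: Group[] = [root];
  for (const token of tokens) {
    const current = stack[stack.length - 1];
    if (token === "(") {
      // Open a nested group, leaving a "__N" placeholder in the parent
      // so execution order is preserved.
      const child: Group = { tokens: [], groups: [] };
      current.tokens.push(`__${current.groups.length}`);
      current.groups.push(child);
      stack.push(child);
    } else if (token === ")") {
      stack.pop();
    } else {
      current.tokens.push(token);
    }
  }
  return root;
}

// parseGroups("duo ( male ~ female )".split(" "))
// => { tokens: ["duo", "__0"],
//      groups: [{ tokens: ["male", "~", "female"], groups: [] }] }
```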
Parsing metatags into database queries
Metatags are tags relating to the data about the image itself rather than the actual user-submitted tags. They allow you to search for things like uploader id, width, height, ratio, etc. Because they deal with data that aren't tags, I need to process them separately into database queries by themselves.
The metatag parser is really boring and I hate the way I did it, but it works. First I check whether the token is actually a metatag; if it isn't, it's treated like a regular tag token. Otherwise I determine the type of metatag and turn it into a database query accordingly. Most metatags are simple equality queries, but some require actual scripting, as not all metadata is directly stored. The metatag parser returns a database query, which is inserted into the group's parsed metatags array, and the placeholder is inserted into the result to ensure the overall query is built correctly.
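In sketch form, assuming a hypothetical tryParseMetatag, token processing looks like:

```ts
interface ParsedGroup {
  tokens: string[];
  parsedMetatags: object[]; // prebuilt database queries
}

// Hypothetical: returns a database query, or null for a plain tag.
declare function tryParseMetatag(token: string): object | null;

function processToken(token: string, group: ParsedGroup): void {
  const metaQuery = tryParseMetatag(token);
  if (metaQuery === null) {
    group.tokens.push(token); // plain tag, inserted as-is
  } else {
    // Keep the prebuilt query aside and leave a "--N" placeholder so
    // the overall query is assembled in the right order later.
    group.tokens.push(`--${group.parsedMetatags.length}`);
    group.parsedMetatags.push(metaQuery);
  }
}
```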
Back in the overall query parser, after tokens are processed into their groups and metatags, I need to convert all those new tokens into their respective tag ids. Doing this is very simple: I start at the top group and convert all of its tokens to ids, then I loop over the current group's groups and repeat the process until there are no groups left. When the entire process is done, every group is an array of tag ids and placeholders, rather than tag names and placeholders.
Finally we can actually build the database query. For the most part we follow a simple flow chart:
- Is the current token a number?
- Treat it as a tag id
- Is the current token a -?
- Negate the next token
- Is the next token a ~, or are we currently resolving a previous OR?
- Treat it as an OR
- Does the current token start with __?
- Repeat the process with the group at the index after the __ in this position
- Does the current token start with --?
- Insert the metatag's preprocessed database query in this position
Eventually this chain will unwind and we'll be left with a usable database query.
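A rough sketch of that flow chart as an Elasticsearch-style bool query builder; OR (~) handling is omitted, and the field names are my own illustration, not the real schema:

```ts
interface ResolvedGroup {
  tokens: (number | string)[]; // tag ids, "-", "~", "__N", "--N"
  groups: ResolvedGroup[];
  parsedMetatags: object[];
}

function buildQuery(group: ResolvedGroup): object {
  const must: object[] = [];
  const mustNot: object[] = [];
  let negateNext = false;
  for (const token of group.tokens) {
    if (token === "-") {
      negateNext = true; // applies to the next clause only
      continue;
    }
    let clause: object;
    if (typeof token === "number") {
      clause = { term: { flattenedTags: token } }; // plain tag id
    } else if (token.startsWith("__")) {
      clause = buildQuery(group.groups[Number(token.slice(2))]); // recurse into the nested group
    } else if (token.startsWith("--")) {
      clause = group.parsedMetatags[Number(token.slice(2))]; // prebuilt metatag query
    } else {
      continue; // "~" (OR) handling omitted in this sketch
    }
    (negateNext ? mustNot : must).push(clause);
    negateNext = false;
  }
  return { bool: { must, must_not: mustNot } };
}
```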
Other query params and body
While advanced search does support pagination using page, it is much preferred to send a post request with a body that contains the previous request's searchAfter field, as {searchAfter: prevResponseSearchAfter}. This is because searchAfter can paginate through the entire database, where page cannot.
You can also change the limit of documents returned with limit; there's a hard cap of 320 just for the sake of keeping response times down.
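Putting it together, deep pagination looks roughly like this (the endpoint path and the query/posts field names are my assumptions; searchAfter and limit are as described above):

```ts
// Sketch only: check the source for the real request/response shapes.
async function searchAll(query: string): Promise<void> {
  let searchAfter: unknown = undefined;
  while (true) {
    const response = await fetch("https://search.yiff.today/", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query, limit: 320, searchAfter }),
    });
    const data = await response.json();
    if (!data.posts || data.posts.length === 0) break;
    // ...consume data.posts here...
    searchAfter = data.searchAfter; // cursor for the next page
  }
}
```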
End
That's basically it. I glossed over some things for the sake of my own sanity, but most of it is in there. If you have any questions about how something works, please let me know! I'll answer anything to the best of my ability.