Topic: E621 Advanced Search

E621 Advanced Search

This project has been a long time in the making. I made it to significantly enhance e621's search syntax.

E621 Advanced Search is primarily a Tampermonkey userscript to make searching on e621 better.

Download guide:

  • Removed. The script no longer functions and I have no intention of fixing it, but the syntax is still used on my site, so I will be keeping this up.

Other Links

RE621

The userscript is untested with RE621. It might work, it might not. Or it might break some things. Use them together at your own risk.

Note: Due to the way pagination works, I've switched to an infinite scrolling post gallery, rather than actual pages. This only happens when you search with advanced search.

E621 Advanced Search adds many things e6's base searching lacks, but it comes with trade offs (see Caveats). The main enhancement is the ability to use tag groups with parentheses allowing you to match different groups outside of other groups.

For example: duo ( male ~ female ) ( ambiguous_gender ~ gynomorph ) will match posts containing a male or female, and an ambiguous_gender or gynomorph. Because it also searches for duo, it will not match posts that have both a male and a female, or both an ambiguous_gender and a gynomorph. Of course this relies on tags being correct, so please tag posts!

E621 advanced search also supports groups inside of groups, but this is much more complicated and only for the power users. You can use ~ between groups, as well.

Like so: duo ( male female ) ~ ( ambiguous_gender gynomorph ) will only match posts with a male and a female, or an ambiguous_gender and a gynomorph. Since duo was used, we can assume that posts that contain a male will not also contain an ambiguous_gender, and vice versa (the same goes for the other two tags).

Like e6, you can also use - to negate the next tag or group: -female means the post won't have a female, and -( female male ) means the post won't have both a female and a male, but could have either of them by themselves.

Here's an in depth example with many different combinations:

Example
( a b c ) ~ ( d e f ) means > (a & b & c) OR (d & e & f)
a b ( c d ) means > a & b & c & d (nothing special)
-a b c means > not a, b & c
a b -( c d ) means > a & b, not (c & d) (meaning it can have either c or d, but not both)
a ~ b c means > (a OR b) & c
a ~ b ~ c d means > (a OR b OR c) & d
a ~ b ~ ( d e ) means > (a OR b OR (d and e)) (either a or b or the post has both d and e)
a ~ b ( -c ~ -e ) means > (a OR b) & (not c OR not e) (meaning a or b AND the post doesn't have c or the post doesn't have e)
( -a ~ b ) means > (not a) OR b
a ( b ( c ) ) means > a & b & c
a ( b ~ ( c e ) ) means > a & (b OR (c and e))
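
To make the table above concrete, here is a minimal Python sketch that evaluates a query against one post's tag list. It is not the project's code: the function names (parse_group, evaluate, matches) are made up for illustration, and it assumes tokens are separated by spaces, that ~ joins the operand before it with the operand after it, and that everything else is ANDed.

```python
def parse_group(tokens, pos=0):
    """Parse tokens into a list of OR-chains (the chains are ANDed together).

    Each chain is a list of (negated, operand) pairs, where operand is either
    a tag name or a nested list of chains for a parenthesised group.
    """
    chains = []
    join_next = False  # True right after a "~"
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == ")":
            return chains, pos + 1
        if tok == "~":
            join_next = True
            pos += 1
            continue
        negated = tok.startswith("-") and len(tok) > 1
        body = tok[1:] if negated else tok
        if body == "(":
            operand, pos = parse_group(tokens, pos + 1)
        else:
            operand, pos = body, pos + 1
        if join_next and chains:
            chains[-1].append((negated, operand))  # extend the current OR-chain
        else:
            chains.append([(negated, operand)])    # start a new ANDed chain
        join_next = False
    return chains, pos


def evaluate(chains, tags):
    """True if every chain has at least one operand that holds for `tags`."""
    def holds(negated, operand):
        if isinstance(operand, str):
            value = operand in tags
        else:
            value = evaluate(operand, tags)
        return value != negated
    return all(any(holds(n, o) for n, o in chain) for chain in chains)


def matches(query, tags):
    chains, _ = parse_group(query.split())
    return evaluate(chains, set(tags))
```

For instance, matches("a b -( c d )", ["a", "b", "c"]) is true, while adding d to the tag list makes it false.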

Notes:

  • Group parentheses should always be preceded and followed by a space unless the parenthesis is the first or last character in the query, or the preceding character is a - (see below)
    • Don't do: (cat dog)
    • Do: ( cat dog )
  • ~ should always sit between two tags with a space on each side, it should never be attached directly to a tag
    • Don't do: ~cat ~dog
    • Do: cat ~ dog
  • - SHOULD directly precede a tag or group
    • Don't do: - cat
    • Do: -cat or -( cat dog )
  • There is a max complexity, that you probably won't hit, but if you do, just know, I'm not changing it
    • Wildcards are the most complex due to the way they work
  • I'm not sure what the max tags are, it's not something I've hard coded, but you will eventually hit the above complexity cap

In case you're wondering how powerful this syntax is, I offer these queries (they're weird, what can I say):

Query 1

Expanded: female ( presenting ~ seductive ) ( ( rear_view ( panties_down ~ panties_aside ) ) ~ ( upskirt ) )
Simplified: female presenting ~ seductive ( ( rear_view ( panties_down ~ panties_aside ) ) ~ upskirt )

I personally prefer expanding my queries, but it doesn't matter, they do the same thing.

This query will match all posts containing a female either seductive or presenting, and:
- a rear view with panties down or aside;
- OR upskirt

This kind of query is impossible with the built in search, and showcases the power of the syntax of tag grouping with the or logic.

Query 2

Expanded: animated ( ( irrumatio ~ face_mounting ) ~ ( oral pov ) )
Simplified: animated irrumatio ~ face_mounting ~ ( oral pov )

This query returns all animated posts that have:
- irrumatio or face mounting;
- OR pov oral

This query is good at showing the power of the syntax using grouping with the default and logic.

Additions

Any additions that advanced search has over the normal search functionality will be listed here.

  • It is now possible to both search and order by non-implicated tag count. This means you can search for low tag posts while ignoring implications. Mainly useful for the visual tagger
    • You can use tltagcountgen, tltagcountart, tltagcountcopy, tltagcountchar, tltagcountspec, tltagcountinv, tltagcountmeta, and tltagcountlore to search the specific tag categories.

Caveats

When using e621 advanced search, you'll notice a few differences:

  • Blacklisting by uploader username no longer works
    • I don't save user data anywhere. It's impossible to efficiently grab it on the client. So this feature will not work
      • I recommend blacklisting by uploader id instead anyways as it is better
  • Searching by any username related field no longer works, this goes for approver (approver:), uploader (user:), favorites (fav:) etc
  • Posts will show as not being favorited, even if they are
    • There's no real good way to get all favorites of a user without it taking forever, so it's generally just not worth it to do so
    • Use the id versions if they exist. You can add the ! if you want, but it will be ignored and all versions will be treated as if they were ids
  • Searching for votes (voted:)/commented on by (commenter:)/notes updated by (noteupdater:)/deleted by (deletedby:) of a specific user is no longer possible
  • Searching by description (description:)/note (note:)/delete reason (delreason:) is no longer possible
  • Searching for posts pending replacements (pending_replacements) is no longer possible
  • Searching for posts in any pools (inpool:)/sets (inset:) is no longer possible
    • You can however search by specific pool (pool:)/set (set:) using ids, as in pool:1 or set:1 as posts in these are fetched on the client and OR'd into the search query ( id:1 ~ id:2 ... )
    • They aren't in the database export
    • inpool is technically possible, but I have yet to find a use that you'd be looking for posts in any pool, and not a specific one
  • Ordering by comment_bumped is no longer possible
    • e621 doesn't return comment bump in the api, and it's not present in the database export
  • Ordering by changed is no longer possible
    • change_seq is not available in the database export, and fetching all posts to get it is expensive
    • You can order by updated, however, but this will include any post updates
  • Order tags are always top level regardless of where they appear in the query
    • This does mean you can order by multiple things, however. Default order is by id. There is no limit on the amount of order tags you can use
      • If order:random is present, all other order tags are ignored
      • If order:rank is present, the rank score will always be the first order factor
  • File sizes are now exact, use the range operator (..) to define a range if necessary
  • -status:deleted is assumed in every search unless specifically present somewhere in the group, or parent groups. This does not look at child groups, so if you want deleted posts inside of a group, put it at the top most where it could apply.
  • All sort values are assumed descending when using order:, use _asc to sort ascending
  • Sometimes deleted posts are missed and will still show in the posts, these should last no longer than a few minutes
  • Currently all date related tags must be in ISO format
    • Eventually I will make a parser to do what e621 does with its "yester" and "ago" like syntaxes, but at a later date

These caveats are why, when using this script, you have the option to also search normally. On the home page and posts page, there is now an additional button that says "Search normal". Clicking it executes a normal search without going through the extension.

With all that out of the way, let me know what you think! If you find any issues, send them here or on the GitHub. Mainly if some tag doesn't work, or if a post doesn't show up that you expect to, or if a post is showing that shouldn't.

Any feedback is welcome!

Other Projects
yiff.today - Online e621 slideshow

https://yiff.today, an online slideshow viewer for e621 with some pretty nice features, if I do say so myself. Uses this search api and syntax.

Check out the post here: https://e621.net/forum_topics/40665

Visual Tagger

https://yiff.today/visualtagger, lets you tag posts with a similar UI to the slideshow viewer, but also lets you view implications right next to the image itself.

Has plenty of useful features, like:

  • Redoing changes from a previous post, mainly for quickly tagging alternate images
  • A quick identification of low post count tags
  • You can see what tags are implicated when you add a tag, and all of them will show in the menu
  • And more, read the post for a more complete feature list

Check out the post here: https://e621.net/forum_topics/40904

Technical doohickies

While the code is open source here, and you can fully run this locally, what if you just want to know what's going on under the hood without having to actually look at the code?

This is a mixed level overview of how this works, including how I get posts, update them, etc.

The beginning

Database exports

We start off with the database exports. When first starting the system, you won't have any data, and fetching all of it from the api would take hours. With the latest post id at 4350449 and a max of 320 results per page, that's 4350449/320 = ~13,596 api requests. At a rate limit of 1 request per second, that's 13,596 seconds = ~3.8 hours (this doesn't account for stuff that's purged from the site, so the actual time will be a bit less). By the time you're done processing it, all of the data will be massively outdated and you're going to be playing catchup, so instead we just process the database exports.
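
The estimate above can be reproduced in a few lines of Python. The constants are the figures quoted in the paragraph; the variable names are only for illustration.

```python
import math

LATEST_POST_ID = 4_350_449     # latest post id quoted above
MAX_RESULTS_PER_PAGE = 320     # api page size limit
SECONDS_PER_REQUEST = 1        # rate limit of 1 request per second

# Upper bound: assumes every id up to the latest still exists (purged posts would lower it).
requests_needed = math.ceil(LATEST_POST_ID / MAX_RESULTS_PER_PAGE)
hours = requests_needed * SECONDS_PER_REQUEST / 3600

print(requests_needed, round(hours, 1))
```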

The exports this system processes are the post exports, tag exports, and tag alias exports. All exports are processed in batches of 10,000 to speed things up.

Tag exports

The tag export is processed first. This is because the posts export relies on having tags inside the database already to resolve the tag string to tokens. This takes a few seconds at most as it's a small file, and only 3 fields are saved, the id, the tag name itself, and its category.

As of November 2nd, post count and updated at are also saved.

Post exports

The post export takes the longest to process. Not only is it the biggest, but it's also the most complex.

To save storage, post tags are saved as their id, rather than their actual words. This is because numbers are faster to query than searching an entire string for a substring. Instead, we search an array for a number.

However, to save the category of each tag without having to rebuild it, post tags are saved twice: once in a flattened array, and once in a 2d array where the index is the id of the tag's category. While this is slightly space inefficient, it's worth the time saved when rebuilding the tags back into their original api object, which is a dictionary of tag category name to the tags in that category.
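
A rough Python sketch of that double storage, under stated assumptions: the category ids and names in CATEGORY_NAMES are illustrative only, and store_tags / rebuild_api_tags are hypothetical helpers, not functions from the project.

```python
# Illustrative subset of category id -> api key; the real mapping may differ.
CATEGORY_NAMES = {0: "general", 1: "artist", 3: "copyright", 4: "character", 5: "species"}


def store_tags(tag_ids_with_category, category_count=6):
    """Save tag ids twice: flat (for searching) and grouped by category id."""
    flat = []
    by_category = [[] for _ in range(category_count)]
    for tag_id, category in tag_ids_with_category:
        flat.append(tag_id)
        by_category[category].append(tag_id)  # index = category id
    return {"flat": flat, "by_category": by_category}


def rebuild_api_tags(stored, tag_names):
    """Rebuild the api-style dict of category name -> tag names without re-querying categories."""
    return {
        CATEGORY_NAMES[category]: [tag_names[tag_id] for tag_id in ids]
        for category, ids in enumerate(stored["by_category"])
        if category in CATEGORY_NAMES
    }


stored_example = store_tags([(10, 0), (11, 5), (12, 1)])
api_tags_example = rebuild_api_tags(stored_example, {10: "duo", 11: "cat", 12: "some_artist"})
```

Searching only ever touches the flat array, while the grouped copy exists purely so the response can be rebuilt cheaply.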

The main time expense is converting tags to their respective token for saving, as each tag has to query the database for its id. I speed this up by maintaining a tag cache of tag name to tag id while processing the post export. This can be further sped up by batch requesting the tag ids from the names rather than getting each one separately, however I didn't want to work that out when I made it, so for now every tag that isn't in the cache is requested by itself, rather than in a batch.
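
As a sketch of the batched variant mentioned above (which, per the paragraph, is not what the project currently does), the cache lookup could look something like this in Python. fetch_ids is a hypothetical callable standing in for whatever database lookup is used.

```python
def resolve_tag_ids(tag_names, cache, fetch_ids):
    """Resolve tag names to ids, asking the database once for all cache misses.

    fetch_ids(names) is assumed to return a dict of name -> id for the names it knows.
    """
    # dict.fromkeys de-duplicates while keeping order
    missing = [name for name in dict.fromkeys(tag_names) if name not in cache]
    if missing:
        cache.update(fetch_ids(missing))
    # Names the lookup did not return are skipped in this sketch.
    return [cache[name] for name in tag_names if name in cache]


# Tiny demonstration with a fake lookup that records how often it is called.
fetch_calls = []

def fake_fetch_ids(names):
    fetch_calls.append(list(names))
    return {name: 100 + i for i, name in enumerate(names)}

tag_cache = {"cat": 1}
resolved_ids = resolve_tag_ids(["cat", "dog", "dog", "bird"], tag_cache, fake_fetch_ids)
```

Here the fake lookup is hit once for both dog and bird, and any later post using those tags is served from the cache.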

Since parent/child relationships could be out of order, I also have to maintain a hanging relationships database. After the post exports are processed, every post has its children updated, if it has any. This can take a while as it has to find every post that has a parent, check if the parent post exists, and if the parent post doesn't have the child id already in it, and only then insert it. If the parent post doesn't exist, then it has to add the relationship to the hanging relationships, as the post may appear in the future. Hanging relationships allow parent posts to almost always have their children correct when they are added.
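
The hanging relationship idea can be sketched in Python with plain dicts standing in for the database. The post shape and the helper names (attach_to_parent, add_post) are assumptions for illustration.

```python
def attach_to_parent(posts, hanging, post_id):
    """Record post_id as a child of its parent, or park it if the parent is unknown."""
    parent_id = posts[post_id]["parent_id"]
    if parent_id is None:
        return
    parent = posts.get(parent_id)
    if parent is None:
        # Parent has not been seen (yet): remember the relationship for later.
        hanging.setdefault(parent_id, set()).add(post_id)
    elif post_id not in parent["children"]:
        parent["children"].append(post_id)


def add_post(posts, hanging, post_id, parent_id=None):
    """Add a post, picking up any children that were waiting for it."""
    children = sorted(hanging.pop(post_id, set()))
    posts[post_id] = {"parent_id": parent_id, "children": children}
    attach_to_parent(posts, hanging, post_id)


demo_posts, demo_hanging = {}, {}
add_post(demo_posts, demo_hanging, 2, parent_id=1)   # child arrives before its parent
hanging_before = dict(demo_hanging)
add_post(demo_posts, demo_hanging, 1)                # parent arrives and claims the child
```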

This export takes 20-30 minutes to process fully with the current 4 million posts.

Tag alias exports

This simple index holds each tag alias' id, antecedent name, and consequent tag id. When a query arrives, all tags are checked for aliases when getting each tag's id.

This export usually takes a few seconds to process.

The update process

To ensure the database is up to date, every 5 seconds updates are processed. However, this only happens after the previous update is finished. Updates take about 50 seconds to process, resulting in a total of about 55 seconds between updates.

This time can vary a lot depending on how many updates happened since the previous update, and how many new tags need to be fetched (1 req/sec). A good range is anywhere between 30 seconds and 1.5 minutes.

Adding new posts

The first thing the update does is check for new posts after the latest post in the database. This usually doesn't take too long as there aren't that many posts added per minute. Unknown tags are automatically fetched from the e621 api if they don't exist.

Checking for updates

There are usually anywhere between 100 and 2500 updates since the last update happened. To ensure posts are up to date, I check if any commonly changed values are different from what's in the database, or if the updated at value is different. The updated at value usually is different, but since comparing it requires date parsing, I only check it right before checking tags.

To ensure we have all updates, I always go through at least 10 pages of updates. With the 1 request/second rate limit, this takes at minimum 10 seconds to process. If there are updates on the current page, the requester will continue until there are no updates on the page.
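
The paging rule described above could be sketched like this in Python. fetch_page and is_changed are hypothetical callables (one api request per call, and the comparison against the database); stopping on an empty page is an extra assumption.

```python
import time


def collect_updates(fetch_page, is_changed, min_pages=10, delay=1.0):
    """Walk update pages: always at least min_pages, then until a page has no changes."""
    changed = []
    page = 1
    while True:
        posts = fetch_page(page)
        page_changes = [post for post in posts if is_changed(post)]
        changed.extend(page_changes)
        # Stop on an empty page, or once past the minimum with nothing new on this page.
        if not posts or (page >= min_pages and not page_changes):
            return changed
        time.sleep(delay)  # respect the 1 request/second rate limit
        page += 1


# Demonstration with fake data: each page holds one "post", the first 12 have changed.
fetched_pages = []

def fake_fetch_page(page):
    fetched_pages.append(page)
    return [page]

updates = collect_updates(fake_fetch_page, lambda post: post <= 12, delay=0)
```

In the demonstration the loop runs past the 10 page minimum because pages 11 and 12 still contain changes, and stops on page 13.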

Checking for misses

Checking for misses is nearly identical to the above, but orders by change sequence instead of updated at. Rarely does anything get missed, but it's better safe than sorry.

Adding new tag aliases

New tag aliases are also added in the exact same way new posts are added.

The API

Well, there had to be some way to get the data from the database via HTTP request. To do this, a very simple api is used. The main api is hosted at https://search.yiff.today.

Converting search text into database query

This was the hard part. Properly resolving all tag groups, negations, ORs, everything, was difficult, especially because I had never used this kind of database before (Elasticsearch; I found out after I added it that it's actually what e621 uses, so I guess our searches led to the same result lol).

The first step in converting search text to a database query is tokenizing the text and processing each token by itself. This breaks up the daunting task of processing the entire query, to processing a single token.

One of advanced search's biggest features, tag grouping, comes from this one process. When tokenized, we consume each token one by one and decide what it does to the overall query. For example, if a token is (, we know that it opens a new group and all tags going forward are inside that group until the closing ). However, since advanced search supports groups inside of groups, we have to ensure that any other ('s and )'s don't affect this group, but rather create another group inside of the existing one. Groups are inserted as a placeholder value like __0, where 0 is the group's index within its parent group.
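
A minimal Python sketch of that grouping step, assuming space-separated tokens and a simple dict per group. The structure ({"tokens": [...], "groups": [...]}) and the name parse_groups are illustrative, and splitting a leading - into its own token is a simplification made here.

```python
def parse_groups(tokens):
    """Nest tokens into groups, leaving a __N placeholder where each child group sits."""
    root = {"tokens": [], "groups": []}
    stack = [root]  # innermost open group is last
    for tok in tokens:
        current = stack[-1]
        if tok.startswith("-") and len(tok) > 1:
            current["tokens"].append("-")  # negation applies to whatever comes next
            tok = tok[1:]
        if tok == "(":
            child = {"tokens": [], "groups": []}
            current["tokens"].append(f"__{len(current['groups'])}")
            current["groups"].append(child)
            stack.append(child)
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("closing ) without a matching (")
            stack.pop()
        else:
            current["tokens"].append(tok)
    if len(stack) != 1:
        raise ValueError("group opened with ( was never closed")
    return root


parsed = parse_groups("a ( b ~ ( c e ) ) -( d )".split())
```

The stack is what keeps inner parentheses from closing the outer group: each ( pushes a new group and each ) pops only the innermost one.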

Most tokens are just inserted as is, but metatags have to be further processed as they are queries by themselves. All tokens are checked for metatags, and if a token is one, I parse it into the equivalent database query and insert a placeholder to preserve execution order. Metatag placeholders look like --0, where 0 is the index of the metatag's database query within the group.

Parsing metatags into database queries

Metatags are tags relating to the data about the image itself rather than the actual user submitted tags. They allow you to search for things like uploader id, width, height, ratio, etc. Because they deal with data that isn't tags, I need to process them separately into database queries by themselves.

The metatag parser is really boring and I hate the way that I did it, but it works. First I check to see if the token is actually a metatag, if it isn't it's treated like a regular tag token, otherwise I determine the type of metatag it is and turn it into a database query accordingly. Most metatags are simple queries of equality, but some require actual scripting as not all metadata is directly stored. The metatag parser returns a database query which is inserted into the group's parsed metatags array and the placeholder is inserted into the result to ensure the overall query is built correctly.
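
To illustrate the "simple queries of equality" case and the range syntax, here is a small Python sketch that turns a numeric metatag into an Elasticsearch-style term or range clause. The set of metatag names, the use of the metatag name as the field name, and treating .. as inclusive are all assumptions made for this example, not the project's actual mapping.

```python
# Hypothetical subset of numeric metatags; field names simply reuse the metatag name.
NUMERIC_METATAGS = {"width", "height", "score", "tagcount", "tltagcount"}


def parse_metatag(token):
    """Return a query dict for a numeric metatag, or None if the token is a regular tag."""
    name, sep, value = token.partition(":")
    if not sep or name not in NUMERIC_METATAGS:
        return None
    if ".." in value:
        low, high = value.split("..", 1)
        bounds = {}
        if low:
            bounds["gte"] = int(low)
        if high:
            bounds["lte"] = int(high)
        return {"range": {name: bounds}}
    # Two-character operators are checked first so ">=" is not read as ">".
    for prefix, op in ((">=", "gte"), ("<=", "lte"), (">", "gt"), ("<", "lt")):
        if value.startswith(prefix):
            return {"range": {name: {op: int(value[len(prefix):])}}}
    return {"term": {name: int(value)}}
```

A token that comes back as None would simply be kept as a regular tag token.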

Back to the overall query parser, after tokens are processed into their groups and metatags, I need to convert all those new tokens into their respective tag id. Doing this is very simple, I start at the top group, and convert all of its tokens to ids, then I loop over the current group's groups and repeat the process over and over again until there are no groups left. When the entire process is done every group is now an array of tag ids, and placeholders, rather than tag names and placeholders.

Finally we can actually build the database query. For the most part we follow a simple flow chart:

  • Is the current token a number?
    • Treat it as a tag id
  • Is the current token a -?
    • Negate the next token
  • Is the next token a ~, or are we currently resolving a previous OR?
    • Treat it as an OR
  • Does the current token start with __?
    • Repeat the process with the group at the index after the __ in this position
  • Does the current token start with --?
    • Insert the metatag's preprocessed database query in this position

Eventually this chain will unwind and we'll be left with a usable database query.
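
The flow chart can be sketched in Python as a recursive builder that emits an Elasticsearch-style bool query. This is a guess at the general shape rather than the project's code: the "tags" field name, the group layout (tokens, groups, metatags) and the function name build_query are assumptions.

```python
def build_query(group):
    """Turn one parsed group into a bool query, recursing into __N child groups."""
    chains = []          # each chain is ORed internally; chains are ANDed together
    negate = False
    join_next = False    # True right after a "~"
    for tok in group["tokens"]:
        if tok == "-":
            negate = True
            continue
        if tok == "~":
            join_next = True
            continue
        if isinstance(tok, int):
            node = {"term": {"tags": tok}}                     # a tag id
        elif tok.startswith("__"):
            node = build_query(group["groups"][int(tok[2:])])  # child group
        elif tok.startswith("--"):
            node = group["metatags"][int(tok[2:])]             # preprocessed metatag query
        else:
            raise ValueError(f"unexpected token: {tok!r}")
        if negate:
            node = {"bool": {"must_not": [node]}}
        if join_next and chains:
            chains[-1].append(node)
        else:
            chains.append([node])
        negate = join_next = False
    must = [
        chain[0] if len(chain) == 1
        else {"bool": {"should": chain, "minimum_should_match": 1}}
        for chain in chains
    ]
    return {"bool": {"must": must}}


# Roughly "a ~ b -( c d ) width:>=1000" after tags were converted to ids 1..4.
example_group = {
    "tokens": [1, "~", 2, "-", "__0", "--0"],
    "groups": [{"tokens": [3, 4], "groups": [], "metatags": []}],
    "metatags": [{"range": {"width": {"gte": 1000}}}],
}
example_query = build_query(example_group)
```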

Other query params and body

While advanced search does support pagination using page, it is much preferred to send a post request with a body that contains the previous request's searchAfter field as {searchAfter: prevResponseSearchAfter}. This is because it can paginate through the entire database, where page cannot.
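
A client-side loop for searchAfter pagination might look like the following Python sketch. The search callable stands in for the actual post request, and the response shape (a "posts" list next to "searchAfter") is an assumption; only the searchAfter body field comes from the description above.

```python
def iterate_all(search, query, limit=320):
    """Yield every post for a query by feeding each response's searchAfter into the next request."""
    search_after = None
    while True:
        body = {} if search_after is None else {"searchAfter": search_after}
        response = search(query, limit, body)
        posts = response.get("posts", [])
        yield from posts
        search_after = response.get("searchAfter")
        if not posts or search_after is None:
            break


# Fake backend with three "pages", keyed by the searchAfter value that requests them.
FAKE_PAGES = {
    None: {"posts": [3, 2], "searchAfter": [2]},
    (2,): {"posts": [1], "searchAfter": [1]},
    (1,): {"posts": [], "searchAfter": None},
}

def fake_search(query, limit, body):
    key = body.get("searchAfter")
    return FAKE_PAGES[None if key is None else tuple(key)]

all_posts = list(iterate_all(fake_search, "cat ~ dog"))
```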

You can also change the limit of documents returned with limit, there's a hard cap of 320 just for the sake of keeping response times down.

End

That's basically it, I glossed over some things for the sake of my own sanity, but most of it is in there. If you have any questions about how something works, please let me know! I'll answer anything to the best of my ability.

Note: Due to the way pagination works, I've switched to an infinite scrolling post gallery, rather than actual pages. This only happens when you search with advanced search.

When using e621 advanced search, you'll notice a few differences:

This leads to an obvious question IMO: When is that?

One answer would be, whenever you use advanced query syntax, ie. use queries that contain isolated ~ or ( or ). But I don't want to assume that and I think it would encourage people to use it if you were more explicit about it.

(for example I get the impression that fav: is a reasonably popular way of searching, but can't tell for sure that, providing one doesn't use advanced search operators, fav: would work, only that when one does use advanced search operators, one can't use fav:)

savageorange said:
This leads to an obvious question IMO: When is that?

One answer would be, whenever you use advanced query syntax, ie. use queries that contain isolated ~ or ( or ). But I don't want to assume that and I think it would encourage people to use it if you were more explicit about it.

(for example I get the impression that fav: is a reasonably popular way of searching, but can't tell for sure that, providing one doesn't use advanced search operators, fav: would work, only that when one does use advanced search operators, one can't use fav:)

This is why I provided a way to use regular searching if you wanted by clicking the "search normal" button. I'm not sure when I'll get favorite searching in yet.

Though I am actually wondering what you're saying other than that. Most of these caveats are impossible to fully rectify as they just aren't returned in the api response, or aren't present in the database export. The simple answer is: use it if you want to, if you can find use out of it, go ahead, and if you don't, then don't. I personally find the features gained much more useful than the features lost, which was why I made it in the first place. I figured there may be other people out there like me, so I made it available to everyone.

definitelynotafurry4 said:
This is why I provided a way to use regular searching if you wanted by clicking the "search normal" button. I'm not sure when I'll get favorite searching in yet.

Ok? That is also not mentioned in your post.

My post is entirely about communication, ie. there is context here that is obvious to you, that you haven't mentioned, that could cause people to not want to try it, independently of however it actually works. I didn't mean to either criticize or praise what you made. I haven't tried it, partially because I don't want to deal with resolving potential RE621 conflicts. All I can say on that front is it seems like it could be useful, and in theory I might use it.

bitWolfy

Former Staff

Is search.yiff.today some kind of mirror of e6's database that you are running?

savageorange said:
Ok? That is also not mentioned in your post.

My post is entirely about communication, ie. there is context here that is obvious to you, that you haven't mentioned, that could cause people to not want to try it, independently of however it actually works. I didn't mean to either criticize or praise what you made. I haven't tried it, partially because I don't want to deal with resolving potential RE621 conflicts. All I can say on that front is it seems like it could be useful, and in theory I might use it.

Yeah I did miss adding that, mainly because I didn't see it as a feature really, rather just something that was there. I've added that to the main post now. What else were you saying I missed? Like you said, I have much more context so I don't even think about most things that are in there, so if I missed anything else in the main post, let me know and I'll provide more context.

bitwolfy said:
Is search.yiff.today some kind of mirror of e6's database that you are running?

I indexed the database export and use the api to fetch updates. It is kinda a mirror, but it contains much less data. Only taking up about 6GB. Since I can't query e6's database directly, I need to build my own, and that means I gotta host a front facing api somewhere. I just hosted it off my other site since I didn't wanna pay for another domain name.

I should've mentioned somewhere that it's all open source and you can find it all here: https://github.com/DontTalkToMeThx/e621AdvancedSearch/tree/main

As of 2023-10-13 5:48 pm ET, I have also added a pretty lengthy technical disclosure.

savageorange said:
for example I get the impression that fav: is a reasonably popular way of searching, but can't tell for sure that, providing one doesn't use advanced search operators, fav: would work, only that when one does use advanced search operators, one can't use fav:

I've also just added fav: for user id (as in fav:495015), this will expand like pools and sets to ( id:favoritePostId1 ~ id:favoritePostId2 ... ). Still nothing for usernames yet. If I can get something working for usernames to convert to ids efficiently, I will add it.

This does not work properly since it'd only get the first page of favorites.

definitelynotafurry4 said:

  • Searching for posts in any pools (inpool:)/sets (inset:) are no longer possible
    • You can however search by specific pool (pool:)/set (set:) using ids, as in pool:1 or set:1 as posts in these are fetched on the client and OR'd into the search query ( id:1 ~ id:2 ... )
    • They aren't in the database export

Sets aren't, but pools very much are?

wat8548 said:
Sets aren't, but pools very much are?

Pools are a database export of their own. Processing exports already takes a while, and downloading and processing another export could noticeably increase that time. Since you can already search pools by id as well, inpool:true is more often than not a tag that won't be helped by this search syntax. Any time you're using it, you probably don't need the advanced syntax and should just use the search normally button.

If there is a reason that the advanced syntax would be needed while using inpool:, please let me know and I'll consider processing the export. I personally have never used inpool: before, so I didn't find any reason to process the export just to support one metatag.

New update, order:rank is now supported (this was undocumented to my knowledge, I had no idea it existed). The scoring function is the exact same, but resolving it with other tags will work properly. As stated in: https://e621.net/forum_topics/40963, when using order:rank with other tags, some were just missed for some reason. E621 advanced search gets it right, however, and you should expect to see proper results when using filtering.

Also, a meta tag parsing issue was fixed for fields using the range syntax

Random ordering has been fixed, along with randseed. These should work properly now.

As of 11/30/23, you can now search by top level tags, aka non-implicated tags. This will allow you to find posts that have low tag counts while ignoring implicated tags.

The syntax for this is the same as tagcount, but you can use these keywords: tltagcount, topleveltagcount, or nonimplicatedtagcount.

Examples:

  • nonimplicatedtagcount:5
  • tltagcount:<10
  • topleveltagcount:4..8

I'm having difficulty using this. I input a very basic search to test it out: ( cat ~ dog ) ( low res ~ animated )

But the results page was totally blank! It didn't give me the failed search message, just... blankness.

Am I doing something wrong?

arousedunderling said:
I'm having difficulty using this. I input a very basic search to test it out: ( cat ~ dog ) ( low res ~ animated )

But the results page was totally blank! It didn't give me the failed search message, just... blankness.

Am I doing something wrong?

It appears that the userscript has broken at some point, I'll have to look into why

arousedunderling said:
I'm having difficulty using this. I input a very basic search to test it out: ( cat ~ dog ) ( low res ~ animated )

But the results page was totally blank! It didn't give me the failed search message, just... blankness.

Am I doing something wrong?

Should be fixed in v0.4

arousedunderling said:
Works like a charm now! That was a way quicker response time than I was expecting! Thanks a ton!

Np, just make sure to keep the listed caveats in mind as they may interfere with your expectations.

Maintenance is being done on the database to add a new meta-search possibility. Search time may be degraded during this time while all documents are updated to add the new field. It will allow you to search non-implicated tags by specific category rather than just overall. More about the syntax will be added here once the updating is done. This will take anywhere between 5-10 hours.

EDIT: This will take longer than expected because the way I stored it is actually not searchable: the database auto sorts arrays for binary searching, making it impossible to maintain the order. All good though, the new way will be faster to query anyway, but it will still take about 5-10 hours again...

EDIT #2: All done.
The new metatags are tltagcountgen, tltagcountart, tltagcountcopy, tltagcountchar, tltagcountspec, tltagcountinv, tltagcountmeta, and tltagcountlore.

You can use the range syntax with them. You can also use topleveltagcountCATEGORY, where CATEGORY is the full name of the category, such as topleveltagcountartist.
