Topic: Tag syntax expansion question/request for comment

Posted under Off Topic

Alright, so for a while I've had this PR cooking over on GitHub that creates a new tag syntax allowing one thing I see requested so often: the ability to make groups of OR requests. The details and implementation are not important to the context of this topic.

However, the primary pushback I see is that the syntax doesn't make sense, or is too different. But this feature is probably one of the most requested features that I know of, so push forward I must.

So here I am, asking you how you would like to see this implemented?

Here's a few things to consider:

  • The syntax should support OR NOT, which the current syntax doesn't actually support: ~-tag not only looks weird, but the current syntax only checks for the starting character.
    • My syntax fixes this by separating OR from the tags: tag1 ~ -tag2 reads as "tag1 or not tag2"
  • The syntax should support proper grouping regardless of operator.
    • Supporting this implicitly allows chaining multiple OR groups, which is the main point of the request
    • As in being able to search for posts with "tag1 and tag2, or tag3 and tag4" which for my syntax would look like ( tag1 tag2 ) ~ ( tag3 tag4 )
      • Or "tag1 or tag2, and tag3 or tag4", which would be the opposite: ( tag1 ~ tag2 ) ( tag3 ~ tag4 )
      • What's important here is that the groups resolve separately. In the first example, either group has to pass because the OR is between the two groups. In the second example, both groups have to pass because the default operator is AND; however, the ORs are within the groups, so tag1 and tag4 work, or tag2 and tag3, or tag1 and tag3, you get the idea.
      • This syntax allows for complex queries like: ( tag1 tag2 ~ ( tag3 tag4 ) ) ~ tag5 which would mean posts that have (tag1 AND (tag2 OR (tag3 AND tag4))) OR tag5, the parentheses in the explanation are just to properly show the order.
    • Basically, an operator shouldn't care whether what comes before or after it is a tag or a group
    • This means that ~( tag1 tag2 ) ~( tag3 tag4 ) isn't exactly great because that means that every tag inside the group is AND'd (or OR'd, it's kinda ambiguous) together, which isn't always wanted
  • The syntax has to consider tag rules
    • For example, we have tags that end with ), so groups either need to use a different character or be separated by spaces, which is what I've done with my syntax
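
To make the tag-rule point concrete, here is a minimal Python sketch of the space-separated approach. It is not the PR's actual Ruby code; the function names and the tag name tag1_(series) are made up for illustration. Because group markers only count when they stand alone as a whole token, a tag that ends with ) is never mistaken for a group boundary.

```python
def tokenize(query):
    """Split on whitespace only, so parentheses inside a tag name survive."""
    return query.split()


def classify(token):
    """Label a token; only a standalone "(", ")" or "~" is treated as syntax."""
    if token == "(":
        return "group_open"
    if token == ")":
        return "group_close"
    if token == "~":
        return "or"
    return "tag"  # includes negated tags such as "-tag2"


tokens = tokenize("( tag1_(series) ~ -tag2 ) tag3")
print(tokens)
print([classify(t) for t in tokens])
```

The trade-off is the one discussed above: the spaces around ( and ) become mandatory, which is exactly what makes the syntax differ from what people are used to.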

I personally don't think it's possible to make a fully backwards compatible system with the rules currently defining our search syntax since it's so limited, but perhaps you can think of one that meets these requirements, isn't too far off what we already have, and can be agreed upon by users.

Again, the programming behind it doesn't matter, and it's basically already done. This is solely about the syntax, which means changes to the tokenizer rather than the logic.

Offtopic programmer jargon

I can't say for sure this will be implemented as-is, because my Ruby is not the best, so it might have to be rewritten, which would obviously add time to it hitting the site.

I can't say it will be implemented at all, since that's out of my power.

I'm just the guy that is utilizing the open source nature of the site to push ideas onto the table that might be worth pursuing


I think the inability to have backwards compatibility coupled with the learning curve might kill any chances for this to get adopted as the searching system. A lazy way out of this conundrum would be to make the different types of search a toggled option, but that doesn't help because it would still affect {{these kinds of links}} and previously saved/bookmarked searches.

Also yeah I dislike the inability to handle ~-. If we want the functionality, either we can use your system, use another symbol to represent ~- (maybe ^) or just have the search be able to handle ~- directly (though given how long it's been like this, it's not that straightforward to implement)

I think the best way to implement the syntax is just to go with what we already have and build on it. So
~(~t1 ~t2 t3) ~(-t4) (t5 t6) would mean (((t1 OR t2) AND t3) OR -t4) AND (t5 AND t6)

snpthecat said:
A lazy way out of this conundrum would be to make the different types of search a toggled option, but that

This is what the PR currently does

snpthecat said:
~(~t1 ~t2 t3) ~(-t4) (t5 t6)

This could perhaps be usable. My main problem with adding an OR operator to a single tag is that OR isn't logic that can be applied to a single tag. In order for OR to make sense, it requires two sides. However, with the current syntax, that isn't doable, so this might be the only proper middle ground.

I think SNPtheCat's suggestion makes sense here; backwards-compatibility is important.

Also, it might be good to support negating a group. De Morgan's laws mean it's not a necessary feature for full Boolean logic in searches, but supporting it would spare users from needing to know Boolean algebra to negate a group.
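
For anyone who wants to see the equivalence rather than take it on faith, here is a small Python check. It is only an illustration: the function names are made up, and c and d stand for "the post has tag c" and "the post has tag d". It confirms over every combination that "not (c and d)" gives the same answer as "(not c) or (not d)".

```python
from itertools import product


def negated_group(has_c, has_d):
    """What a negated group would mean: the post must not have both tags."""
    return not (has_c and has_d)


def expanded_by_hand(has_c, has_d):
    """The De Morgan rewrite a user would otherwise have to work out."""
    return (not has_c) or (not has_d)


# Try all four combinations of the post having or lacking each tag.
for has_c, has_d in product([False, True], repeat=2):
    assert negated_group(has_c, has_d) == expanded_by_hand(has_c, has_d)
print("negated group and hand-expanded form agree in every case")
```

So group negation adds no new expressive power; it only saves the user from doing the rewrite themselves.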

scth said:
Also, it might be good to support negating a group

That's already supported in my syntax and what was meant by the requirement here:

  • Basically, an operator shouldn't care whether what comes before or after it is a tag or a group

operators being ~ and - basically.

The problem I've always had is what to do with the begin group and end group operators. I appreciate a clean grammar, which means you want to avoid overloading characters or changing their use depending on context. ( and ) are already heavily used in tags, so to keep things clean I'd want to avoid them for representing groups. As said above, { and }, along with [ and ], are already taken by the forum, so I'd want to avoid them too in order to not have to touch the forum's syntax. That leaves < and >, which tend to be associated with XML but would technically work, and I think are our best candidates.

As for the searching itself, I don't see any problem with making this backwards compatible. From a lexer perspective, there doesn't need to be any difference between the search tag and the search <tag>, as both can be treated as expressions that apply the unary operators outside of them. This would enable backwards compatibility with the existing grammar, which is very important for people with bookmarks. So, for example, if someone wanted to do the search tag1 ~<tag2 tag3> -<tag4 tag5>, it would be parsed into the following expression in functional notation: AND(tag1, OR(AND(tag2, tag3)), NOT(AND(tag4, tag5))). Now, this does have a bit of a wonky situation of its own, where a user might expect ~<tag2 tag3> to be equivalent to ~tag2 ~tag3. I think that's inevitable given that ANDs are implicit in the expression rather than explicit, but it's nothing that I don't think can be overcome.
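
As a rough illustration of that parse, here is a Python sketch. It is not code from the site or the PR: every name in it is hypothetical, it assumes < and > never appear inside tag names, and it assumes all ~ operands in the same scope are collected into one OR, as the example above implies.

```python
import re

# An optional ~ or - prefix, followed by "<", ">" or a tag name.
TOKEN_RE = re.compile(r"\s*([~-]?)(<|>|[^\s<>]+)")


def tokenize(query):
    return [(m.group(1), m.group(2)) for m in TOKEN_RE.finditer(query)]


def make_and(items):
    # A lone operand needs no AND wrapper, so "tag" and "<tag>" parse alike.
    return items[0] if len(items) == 1 else ("AND", items)


def parse_items(tokens, pos=0, in_group=False):
    items = []        # AND'd operands, in order of appearance
    or_bucket = None  # every ~-prefixed operand in this scope shares one OR
    while pos < len(tokens):
        prefix, body = tokens[pos]
        pos += 1
        if body == ">":
            if not in_group:
                raise ValueError("unmatched '>'")
            return make_and(items), pos
        if body == "<":
            operand, pos = parse_items(tokens, pos, in_group=True)
        else:
            operand = body
        if prefix == "-":
            items.append(("NOT", [operand]))
        elif prefix == "~":
            if or_bucket is None:
                or_bucket = ("OR", [])
                items.append(or_bucket)
            or_bucket[1].append(operand)
        else:
            items.append(operand)
    if in_group:
        raise ValueError("missing '>'")
    return make_and(items), pos


def parse(query):
    expr, _ = parse_items(tokenize(query))
    return expr


def render(node):
    """Print the tree in functional notation."""
    if isinstance(node, str):
        return node
    op, operands = node
    return f"{op}({', '.join(render(child) for child in operands)})"


print(render(parse("tag1 ~<tag2 tag3> -<tag4 tag5>")))
print(render(parse("~tag2 ~tag3")))
```

The first line printed is the functional-notation expression from the example, and the second is OR(tag2, tag3), which shows the wonky case: ~<tag2 tag3> and ~tag2 ~tag3 really do parse differently.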

The other challenge this brings is that it makes search complexity harder to compute. Do we want to preserve the 6 tag limit across all tags (all leaf nodes in the lexer syntax tree) for searches, or do we want to allow something more complex? If we did allow expanding, that'd bloat the SQL queries and make each query quantifiably more expensive; however, if we didn't, this new syntax would quickly be limited in what it could do. I'd say maybe see if the site could afford to up it to 9 tags, but that'd be NMNY's call in the end, I think, since Dragonfruit's the one financing the site.
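
Counting tags as leaf nodes is at least cheap to do. Here is a Python sketch under assumptions: the nested-tuple tree shape and the names count_leaves and TAG_LIMIT are made up for illustration, and the limit of 6 is just the figure mentioned above.

```python
TAG_LIMIT = 6  # the limit as described in this thread; an assumption here


def count_leaves(node):
    """Count the tags (leaf nodes) in a nested (operator, operands) tree."""
    if isinstance(node, str):
        return 1
    _operator, operands = node
    return sum(count_leaves(child) for child in operands)


# Hand-built tree for: tag1 ~<tag2 tag3> -<tag4 tag5>
expr = ("AND", [
    "tag1",
    ("OR", [("AND", ["tag2", "tag3"])]),
    ("NOT", [("AND", ["tag4", "tag5"])]),
])

used = count_leaves(expr)
print(used, "tags used, within limit:", used <= TAG_LIMIT)
```

Operators and groups cost nothing in this count, so how deeply a search nests would not matter, only how many tags it names.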

Those are my thoughts on this so far.

kyureki said:
that'd bloat the SQL queries and make each query quantifiably more expensive

The queries aren't SQL, they're in OpenSearch.

And the main thing I hear is that backwards compatibility is what's needed, so introducing an entirely new character set, < >, probably won't be the greatest in that way. But like I said, a fully backwards compatible syntax limits us to the mistakes made during the original creation, which didn't have the forethought to allow for these groups in the first place. Not that it's anyone's fault specifically; it just wasn't a desired feature at the time, and extending the syntax to allow it within the limitations of the existing syntax is near impossible.

tarrgon said:
The queries aren't SQL, they're in OpenSearch.

Ah, my bad, I hadn't actually looked that deeply into the code itself.

tarrgon said:
And the main thing I hear is that backwards compatibility is what's needed, so introducing an entirely new character set, < >, probably won't be the greatest in that way. But like I said, a fully backwards compatible syntax limits us to the mistakes made during the original creation, which didn't have the forethought to allow for these groups in the first place. Not that it's anyone's fault specifically; it just wasn't a desired feature at the time, and extending the syntax to allow it within the limitations of the existing syntax is near impossible.

You aren't really fixing it though, as the implicit AND still exists in your version, and in fact makes things more complex by making AND (implicit infix binary), OR (explicit infix binary), and NOT (explicit prefix unary) all work differently. Without explicit grouping characters, users would be required to learn the precedence order of these operators, which is why grouping operators such as ( and ) tend to exist in mathematics. This creates more cognitive load on users than what I proposed, where the operators are internally consistent. Now, I could see it argued that we should make AND explicit as well, but then you're making searches visually more complex by needing to do something like tag1 & tag2.

As for introducing < and >, I proposed them because no tag currently uses them, and they're reserved characters not used anywhere else in the syntax, thus they would still preserve backwards compatibility while offering grouping delimiters. These grouping delimiters are important to allow overriding of precedence, as described above. Your only alternatives are to switch to using either prefix or postfix notation, such as Polish notation in mathematics. However, this requires that all operators be explicitly defined.

What's cool about having a local copy of the database exports, is that I can just run these complex searches very easily, and without worrying about server resource limitations.

alphamule said:
What's cool about having a local copy of the database exports, is that I can just run these complex searches very easily, and without worrying about server resource limitations.

That's basically what I did for https://e621.net/forum_topics/40732, except I just gave it an API and a syntax I prefer. I mean, I don't think the userscript works anymore, but the backend does, and I use it as my daily driver for janitor-related things in my tools.

Here's the breakdown of that syntax:

( a b c ) ~ ( d e f ) means > (a & b & c) OR (d & e & f)
a b ( c d ) means > a & b & c & d (nothing special)
-a b c means > not a, b & c
a b -( c d ) means > a & b, not (c & d) (meaning it can have either c or d, but not both)
a ~ b c means > (a OR b) & c
a ~ b ~ c d means > (a OR b OR c) & d
a ~ b ~ ( d e ) means > (a OR b OR (d and e)) (either a or b or the post has both d and e)
a ~ b ( -c ~ -e ) means > (a OR b) & (not c OR not e) (meaning a or b, AND the post doesn't have c or the post doesn't have e)
( -a ~ b ) means > (not a) OR b
a ( b ( c ) ) means > a & b & c
a ( b ~ ( c e ) ) means > a & (b OR (c and e))
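
For reference, the breakdown above can be reproduced with a small recursive-descent parser. This is a Python sketch, not the actual backend: all names are hypothetical, tokens are assumed to be separated by spaces, and ~ is assumed to bind tighter than the implicit AND, which is what the examples imply.

```python
def parse(query):
    tokens = query.split()
    node, pos = _parse_and(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unmatched ')'")
    return node


def _parse_and(tokens, pos):
    # Everything listed side by side is AND'd, up to ")" or end of input.
    operands = []
    while pos < len(tokens) and tokens[pos] != ")":
        node, pos = _parse_or(tokens, pos)
        operands.append(node)
    return ("AND", operands), pos


def _parse_or(tokens, pos):
    # A chain such as: a ~ b ~ ( c d )
    node, pos = _parse_unit(tokens, pos)
    operands = [node]
    while pos < len(tokens) and tokens[pos] == "~":
        node, pos = _parse_unit(tokens, pos + 1)
        operands.append(node)
    return (operands[0] if len(operands) == 1 else ("OR", operands)), pos


def _parse_unit(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("operator at end of query")
    token = tokens[pos]
    if token in ("(", "-("):
        inner, pos = _parse_and(tokens, pos + 1)
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # step over the closing ")"
        return (("NOT", [inner]) if token == "-(" else inner), pos
    if token in (")", "~"):
        raise ValueError(f"unexpected {token!r}")
    if token.startswith("-") and len(token) > 1:
        return ("NOT", [token[1:]]), pos + 1
    return token, pos + 1


def matches(node, post_tags):
    """Check a parsed query against the set of tags on one post."""
    if isinstance(node, str):
        return node in post_tags
    op, operands = node
    results = [matches(child, post_tags) for child in operands]
    if op == "NOT":
        return not results[0]
    return any(results) if op == "OR" else all(results)


print(matches(parse("a b -( c d )"), {"a", "b", "c"}))       # True
print(matches(parse("a b -( c d )"), {"a", "b", "c", "d"}))  # False
print(matches(parse("a ~ b c"), {"b", "c"}))                 # True
```

Since tokens are whole space-separated words, a tag that itself ends with ) passes through as an ordinary tag.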


Speaking as an old fart, I tend to read ~ as "invert" (bitwise-negate) and - as "subtraction". I'm a little bit used to - meaning "not" here on e621, though.

I'd use | (or ||) for OR, and ! for NOT. If it needs an explicit AND, then & (or &&) for that. (Most of the time, though, just listing two tags together is an implicit AND.)

What to use as a grouping operator is harder. I think ( and ) are probably the least bad option, but it will take lots of testing to make sure they work right with existing tags that end in parentheses.

Any possible tie-in to the blacklisting system (in terms of making syntax / capacities of search and blacklist comparable) should also be considered. That could make the new / extended search syntax easier to remember purely on the basis of exposure.
