Topic: [Feature] Expansion of the ASCII characters.

Posted under Site Bug Reports & Feature Requests

Allow the site to accept other scripts, like Japanese, Chinese, Korean, and so on, for artist names. Not every artist has an English name, their pixiv stacc isn't always a reliable source for a name, and they don't always have a Twitter account to at least get a reliable name from.

Also, I couldn't update the artist wiki, because it reports an ASCII-character incompatibility for those characters.

The admins have expressed a desire to phase out non-ASCII characters from tags, so not allowing them for new artists would seem to be intentional. However, I do agree this should still be supported, at least for artist names if nothing else, because not all artists have a romanized name on display, and if an uploader can't read a particular language/script they won't be able to romanize it themselves. That only leaves the options of copy-pasting the artist's name (which will fail for new artists and non-ASCII characters) or leaving it as unknown_artist/none, making it harder to tag properly later when someone can romanize the name and get aliases made.

Wow, this brings back memories of the Windows 98 days, when nothing supported Unicode. Although, to be fair, Unicode is trollsome, with its aliases and nearly/exactly identical glyphs for very different code points. Maybe some restriction like romanized plus original-language variants, so we get the best of both worlds? (Like E-hentai.) Or maybe that makes no sense? :)

Actually, some of those issues are precisely why some games like Evony banned Unicode, which was ironic in some ways.

There should at least be an easier way to alias foreign names to "proper" romanized aliases. I've made a dozen threads so far to alias different artists, and they either get ignored or overlooked. Getting rid of non-ASCII characters doesn't help with this if the only alias there is to search is the name in foreign characters.

EDIT: Now you can't make threads to alias tags with under 50 posts, which just makes it even harder to translate/romanize foreign names.

Idea: whitelist the non-troublesome glyphs, then share said list with everyone else on the Internet dealing with the stupid abuse-prone ones. Or have a filter that substitutes the nearly-identical glyphs with their ASCII counterparts if they're too similar.

Another (weird) solution would be this:
Transform the multi-byte representations of non-ASCII characters into even longer sequences of ASCII characters that are allowed in tags.
This would let the DB contain only ASCII while still allowing users to search with Unicode (with the same static translation table working in the background).

half of an example (not even using Unicode[1]): ⼡
-> 0x212c45
-> 0x21, 0x2c, 0x45
-> "!", ",", "E" (all direct hits - no control characters here)
-> would need to be represented with at least four of the allowed characters (0-9, lowercase a-z, "_"/" ", "-" and ???)

The search field would need to check if the input contains Unicode or not and apply the transformation table accordingly.

[1] Wikipedia article)
(the trailing bracket at the end of the URL belongs to the URL but e6 isn't parsing it)

EDIT1
Now that I've thought a bit more about it, it doesn't even seem that difficult:
0x00 to 0xff are just 256 possibilities, and a-z, 0-9, "_" (and maybe "-") are the allowed characters.
-> 26 + 10 + 1 = 37 -> 18*19 = 342 possible combinations if limited to unmistakable transformations ([a-r] = 18 options for the first character and 19 for the second), but up to 37^2 = 1369 if mixups are prevented in some other way.
(input = "a⼡", hypothetical output = "as"+"_z"+"gh"+"eu" = "as_zgheu" -> s_, zg and he would not be in the table)
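
A minimal sketch of the two-characters-per-byte table described above (the exact alphabets are my guess at the [a-r]/19-character split, and the function names are hypothetical): the first-position characters come from a-r and the second-position characters from the remaining allowed set, so the two alphabets are disjoint and a decoder can never misalign a pair.

```python
# Sketch of the byte -> two-character table from the EDIT above.
# 18 first-position chars * 19 second-position chars = 342 >= 256, so every
# byte value 0x00-0xFF gets a unique pair. Because the two alphabets are
# disjoint, a first-position character always starts a pair.
FIRST = "abcdefghijklmnopqr"    # 18 possible first characters
SECOND = "stuvwxyz0123456789_"  # 19 possible second characters

def encode_bytes(data: bytes) -> str:
    """Expand each byte into a two-character ASCII pair."""
    return "".join(FIRST[b // 19] + SECOND[b % 19] for b in data)

def decode_bytes(text: str) -> bytes:
    """Invert encode_bytes (expects an even-length string of valid pairs)."""
    return bytes(
        FIRST.index(text[i]) * 19 + SECOND.index(text[i + 1])
        for i in range(0, len(text), 2)
    )
```

With this split, the three UTF-8 bytes of "⼡" become a six-character all-ASCII string, and `decode_bytes(encode_bytes(data))` round-trips any byte sequence.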

EDIT2
It would be a very special ;-) binary-to-text encoding, and it might even be possible to use a 2-byte -> 3-character bijective function. (Though note: 37^3 = 50,653 < 2^16 = 65,536, so the 37-character set is slightly too small for a true 2-byte bijection; at least 41 characters would be needed, since 41^3 = 68,921.)

OP needs to correct their post. ASCII has nothing to do with expanded language characters unless they happen to be talking umlauts and accent marks.

pocket_erector said:
OP needs to correct their post. ASCII has nothing to do with expanded language characters unless they happen to be talking umlauts and accent marks.

WTF are you talking about? (US-)ASCII by definition has nothing to do with any characters other than the standard English alphabet (plus some control characters and so on), not even umlauts and accent marks.

What does the "Original Poster"(?) need to correct?
This incompatibility is exactly what this thread is about, or am I missing something?

pocket_erector said:
OP needs to correct their post. ASCII has nothing to do with expanded language characters unless they happen to be talking umlauts and accent marks.

you're probably thinking of codepage 1252 or ISO 8859-1

Clearly the real answer would be a character substitution on the backend that replaces Chinese characters with pinyin, Japanese with romaji (differentiating between kanji and Chinese by which Unicode blocks the characters are in, I'm guessing), Korean with its English-equivalent pronunciation, etc.

Then nobody can really get it wrong, and if there are weird Unicode tricks being played by individual artists, they can be aliased away on a case-by-case basis.
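
For what it's worth, a rough sketch of the "which Unicode block is this character in" test (the ranges are from the Unicode code charts; actual romanization would still need a transliteration step on top of this classification):

```python
# Rough script classification by Unicode code-point range.
def classify_char(ch: str) -> str:
    cp = ord(ch)
    if cp < 0x80:
        return "ascii"
    if 0x3040 <= cp <= 0x309F:           # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:           # Katakana block
        return "katakana"
    if 0xAC00 <= cp <= 0xD7A3:           # Hangul Syllables block
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:
        # CJK Unified Ideographs: shared by Chinese hanzi, Japanese kanji,
        # and Korean hanja, so the range alone can't separate those.
        return "cjk_ideograph"
    return "other"
```

Note the catch: kana and hangul are unambiguous, but hanzi and kanji share the CJK Unified Ideographs block, so block ranges alone can't tell Chinese and Japanese ideographs apart.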

Why don't we just use the decoded entities of Unicode artist names and store them like in this example (but replace the % with _ instead of &; & was a stupid choice)?

The next step could be to add an automatic en-/decoder for the search fields and specifically for the artist field in the upload form.

If artist names can be either completely ASCII exclusive-or Unicode(-ASCII), we can drop all the "%U" prefixes except the first one too (if all decoded entities are always 4 characters long).
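
As a hedged sketch of that idea (the function names are made up, and it assumes every code point fits in the Basic Multilingual Plane so that four hex digits per character suffice):

```python
import re

# Sketch of the proposed "_u" tag encoding: a single "_u" prefix followed by
# one fixed 4-hex-digit group per character (BMP-only assumption).
def encode_artist(name: str) -> str:
    if name.isascii():
        return name                      # plain ASCII names stay untouched
    if any(ord(ch) > 0xFFFF for ch in name):
        raise ValueError("non-BMP character; 4 hex digits are not enough")
    return "_u" + "".join(f"{ord(ch):04X}" for ch in name)

def decode_artist(tag: str) -> str:
    m = re.fullmatch(r"_u((?:[0-9A-F]{4})+)", tag)
    if not m:
        return tag                       # not an encoded tag: pass through
    hexes = m.group(1)
    return "".join(chr(int(hexes[i:i + 4], 16))
                   for i in range(0, len(hexes), 4))
```

So ドロマメ would be stored as _u30C930ED30DE30E1 and round-trips exactly, while plain ASCII names pass through untouched.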

--------

EDIT:
This way a *Monkey userscript.js could automatically convert the Unicode entities back to their original characters on the user side without any modification to e621.net itself.
(intercept search and upload forms where Asian artists names are automatically decoded and encode them in the tag lists and so on.)

And @faucet proposed romanizing instead of Unicode entities.

kalider said:
Why don't we just use the decoded entities of Unicode artists and store them like in this example (but replace the % with _ instead of &. & was a stupid choice).

The next step could be to add an automatic en-/decoder for the search fields and specifically for the artist field in the upload form.

If artist names can either be completely ASCII exclusive-or Unicode(-ASCII) we can drop all the "%U" except for the first one too (if all decoded entities are always 4 chars long).

Even if there's a viewer-end decoder, how are people supposed to search for these tags without memorizing the exact string of random letters and numbers? If you're also suggesting that leaving the names purely decoded is fine, then I'll especially disagree. It's unintuitive, hard to read, will make it difficult to distinguish artists, and isn't even a simple solution (names can be romanized per whatever system is most widely used).

strikerman said:
Even if there's a viewer-end decoder, how are people supposed to search for these tags without memorizing the exact string of random letters and numbers?

The userscript wouldn't just be a decoder but a whole codec which works both ways.
(intercept search and upload forms where Asian artists names are automatically decoded and encode them in the tag lists and so on.)
Looks like you started your reply before I added my edit but posted after I was finished. ;-)

Everyone without the userscript could use e.g. https://www.online-toolz.com/tools/text-unicode-entities-convertor.php (which I already linked to).

strikerman said:
If you're also suggesting that leaving the names purely decoded is fine, then I'll especially disagree. It's unintuitive, hard to read, will make it difficult to distinguish artists, and isn't even a simple solution (names can be romanized per whatever system is most widely used).

That is exactly what I'm suggesting for e6's backend.
And while I agree with your counterarguments, the pros outweigh them IMHO:

  • The artists at least each get a distinct tag. There are plenty of posts with no artist tag at all because the artist doesn't have an ASCII alias, or the uploader doesn't know it, or whatever.
  • The posts become distinguishable from each other because they have (different) artist tags (which is better than no artist tag at all).
  • At any point in the future it would be possible to convert the decoded Unicode entities back into the original characters. I don't know if that would be possible with romanization.

kalider said:
The userscript wouldn't just be a decoder but a whole codec which works both ways.
(intercept search and upload forms where Asian artists names are automatically decoded and encode them in the tag lists and so on.)

At that point, why bother with the codec? People who understand ドロマメ will be able to read and write it better than _u30C930ED30DE30E1, while people who don't would need to copy-paste it either way. Same goes for any Chinese, Korean, or other text. So leaving it as the former is better for some people and no worse for others, while the latter will be a problem for everyone to type out and remember. And additionally, in the former case it gives users an opportunity to learn (as someone who's self-learning Japanese, having the original text is nice to work with).

As it is, this site doesn't have a functional issue with non-ASCII tags. The restriction is purely a pragmatic decision; the admins want the tags to be easy to read and write for English speakers. What some of us are asking for is to allow non-ASCII characters when an ASCII tag isn't possible (for example, because someone doesn't know how to romanize a new artist or character name) until someone can translate it and make appropriate aliases; such artists/characters will otherwise go untagged, making it harder to tag them properly later.

watsit said:
So leaving it as the former is better for some people and no worse for others, while the latter will be a problem for everyone to type out and remember. And additionally, in the former case it gives users an opportunity to learn (as someone who's self-learning Japanese, having the original text is nice to work with).

How is it better when no searchable and unique tag exists at all? In my example the OP added the artist's name in the description, but that's not always the case.
IMHO an unreadable but more or less correct tag is still better than no tag at all.

watsit said:
As it is, this site doesn't have a functional issue with non-ASCII tags. The restriction is purely a pragmatic decision; the admins want the tags to be easy to read and write for English speakers. What some of us are asking for is to allow non-ASCII characters when an ASCII tag isn't possible (for example, because someone doesn't know how to romanize a new artist or character name) until someone can translate it and make appropriate aliases; such artists/characters will otherwise go untagged, making it harder to tag them properly later.

Meanwhile, complicated Unicode entities could bridge the time between posting images and someone translating or romanizing the artist's name.

But okay. This is kinda a complicated solution for a seemingly unnecessary problem (I see that now).

Maybe add warnings to the upload form when someone enters non-ASCII text (with directions on how and where to romanize properly)?!

kalider said:
Meanwhile, complicated Unicode entities could bridge the time between posting images and someone translating or romanizing the artist's name.

As I said, though, the issue is a pragmatic one. The admins want the tags to be easy for English speakers to read and write, so _u30C930ED30DE30E1 would be just as "bad" as ドロマメ by site rules. If someone were to tag _u30C930ED30DE30E1, it would be removed as tagging abuse just as if someone managed to create a new non-ASCII tag. If the admins were to allow _u30C930ED30DE30E1 as a temporary tag until proper romanization, they could just as well allow ドロマメ temporarily too, with the latter being a better option since people who understand the language can more readily see and translate it (it's also shorter, less likely to cause page layout problems).

watsit said:
As I said, though, the issue is a pragmatic one. The admins want the tags to be easy for English speakers to read and write, so _u30C930ED30DE30E1 would be just as "bad" as ドロマメ by site rules. If someone were to tag _u30C930ED30DE30E1, it would be removed as tagging abuse just as if someone managed to create a new non-ASCII tag. If the admins were to allow _u30C930ED30DE30E1 as a temporary tag until proper romanization, they could just as well allow ドロマメ temporarily too, with the latter being a better option since people who understand the language can more readily see and translate it (it's also shorter, less likely to cause page layout problems).

I'm going to chime in to confirm this. There are no technical limitations involved. The site happily supported Unicode tags for many years, but they became a source of administrative overhead and search problems: normalizing them posed issues, and they were fairly opaque and hard to deal with for anyone who couldn't natively type them. Some aliases existed, but they were few and far between. So the hope was to follow in the footsteps of Danbooru and try to encourage users to romanize artist names. Whether this was successful or just causes more problems is debatable.

Thus, no amount of re-encoding or transformation solves this issue; it just duplicates the problems that already existed and that the policy was hoping to solve.
