Topic: Feature: Simple bad tag rejection

Posted under Site Bug Reports & Feature Requests

This topic has been locked.

Requested feature overview description.

Executive Summary
Simple exclusion of obviously bad tags that do not otherwise violate policy.

Why would it be useful?

Bad tags are clutter, and misspelled tags make posts needlessly hard to find. Tags with many meanings can also bloat implication pools, which makes searching for other things less effective.

What part(s) of the site page(s) are affected?
Search, tagging.

==Long description==

Down with tag poison. A while back I took the liberty of borrowing some metadata
for my personal collection, and it seems there isn't a very stringent tagging policy
apart from TWYS and "use underscores".

I've taken the liberty of assembling a tag zoo below to show some examples of truly bad tags.
I'm afraid I'm a LAMP developer, not a Ruby developer, so I'll just write some pseudocode
that could fix the problem. Here is a rough sketch in Python.

#####Code begins

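import re

# Tiny placeholder tables; the real ones would be lookups in memcached or the db.
EMOTICONS = {"^_^", ":3", ":<", "<3", ">:)"}
ICONIC_NUMBERS = {"007"}
ARTISTS = set()
ENGLISH_WORDS = {"awesome", "face", "blue", "pack", "expressions", "meme"}
COMPOUND_WORDS = {"applejack"}
SUSPECT = "suspect"  # returned for tags that deserve a warning, not a rejection

def isNum(t):
    return t.isdecimal()

def isArtist(t):
    return t in ARTISTS

def beginsWithArtistPrefix(t):
    # assumes artist tags are entered as "artist:name"
    return t.startswith("artist:")

def containsEmoticon(t):
    return any(e in t for e in EMOTICONS)

def isIconicNumber(t):
    return t in ICONIC_NUMBERS

def isRecentishYear(t):
    return 1400 < int(t) < 3000

def splitWords(t):
    # "awesome_face" and "AwesomeFace" both split into ["awesome", "face"]
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", t)]

def areWordsOrNumbers(parts):
    return bool(parts) and all(p.isdecimal() or p.lower() in ENGLISH_WORDS for p in parts)

def englishWordsConstituents(t):
    return areWordsOrNumbers(splitWords(t))

def isCompoundWord(t):
    return t.lower() in COMPOUND_WORDS

def isFakeCompoundWord(t):
    return not isCompoundWord(t) and englishWordsConstituents(t)

def corrected(t):
    return "_".join(splitWords(t))

def hasRepeatedSubstring(t):
    # most back-to-back repeats of any chunk of 2 or more characters
    best = 1
    for m in re.finditer(r"(..+?)\1+", t):
        best = max(best, len(m.group(0)) // len(m.group(1)))
    return best

def sanitizeTag(t):
    if isArtist(t) and not beginsWithArtistPrefix(t):
        # could be an OC, but warn that an artist tag was given as a regular tag
        return SUSPECT

    if containsEmoticon(t):
        # reject
        return False

    if isNum(t) and not isIconicNumber(t) and not isRecentishYear(t):
        # not 007 or 2016, reject
        return False

    if "-" in t and areWordsOrNumbers(t.split("-")):
        # warn: suspiciously like it should use underscores
        return SUSPECT

    if re.search(r"([a-z])\1{2,}", t, re.IGNORECASE):
        # the same letter 3 or more times in a row is probably not a real word
        #"CANONBALLLLLL"
        #"Creeeeeeeeeeeedit"
        #"Biiiiiiig"
        return False

    if hasRepeatedSubstring(t) > 2:
        # murmur is a word but murmurmur is not
        #"eheheheehehehhehheheheeheheehhhe"
        #"burrburrburrburrburrburr"
        # the letter r is 50% of the word burr
        return False

    if isFakeCompoundWord(t):
        #"awesome_face"
        #"AwesomeFace"
        return corrected(t)

    if not englishWordsConstituents(t) and not isCompoundWord(t):
        #"bleu"
        #"blue"
        # bleu and blue are close, but one is French for blue,
        # so the non-English spelling is rejected
        return False

    # nothing objectionable found, keep the tag as is
    return t
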



#####Begin tag Zoo #####
##Use ctrl-f to search for # or use markdown, idk

#Garbage emoticon tags, use words instead
"^_^"
"^_^'"
"_too"
"--"

#are these suffixes?
"-color-me-series"
"-esque"
"-insert"

#Some usernames have dashes, deal with it
"-kronexfire-"
"-like"
"-Sama"
"-WS-"

#more garbage emoticons
":"
":<"
":3"
":c"
":d"
":o"
":q"
":v"
"!"
"?"
"?!"
"..."
"+"
"</3"
"<3"
"<3_eyes"
">:)"
">:3"
">:d"
">o<"
"$"

#some numbers are iconic
"007"

"01"
"02"
"082"
"09tuf"

#Numbers substituted for letters, but differently in the two instances,
#and both words technically contain English, so argh.
"0rang3"
"0range"
"0rcawolf"

"1"
"1_eye"
"1-1"
"1-up"
"10"
"100"
"1000"
"10000"
"100000"
"10ft"

"15000"
"150mm"
"160mm"
"150th"

#non-descriptive numbers
"10k"
"11"
"12"
"123"
"12345"
"13"
"14"
"1407"
"15"

"16"
"16:10"
"16:9"
"17"

"1k"
"1st"
"2"

"20"
"200"
"2000"
"20000"

#years are ok
"2002"
"2003"
"2004"
"2005"
"2006"
"2007"
"2008"
"2009"
"2010"
"2011"
"2012"
"2013"
"2014"
"2015"
"2016"

"2020"
"2020ad"
"2099"
"20k"
"20s"
"21"
"22"
"22111"
"25"

#Using - where _ should be used
"25-expressions-meme"
"8-pack"

"25000"
"26"
"27"
"28"

"30"
"300"
"30th"
"34"
"35"
"38"
"3am"
"3d"
"3d_(artwork)"
"3D_Coat"
"3DDotNikkiWolf"
"3dinoz"
"3ds"
"3rds"
"3year"
"4"

"4:3"
"40"
"4000"
"40K"
"41"
"420"
"4k"
"4th"
"5"

#Ok number prefixes
"5_fingers"
"5_headed_dragon"
"5_toes"

#Non-descriptive numbers
"50"
"500"
"5000"
"50s"
"50th"
"55"
"568"
"57"
"59"
"5k"
"5th"
"6"
"6_toes"
"62"
"626"
"63"
"64"
"69"
"6pklion"
"7"
"7_toes"
"7/11"
"70"
"70s"
"7espada"
"8"
"8_ball"

"80"
"805"
"80s"
"88"
"8C"
"8D"
"9"
"9-puzzle"
"90s"
"92"
"a"

#ahhhh
"Aha"
"ahahahah"
"ahahahaha"
"ahead"
"ahegao"
"AHH"
"ahhh"
"ahhhh"
"ahhhhh"

"applejack"
"applejack_(mlp)"

#MOJIBAKE
"â™ "
"♦"
"♣"
"♀"

"aw"

"awesome_face"
"Awesomeface"

#exclamation
"aww"
"Awww"
"AWWWW"
"Awwwwww"

#using - where it should be _
"b"
"b-bit"
"b-day"
"b-movie"
"ba"
"ba-bada-bap-"
"ba-da"
"Baaawwww"

#Stretched words
"Bigggggg"
"Biiiiiiiiig"

#threesome? probably not an actual sandwich
"bisexual_sandwich"

#French
"bleu"
"blue"

#mojibake, this might just be the fault of the API
"blue_(pokémon)"

"brrrr"

#these are treated differently, this might just be an artifact
"bdsm"
"BSDM"

"burrburrburrburrburrburr"

#mojibake
"狼狼君(潜水中)"

"CANONBALLLLLL"
"Creeeeeeeeeeeedit"
"eheheheehehehhehheheheeheheehhhe"

#........

"zzz"
"zzzzzzzzzzz"

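On the mojibake entries in the zoo: as noted there, some of these may be an encoding artifact of the API or of my import rather than genuinely bad tags. Here is a minimal sketch of how one could tell the difference, assuming the garbling came from UTF-8 bytes being decoded as Windows-1252. That is one common cause, not something confirmed for this site, and the function names `looks_like_mojibake` and `repair_mojibake` are made up for the example.

```python
def repair_mojibake(tag):
    """Try to undo a UTF-8-read-as-Windows-1252 mix-up; return tag unchanged on failure."""
    try:
        # Turn the characters back into the bytes they were misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return tag.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # the tag was probably not garbled in this particular way.
        return tag


def looks_like_mojibake(tag):
    """True if the round trip above produces a different, valid string."""
    return repair_mojibake(tag) != tag


# Reproduce the damage on purpose: the spade suit read with the wrong codec.
garbled = "\u2660".encode("utf-8").decode("cp1252")

for tag in [garbled, "blue_(pok\u00e9mon)", "blue"]:
    print(repr(tag), looks_like_mojibake(tag), repr(repair_mojibake(tag)))
```

Legitimate non-ASCII tags such as "blue_(pokémon)" or a CJK artist name fail the round trip and are left alone, so only tags that decode cleanly into something different get flagged. A check like this would belong on the import side, before any tag is judged good or bad.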
Updated by NotMeNotYou

hsauq said:
What is this really accomplishing that aliasing to invalid_tag or a synonymous/correctly spelled tag wouldn't, beyond saving users the trouble of manually removing invalid_tag and incorrect aliases from a post?

To be frank, I prefer invalid_tag, because you can use the (linked) wiki to find both the tag itself and all the other tags aliased to it. That way it serves a double purpose: it prevents bad tags, and it lets people look up which bad tags have been aliased away.

So, -1 from me. Instead, just make proactive use of invalid_tag alias requests, or leave comments explaining why certain tags are bad or not useful (as I did with an alternate artist name that had only been tagged once).
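
As a rough illustration of the alias approach (a sketch only, not how the site actually stores or applies aliases; the table contents and the names `ALIASES`, `resolve_tags`, and `aliased_to_invalid` are made up for the example):

```python
INVALID = "invalid_tag"

# Hypothetical alias table; real entries would come from approved alias requests.
ALIASES = {
    "bleu": "blue",
    "awesomeface": "awesome_face",
    "ahhhh": INVALID,
    "^_^": INVALID,
}


def resolve_tags(tags):
    """Replace each tag with its alias target, dropping duplicates but keeping order."""
    resolved = []
    for tag in tags:
        target = ALIASES.get(tag.lower(), tag)
        if target not in resolved:
            resolved.append(target)
    return resolved


def aliased_to_invalid():
    """List every tag that has been aliased away to invalid_tag."""
    return sorted(tag for tag, target in ALIASES.items() if target == INVALID)


print(resolve_tags(["bleu", "blue", "ahhhh", "wolf"]))
print(aliased_to_invalid())
```

The second function is the "double purpose" part: the same table that cleans up a post also gives a browsable list of everything that was judged invalid.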

Updated by anonymous

Genjar

Former Staff

Maybe we should have an approval process for new tags, or at least for new general tags. Because nowadays nearly all new ones are either typos or redundant, and too many users ignore the 'discuss it first' part of the tag creation process.

Then again, someone would need to actually check those...

Updated by anonymous

I'd have to agree that adding some mechanized scanning of tags could be useful, though I think that automating it might have a few unintended consequences. I would suggest that an error message or banner could appear in these cases when submitting or editing tags on an image. That way, it's not forcing an alias or automatic invalidation of a tag, but the tagger could be immediately aware that they might be incorrectly tagging something.
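
A minimal sketch of that warn-but-don't-block idea, assuming three simple checks (stretched characters, hyphens, and bare numbers outside the 1400 to 3000 year range used in the opening post). The function name `tag_warnings` and the message wording are hypothetical, and nothing here reflects how the site's tag editor actually works.

```python
import re


def tag_warnings(tags):
    """Return warning messages for suspicious tags without changing the tags."""
    warnings = []
    for tag in tags:
        if re.search(r"(.)\1{2,}", tag):
            warnings.append(f"{tag}: same character 3+ times in a row, possibly a stretched word")
        if "-" in tag:
            warnings.append(f"{tag}: contains '-', did you mean '_'?")
        if tag.isdecimal() and not 1400 < int(tag) < 3000:
            warnings.append(f"{tag}: bare number that does not look like a year")
    return warnings


submitted = ["wolf", "Biiiiiiig", "8-pack", "2016", "12"]
for message in tag_warnings(submitted):
    print(message)
```

Because the function only returns messages, the tagger can still submit the tags unchanged; the banner just gives them a chance to reconsider.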

Updated by anonymous

Please don't necro two-year-old threads that are dead for all intents and purposes.

Updated by anonymous
