Topic: Feature: Simple bad tag rejection

Requested feature overview description.

Executive Summary
Simple exclusion of obviously bad tags that do not otherwise violate policy.

Why would it be useful?

Bad tags are just clutter, and tags with spelling errors make things needlessly hard to find. Tags with many meanings could expand implication pools and make searches for other things less effective.

What part(s) of the site page(s) are affected?
Search, tagging.

==Long description==

Down with tag poison. A while back I took the liberty of borrowing some metadata
for my personal collection, but it seems there isn't a very stringent policy on tagging
apart from TWYS and using underscores.

I've taken the liberty of assembling a tag zoo below to show some examples of truly bad tags.
I'm afraid I'm a LAMP developer, not a Ruby developer, so I'll just write some pseudocode
that could fix the problem. Here is some nearly compilable Python.

#####Code begins

import re
import warnings

SUSPECT = "suspect"  # allowed, but flagged for review

# isArtist, beginsWithArtistPrefix, areWordsOrNumbers, hasRepeatedSubstring,
# isCompoundWord, englishWordsConstituents and corrected are not defined here;
# they stand for lookups against the tag database or a dictionary.

def containsEmoticon(t):
    # lookup table in memcached or db
    raise NotImplementedError

def isIconicNumber(t):
    # lookup table in memcached or db
    raise NotImplementedError

def isRecentishYear(t):
    return 1400 < int(t) < 3000

def isFakeCompoundWord(t):
    return not isCompoundWord(t) and englishWordsConstituents(t)

def sanitizeTag(t):
    if isArtist(t) and not beginsWithArtistPrefix(t):
        # could be OC but warn that tag for artist provided as regular tag
        warnings.warn("artist name used as a regular tag: " + t)

    if containsEmoticon(t):
        # reject
        return False

    if t.isdigit() and not isIconicNumber(t) and not isRecentishYear(t):
        # not 007 or 2016, reject
        return False

    if "-" in t and areWordsOrNumbers(t.split("-")):
        # warn, suspiciously like it should use underscore
        return SUSPECT

    if re.search(r"([A-Za-z])\1{2,}", t):
        # the same letter 3 or more times in a row is probably not a real word
        # "CANONBALLLLLL"
        # "Creeeeeeeeeeeedit"
        # "Biiiiiiig"
        return False

    if hasRepeatedSubstring(t) > 2:
        # murmur is a word but murmurmur is not
        # "eheheheehehehhehheheheeheheehhhe"
        # "burrburrburrburrburrburr"
        # the letter r is 50% of the word burr
        return False

    if isFakeCompoundWord(t):
        # "awesome_face"
        # "AwesomeFace"
        return corrected(t)

    if not englishWordsConstituents(t):
        # "bleu"
        # "blue"
        # Whoops, bleu and blue are close, but one is French for blue.
        # Assuming the intent is to reject tags that are not English words.
        return False

    return True
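
The two repetition checks in the pseudocode (stretched letters and repeated substrings) can be sketched with plain regular expressions. This is only a rough sketch: the function names and the thresholds (three letters in a row, a chunk of two or more characters repeated three times) are my assumptions, not anything the site uses.

```python
import re

# Three or more of the same letter in a row, e.g. "Biiiiiiig".
STRETCHED = re.compile(r"([A-Za-z])\1{2,}")

# A chunk of two or more characters occurring at least three times
# back to back, e.g. "murmurmur" or "burrburrburr".
REPEATED_CHUNK = re.compile(r"(.{2,}?)\1{2,}")

def has_stretched_letters(tag):
    """True if any letter appears three or more times consecutively."""
    return STRETCHED.search(tag) is not None

def has_repeated_chunk(tag):
    """True if a multi-character chunk repeats three or more times in a row."""
    return REPEATED_CHUNK.search(tag.lower()) is not None

if __name__ == "__main__":
    for tag in ["Biiiiiiig", "murmur", "murmurmur", "burrburrburrburrburrburr", "blue"]:
        print(tag, has_stretched_letters(tag), has_repeated_chunk(tag))
```

Real words with a doubled letter ("applejack") or a doubled syllable ("murmur") pass both checks, which is the distinction the comments in the pseudocode are after.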



#####Begin tag Zoo #####
##Use ctrl-f to search for # or use markdown idk

#Garbage emoticon tags, use words instead
"^_^"
"^_^'"
"_too"
"--"

#are these suffixes?
"-color-me-series"
"-esque"
"-insert"

#Some usernames have dashes, deal with it
"-kronexfire-"
"-like"
"-Sama"
"-WS-"

#more garbage emoticons
":"
":<"
":3"
":c"
":d"
":o"
":q"
":v"
"!"
"?"
"?!"
"..."
"+"
"</3"
"<3"
"<3_eyes"
">:)"
">:3"
">:d"
">o<"
"$"

#some numbers are iconic
"007"

"01"
"02"
"082"
"09tuf"

#Numbers substituted for letters but differently in two instances
#but both words technically contain English, so argh.
"0rang3"
"0range"
"0rcawolf"

"1"
"1_eye"
"1-1"
"1-up"
"10"
"100"
"1000"
"10000"
"100000"
"10ft"

"15000"
"150mm"
"160mm"
"150th"

#non-descriptive numbers
"10k"
"11"
"12"
"123"
"12345"
"13"
"14"
"1407"
"15"

"16"
"16:10"
"16:9"
"17"

"1k"
"1st"
"2"

"20"
"200"
"2000"
"20000"

#years are ok
"2002"
"2003"
"2004"
"2005"
"2006"
"2007"
"2008"
"2009"
"2010"
"2011"
"2012"
"2013"
"2014"
"2015"
"2016"

"2020"
"2020ad"
"2099"
"20k"
"20s"
"21"
"22"
"22111"
"25"

#Using - where _ should be used
"25-expressions-meme"
"8-pack"

"25000"
"26"
"27"
"28"

"30"
"300"
"30th"
"34"
"35"
"38"
"3am"
"3d"
"3d_(artwork)"
"3D_Coat"
"3DDotNikkiWolf"
"3dinoz"
"3ds"
"3rds"
"3year"
"4"

"4:3"
"40"
"4000"
"40K"
"41"
"420"
"4k"
"4th"
"5"

#Ok number prefixes
"5_fingers"
"5_headed_dragon"
"5_toes"

#Non-descriptive numbers
"50"
"500"
"5000"
"50s"
"50th"
"55"
"568"
"57"
"59"
"5k"
"5th"
"6"
"6_toes"
"62"
"626"
"63"
"64"
"69"
"6pklion"
"7"
"7_toes"
"7/11"
"70"
"70s"
"7espada"
"8"
"8_ball"

"80"
"805"
"80s"
"88"
"8C"
"8D"
"9"
"9-puzzle"
"90s"
"92"
"a"

#ahhhh
"Aha"
"ahahahah"
"ahahahaha"
"ahead"
"ahegao"
"AHH"
"ahhh"
"ahhhh"
"ahhhhh"

"applejack"
"applejack_(mlp)"

#MOJIBAKE
"â™ "
"♦"
"♣"
"♀"

"aw"

"awesome_face"
"Awesomeface"

#exclamation
"aww"
"Awww"
"AWWWW"
"Awwwwww"

#using - where it should be _
"b"
"b-bit"
"b-day"
"b-movie"
"ba"
"ba-bada-bap-"
"ba-da"
"Baaawwww"

#Stretched words
"Bigggggg"
"Biiiiiiiiig"

#threesome? probably not an actual sandwich
"bisexual_sandwich"

#French
"bleu"
"blue"

#mojibake, this might just be the fault of the API
"blue_(pokémon)"

"brrrr"

#these are treated differently, this might just be an artifact
"bdsm"
"BSDM"

"burrburrburrburrburrburr"

#mojibake
"狼狼君(潜水中)"

"CANONBALLLLLL"
"Creeeeeeeeeeeedit"
"eheheheehehehhehheheheeheheehhhe"

#........

"zzz"
"zzzzzzzzzzz"
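
For the mojibake entries in the zoo, one cheap test is to try undoing the most common mistake, UTF-8 bytes read as Windows-1252. The sketch below assumes that is how the garbling happened, which I can't confirm for the API. The function name is made up for illustration. Tags that are legitimately non-ASCII fail the round trip and come back unchanged.

```python
def try_repair_mojibake(tag):
    """Undo UTF-8-read-as-Windows-1252 garbling if possible, else return the tag as-is."""
    try:
        # Re-encode to the bytes the wrong decoder saw, then decode them as UTF-8.
        return tag.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # the tag was most likely fine to begin with.
        return tag

if __name__ == "__main__":
    garbled = "pok\u00c3\u00a9mon"   # "pokémon" after a bad decode
    print(try_repair_mojibake(garbled))
    print(try_repair_mojibake("blue_(pok\u00e9mon)"))  # already correct, left alone
```

A tag whose repaired form differs from the original could be flagged as SUSPECT rather than rejected outright, since the fix is a guess.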

Updated by NotMeNotYou