Topic: [Development] I need some help...

Posted under e621 Tools and Applications

Okay, so I've created an e621 downloader script that downloads every post matching the tags you search for, as listed in the index.json file. I tried to add an off-site blacklist feature that would refuse to download a picture if it contained a certain tag, and... well, it's not working.

I've been beating my head against my wall so much that I have a roaring headache, and Advil isn't helping.

If anyone has any experience with Bash programming, could you please take a look at my code? I know it's a terrible mess of variables, seds and whatnot, but I was going to clean it up later. For now, I would just like to get this blacklist feature working.

Here's a link to the code: http://pastebin.com/Gy4fxPSJ

Updated by Htess

Added the below, but now it's blacklisting EVERYTHING:

checkk="$(cat $taggss)"
...
echo $line | cut -c 37- | cut -c -32 > $check
    wget -O - 'https://e621.net/post/index.json?tags=md5:'$(cat $check) | jq --raw-output '.[].tags' - > $taggss
   if [[ $checkk == $blacklist ]]
   then
...

Updated by anonymous

Bash is not going to be your friend here, because you have to call external commands to test these things, which means you're limited to what grep/[ can do versus actually doing some kind of dictionary/array lookup. You could, in this case, use grep and convert your list of blacklist items into a regular expression, but that's going to be a real pain, and testing each item individually with a split and a for loop will be slow because it spawns lots of processes.

Updated by anonymous

Kiranoot: Grep actually supports this exact type of case, see my comments about fgrep and --line-regexp below.

Not to mention that bash does support dictionary lookup (associative arrays). It wouldn't be too hard to change the code to do that instead, but fgrep may still be faster.
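For what it's worth, a minimal sketch of what the associative-array version of the membership test could look like (bash 4+; the tag names here are just examples, not anyone's actual blacklist):

```shell
#!/bin/bash
# Sketch: tag membership test via a bash associative array (bash 4+).
declare -A blacklist
for tag in breasts diaper vore; do      # example blacklist entries
    blacklist[$tag]=1
done

post_tags="cute dragon vore solo"       # example tag list for one post
skip=0
for tag in $post_tags; do
    if [[ -n "${blacklist[$tag]}" ]]; then
        skip=1
        break
    fi
done
echo "skip=$skip"                       # vore is blacklisted, so skip=1
```

The lookup itself is O(1) per tag and stays inside the shell, so no extra processes are spawned per check.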

If Faux-Pa wanted to implement full interpretation of blacklisting, then yeah, using a "proper programming language", like Python, would be more advisable.

Faux-Pa:

That's slightly surprising; I would have expected it to blacklist nothing instead. Anyway, here's a run-through/rewrite with comments:

#!/bin/bash
# the above (no space after !) is convention.

# needs descriptive name, but is functionally correct
taggs="$(zenity --entry)";

# you don't need to quote any of these, but it's a good habit to quote stuff, so nvm.
# I removed a semicolon since that is literally equivalent to just adding a blank line.

# If you are on linux it's usually better to throw these in /tmp/, since /tmp/ is typically an
# in-memory filesystem (tmpfs), not actual hard drive storage.

tags="./tags.tmp"
op="./id-output.tmp"
parse="./parsed.tmp"
aug="./aug.tmp"
check="./check.tmp"
taggss="./taggss.tmp"
blacklistfile="./blacklist.tmp"

# I'm not sure what you meant here by "${foo bar baz}"
#
# But I know arrays will work here so I changed it to an array.
# I single-quoted a few things (that contained parentheses) because that's necessary
# Also demonstrating \[end of line] for continuing things over multiple lines
# 
blacklist=(areola blithedragon breasts camel_toe captainchaos\
 cervix christmas cloaca crossgender cum_in_pussy daughter diaper\
 feces female 'fizz_(lol)' flat_chested fluffbugveyll girly hyper_balls\
 hyper_muscles hyper_penis intersex kazecat leggings loli 'luka_(artist)'\
 makeup morbidly_obese mother mukomizu my_little_pony panties pet pussy\
 pussy_juice rating:s sagorashi sister skirt slave teats tokifuji tribadism\
 unbirthing urethral_penetration urine uterus vaginal vore zerolativity 'zorro_re_(artist)')

# So you typed in some tags before, you are squeezing the repeated spaces and changing them to +.
# Fair enough, but a) don't trust echo, and b) always quote expansions if they may contain spaces, $, # or other shell-special characters.
#
# you could also store the whole result in the variable tags, rather than in a file, using command substitution as you do for taggs at the top of the file
printf '%s' "$taggs" | tr -s ' ' | tr ' ' '+' > $tags

# I'll just pretend we did that to begin with:

tags="$(< "$tags")"

# .. and then this can become cleaner:
#
# NOTE: both your existing code and my alteration are vulnerable to the same problem, which is that some characters may need %-escaping in
# order for e621 to interpret the query correctly. I would use python's urllib.parse.quote_plus to alter tags in such a way,
# as it's designed for exactly this kind of query-string fragment.
#
# I changed the single quoting to double quoting so that the expansion would work

wget -O index.json "https://e621.net/post/index.json?tags=$tags"

# I guess this is just a nicer formatting thing?
# you might instead want to specify -O - for wget so you don't need a temp file and just pipe straight into jq

echo ""
cat ./index.json | jq --raw-output '.[].file_url' - > $op

printf '%s\n' "${blacklist[@]}" > $blacklistfile

while read -r line
  do
    # using bash prefix and suffix removal here
    # first remove all the path components
    md5="${line##*/}"
    # and then the extension
    # (correct me if that is not what the original line was supposed to do)
    md5="${md5%%.*}"
    # I think Tony alluded to the fact that the post/view api supports lookup by md5 and it's more efficient than post/index,
    # you may want to consider it.
    wget -O - "https://e621.net/post/index.json?tags=md5:$md5" | jq --raw-output '.[].tags' - > $taggss

   # $(cat taggss) == blacklist does not do what you want (search for == in `man bash`; note that it says *a* pattern);
   # you probably want fgrep.
   # BTW, if you piped jq's output through `tr ' ' \\n` and used the '--line-regexp' option for fgrep, you could get a perfectly exact result
   # (rather than "false" positives when a blacklisted tag is a substring of one of the post's tags)
   
   if fgrep -f "$blacklistfile" < $taggss > /dev/null;
   then
    echo "This post contained tags that you have blacklisted. Skipping..."
   else
    echo "";
    echo "Downloading $line"
    wget "$line"
   fi
  done < $op
 
echo "Task complete"

In my fairly basic tests, the above code works (e.g. searching hyper_breasts, everything is blacklisted, whereas cute -> only some blacklisted).
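To spell out the fgrep/--line-regexp point from the comments as a standalone sketch (a canned tag string stands in for jq's output; 'breast' and 'vore' are example blacklist entries, not from anyone's real list):

```shell
#!/bin/bash
# Sketch: exact tag matching with fgrep --line-regexp.
# 'breast' is a substring of 'hyper_breasts', so a plain fgrep would
# (wrongly) match it; --line-regexp requires the whole line to match.
printf '%s\n' breast vore > blacklist.tmp
post_tags="hyper_breasts cute dragon"    # stand-in for jq's .tags output

if echo "$post_tags" | tr ' ' '\n' | fgrep --line-regexp -q -f blacklist.tmp; then
    verdict=blacklisted
else
    verdict=ok
fi
echo "$verdict"                          # ok: no tag matches exactly
rm -f blacklist.tmp
```

Drop the --line-regexp and the same input would come back "blacklisted", which is exactly the false-positive case the comment warns about.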

Updated by anonymous

savageorange said:
corrections and comments

Oh my god... Thank you so much! I've been trying to fix this all day, and you whip out the solution like it was no problem at all.

I'm going to read over your comments and see if I can clean up the code a bit, but it seems to be working perfectly. Thank you thank you thank you thank you.

Updated by anonymous

No worries, happy to help. Bash works well for this task, but it's a little cryptic/quirky. If you are planning to expand it in the future, it might be worth porting it to something more straightforward, like Python with the subprocess module.

Btw, a few comments got mangled: it's supposed to say \\[end of line], but e621 ate the backslash. Similarly, the comment about tr should have two backslashes before the n, not one.

Updated by anonymous

savageorange said:
No worries, happy to help. Bash works well for this task, but it's a little cryptic/quirky. If you are planning to expand it in the future, it might be worth porting it to something more straightforward, like Python with the subprocess module.

Btw, a few comments got mangled: it's supposed to say \\[end of line], but e621 ate the backslash. Similarly, the comment about tr should have two backslashes before the n, not one.

Don't worry, I got it. The follow-up to this will be the cleaned-up version of the code. I removed your comments, but only for cleanliness's sake. Again, thank you so much.

Updated by anonymous

savageorange said:
Kiranoot: Grep actually supports this exact type of case, see my comments about fgrep and --line-regexp below.

Not to mention that bash does support dictionary lookup (associative arrays). It wouldn't be too hard to change the code to do that instead, but fgrep may still be faster.

I know it IS possible through grep/[ but that it isn't ideal, and you're going to have to fight with it significantly more than if it were done some other way. I'm just inherently biased against bash scripts for processing data; most often they just add complexity and obfuscation to the task by forcing you to massage the data for the next command.

Thank you for taking the time to actually explain how to do it using those tools. I'm a grumpy person sometimes.

Updated by anonymous

KiraNoot said:
I know it IS possible through grep/[ but that it isn't ideal, and you're going to have to fight with it significantly more than if it were done some other way.

I disagree. Any examination of the manpage should reveal it is perfectly straightforward. If fgrep is not perfectly satisfactory, then use comm: set membership tests are the entire purpose of comm.

I won't dispute that bash can be more complex and obfuscated in general; it's like Perl in a few ways. If I were writing this script from scratch, I would probably use Python with Requests[1]. Or I might mix Python into the middle of a bash script; no reason why not.

IMO that is part of why bash can be cryptic: there are usually hundreds of ways to do the same thing, and not all of them are equally obvious in meaning (e.g. 'comm' is a contraction of 'common', but that is not obvious, though by comparison to grep it's dead obvious ;))
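A minimal comm sketch, since the name alone doesn't give it away (both inputs must be sorted; the tags are just examples):

```shell
#!/bin/bash
# Sketch: blacklist/post-tag intersection with comm (inputs must be sorted).
printf '%s\n' breasts diaper vore | sort > black.tmp
printf '%s\n' cute dragon vore    | sort > post.tmp
# -1 and -2 suppress lines unique to each file, leaving the intersection
hits="$(comm -12 black.tmp post.tmp)"
echo "$hits"    # vore
rm -f black.tmp post.tmp
```

If $hits is non-empty, the post shares at least one tag with the blacklist.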

[1] I wrote it up, just for curiosity value:

#!/usr/bin/env python3

import sys
import json, subprocess
from requests import get
terms = ' '.join(sys.argv[1:])

with open('./blacklist','r') as f:
    blacklist = set(v.strip() for v in f.read().splitlines())

# params= (not data=) puts the tags in the query string, where a GET expects them
for record in json.loads(get('https://e621.net/post/index.json', params={'tags': terms}).text):
    url = record['file_url']
    tags = record['tags'].split(' ')
    if blacklist.intersection(tags):
        print("Blacklisted")
    else:
        subprocess.run(['wget', url])

Requires 'blacklist' file (one tag per line) in CWD.

Updated by anonymous

savageorange said:
I disagree. Any examination of the manpage should reveal it is perfectly straightforward. If fgrep is not perfectly satisfactory, then use comm: set membership tests are the entire purpose of comm.

I won't dispute that bash can be more complex and obfuscated in general; it's like Perl in a few ways. If I were writing this script from scratch, I would probably use Python with Requests[1]. Or I might mix Python into the middle of a bash script; no reason why not.

IMO that is part of why bash can be cryptic: there are usually hundreds of ways to do the same thing, and not all of them are equally obvious in meaning (e.g. 'comm' is a contraction of 'common', but that is not obvious, though by comparison to grep it's dead obvious ;))

[1] I wrote it up, just for curiosity value:

#!/usr/bin/env python3

import sys
import json, subprocess
from requests import get
terms = ' '.join(sys.argv[1:])

with open('./blacklist','r') as f:
    blacklist = set(v.strip() for v in f.read().splitlines())

# params= (not data=) puts the tags in the query string, where a GET expects them
for record in json.loads(get('https://e621.net/post/index.json', params={'tags': terms}).text):
    url = record['file_url']
    tags = record['tags'].split(' ')
    if blacklist.intersection(tags):
        print("Blacklisted")
    else:
        subprocess.run(['wget', url])

Requires 'blacklist' file (one tag per line) in CWD.

Python is a scary programming language. Most people say it's simple to learn, but it looks as difficult as Java. That's why I'm not an Android app developer (even though I really want to be).

Updated by anonymous

Faux-Pa said:
Python is a scary programming language. Most people say it's simple to learn, but it looks as difficult as Java. That's why I'm not an Android app developer (even though I really want to be).

Dunno how seriously to take that, given that you can write Bash scripts, which I would consider way harder. The main differences between the above and your script are a) I factored out needless things (for example, there is no need to make two requests to e621; the first request already includes the tag info), and b) I'm using very specific APIs here that are very closely matched to the task, which often isn't possible in bash (for example, requests.get handles HTTP requests, and so it takes care of all the details of making an HTTP request, like properly escaping the search terms; similarly, json.loads parses the JSON itself, rather than using a tool like jq to extract one field).

There was also a bit of taste involved (I prefer to pass arguments on the command line rather than popping up a dialog). And a bit of haste ;) (that's why some things that could be written as more steps, e.g.

lines = f.read().splitlines()
blacklist = set(l.strip() for l in lines)

are written as one line).

I would still bet that your bash script could be pared down to be even more compact than this Python script.
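E.g. a sketch of what that pared-down version might look like (assumes jq and a ./blacklist file with one tag per line, and the same index.json endpoint as above; the query still isn't %-escaped, same caveat as before, and it only touches the network when given search terms):

```shell
#!/bin/bash
# Compact sketch: same idea as the thread's script, one request total.
# Assumes jq and a ./blacklist file (one tag per line). No %-escaping
# of the query -- the same caveat noted earlier in the thread applies.
make_query() { printf '%s' "$*" | tr -s ' ' | tr ' ' '+'; }

if [ "$#" -gt 0 ]; then
    wget -qO - "https://e621.net/post/index.json?tags=$(make_query "$@")" |
    jq --raw-output '.[] | .file_url + " " + .tags' |
    while read -r url posttags; do
        # grep -Fxq == fgrep --line-regexp, quietly
        if echo "$posttags" | tr ' ' '\n' | grep -Fxq -f ./blacklist; then
            echo "Skipping $url (blacklisted)"
        else
            wget "$url"
        fi
    done
fi
```

The single jq call emits "url tags..." lines, so one read -r splits each record without any temp files.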

Updated by anonymous

Wow. I can't say I'm any good at offering support myself, but you are all very good at this. Nice job, everyone!

Updated by anonymous
