Topic: Creating Python e621 Downloader!

Posted under General

it's not the best practice to have spaces in filenames.

you're using the worst string formatting in Python. f-strings are the future, but even good old %-formatting is better.

i know it's not a program that has to be particularly fast, but you're concatenating strings very poorly. also the GET parameters should probably be added via urlencode
so for example

url = "https://e621.net/posts.json?tags=" # Generates first static part of e621.net api url
for x in tags: # Add all tags specified
    url += x
    url += "%20"
url += rating
url += "&limit={0}&callback=callback".format(postnum) # Add rest of url

should be

from urllib.parse import urlencode
# (...)
url = "https://e621.net/posts.json?" + urlencode({'tags': f"{' '.join(tags)} {rating}", 'limit': postnum})
# or 'tags': ' '.join(tags + [rating]) i guess
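
just to show what it actually produces (made-up example values):

from urllib.parse import urlencode

tags = ['wolf', 'solo']      # made-up example values
rating = 'rating:safe'
postnum = 10

url = "https://e621.net/posts.json?" + urlencode({'tags': ' '.join(tags + [rating]), 'limit': postnum})
print(url)
# https://e621.net/posts.json?tags=wolf+solo+rating%3Asafe&limit=10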

(what the hell is the callback=callback?)

later you're doing some ridiculous removing-urls-from-the-returned-string-and-then-recreating-the-list-of-urls-after-every-attempt just to leave the urls to base files? honestly, what the fuck? that's like a completely unnecessary O(n^2) right there

just
1. get a json object of the response
2. iterate over post entries
3. download the data from post['file']['url']
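
in full that's roughly this (an untested sketch, using requests; the tags, limit and user-agent are just example values):

import requests

headers = {'User-Agent': 'example-downloader/1.0 (replace with your own)'}   # example value
resp = requests.get('https://e621.net/posts.json',
                    params={'tags': 'wolf solo rating:safe', 'limit': 10},   # example values
                    headers=headers)
data = resp.json()                        # 1. json object of the response

for post in data['posts']:                # 2. iterate over post entries
    file_url = post['file']['url']
    if not file_url:                      # can be null, e.g. posts hidden by the default blacklist
        continue
    img = requests.get(file_url, headers=headers)
    with open(file_url.rsplit('/', 1)[-1], 'wb') as f:    # 3. download from post['file']['url']
        f.write(img.content)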

seriously, why do people overengineer simple shit?

shingen said:
it's not the best practice to have spaces in filenames. (...)

Just because you're better at programming doesn't mean you have to be an asshole about it.

shingen said:
it's not the best practice to have spaces in filenames. (...)

Ok, 1st, I don't like using urllib, that's why I'm using requests instead (I find it easier to use).
2nd: I just saw what some other website was using to contact the API and they had callback=callback so idk.
3rd: I attempted to use json but struggled a bit (I'm still not an expert at Python), so I decided to do it my way; it gets the job done.

bitwolfy said:
No, I'm pretty sure that's a requirement.

No, I'm pretty sure constructive criticism would be more useful. Not straight-up flaming me.

jezzar said:
Ok, 1st, I don't like using urllib, that's why I'm using requests instead (I find it easier to use).
2nd: I just saw what some other website was using to contact the API and they had callback=callback so idk.
3rd: I attempted to use json but struggled a bit (I'm still not an expert at Python), so I decided to do it my way; it gets the job done.

1: that's up to you. Just remember, if you never try new things, you'll never improve as a coder, and your code will be stuck at beginner level.

2: copy-pasting code is a great way to make parts of your code work, and a crappy way to improve as a programmer. Try to understand the code you are using, so that you're not stuck with other people's flaws and limitations.

3: once again, make sure that you're comfortable with your own code, just don't forget to learn new things :)

mynameisover20charac said:
1: that's up to you. Just remember, if you never try new things, you'll never improve as a coder, and your code will be stuck at beginner level.

2: copy-pasting code is a great way to make parts of your code work, and a crappy way to improve as a programmer. Try to understand the code you are using, so that you're not stuck with other people's flaws and limitations.

3: once again, make sure that you're comfortable with your own code, just don't forget to learn new things :)

Thanks for that, I kinda just wanted to get this finished quickly so I could use it. I'll probably make something better when I've learnt more. I also struggled to find any info on the API, so that was the only thing I really had.


jezzar said:
Ok, 1st, I don't like using urllib, that's why I'm using requests instead (I find it easier to use).
2nd: I just saw what some other website was using to contact the API and they had callback=callback so idk.
3rd: I attempted to use json but struggled a bit (I'm still not an expert at Python), so I decided to do it my way; it gets the job done.

1. i'm not telling you to use urllib in general, just the urlencode function or an equivalent. the space is not the only thing that needs to be escaped, so unless you're ready to write a function that handles nearly every possible character properly yourself, it's better to just use one that someone has already made. and using it is much more convenient than making sure you've added all the appropriate '&' between parameters manually.
it's a sign of a good programmer to know when you have a specific task that you have to code yourself, and when it's generic enough that "someone must've done it already".
maybe requests has something similar, check it out.
3. a json object, after "decoding" from a string, should be either a regular Python list or a regular Python dictionary (with more dicts/lists inside). if requests creates something more complicated that you can't use the same way, then maybe it's not the best choice.
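
for the record, requests can already do both: it builds and escapes the query string from a params dict, and .json() gives you back plain dicts/lists. roughly (untested, example values):

import requests

tags = ['wolf', 'solo']      # example values
rating = 'rating:safe'
postnum = 10

resp = requests.get('https://e621.net/posts.json',
                    params={'tags': ' '.join(tags + [rating]), 'limit': postnum},
                    headers={'User-Agent': 'example-downloader/1.0'})   # example user-agent

data = resp.json()           # a plain dict, with more dicts/lists inside
print(data['posts'][0]['file']['url'])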

jezzar said:
No, I'm pretty sure constructive criticism would be more useful. Not straight-up flaming me.

i wasn't flaming you, i told you that you're doing something sub-optimally, and what to use instead. the sooner you and everyone else get used to better methods, the better for everyone.
i'm sorry i'm not treating you like a retarded child, but like an adult, lol.
regarding the 'later' part, i'm honestly just amazed at what process took place to end up with such an algorithm.
even when you don't know how to do it with a json-decoded object, when you browse through the returned string the way you do, and you find some url, and you check if it's a sample or a preview, what's the purpose of removing it from the source? just ignore it, and go on. once you find a url to the actual full file just save it in some kind of a separate list, and once you finish just go over that list and download the files. (or you could download it once you find it, but that is just making the entire loop unnecessarily complicated)
focus on doing what you have to do, because doing additional tasks is usually just a source of wasted time and possible mistakes.
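
i.e. something like this (untested sketch; it assumes requests is imported and that data and headers already exist from the earlier request):

full_urls = []
for post in data['posts']:
    url = post['file']['url']        # the full file; samples/previews sit under their own keys
    if url:                          # can be None, e.g. posts hidden by the default blacklist
        full_urls.append(url)        # just collect it, don't remove or rebuild anything

for url in full_urls:                # then download them all at the end
    r = requests.get(url, headers=headers)
    with open(url.rsplit('/', 1)[-1], 'wb') as f:
        f.write(r.content)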

Pup


jezzar said:
Hello!
Since all downloaders are broken at the moment (Because of the API update) I've decided to start making my own!
I will post all updates to GitHub so stay tuned there to see how progress is going!
https://github.com/JezzaR-The-Proto/e621_Python_Downloader

Since you're using the requests library, I thought I'd share some code that might be a better way of doing it and make things a bit easier:

import requests
from requests.auth import HTTPBasicAuth
import time
import os

thisFolder = os.path.dirname(os.path.realpath(__file__))
apikeyFilePath = thisFolder + os.sep + 'apikey.txt'

postsURL='https://e621.net/posts.json'
searchString = 'fav:x'

returnedPostLimit = 320
lowestID = -1
stop = False

headers = {'User-Agent':'E6_Post_Downloader/1.0 (by Jezzar on E621)'}

params = {'tags':searchString}

rateLimit = 1
lastTimeValue = time.time()
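# sleeps, if needed, so that consecutive calls are at least rateLimit seconds apart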
def rateLimitThread():
    global lastTimeValue
    elapsedTime = time.time() - lastTimeValue
    if elapsedTime <= rateLimit:
        time.sleep(rateLimit-elapsedTime)
    lastTimeValue = time.time()


try:
    with open(apikeyFilePath) as apiFile:
        apiTxt = apiFile.read().splitlines()
except FileNotFoundError:
    with open(apikeyFilePath, 'a') as apiFile:
        apiFile.write("username=" + os.linesep + "api_key=")
    print("apikey.txt created - please add your username and api key")
    exit()

apiUsername = apiTxt[0].split('=')[1].strip()
apiKey = apiTxt[1].split('=')[1].strip()

session = requests.Session()

rateLimitThread()
postsResponse = session.get(postsURL, headers=headers, params=params, auth=HTTPBasicAuth(apiUsername, apiKey))
returnedJSON = postsResponse.json()

if postsResponse.status_code != 200:
    print (postsResponse.json())
    exit()

while stop == False:
    if len(returnedJSON['posts']) < returnedPostLimit:
        stop = True
        
    for post in postsResponse['posts']:
        # (code to download the images, including HTTPBasicAuth)
        
        if lowestID > post['id'] or lowestID == -1:
            lowestID = post['id']
    
    params = {'tags':searchString, 'page':'b' + lowestID}
    
    rateLimitThread()
    postsResponse = session.get(postsURL, headers=headers, params=params, auth=HTTPBasicAuth(apiUsername, apiKey))
    returnedJSON = postsResponse.json()
    
    if postsResponse.status_code != 200:
        print (postsResponse.json())
        exit()

Quick edit:
You need the BasicAuth thing as otherwise it'll error on any posts blocked by the default blacklist.
And using post['file']['url'] can be a lot simpler and easier than having to parse the returned JSON string manually.


shingen said:
1. i'm not telling you to use urllib in general, just the urlencode function or an equivalent. (...)

Sorry for thinking you were flaming me. It just seemed that you were quite annoyed at my primitive way of doing things.

pup said:
Since you're using the requests library, I thought I'd share some code that might be a better way of doing it and make things a bit easier:

What's the advantage of HTTPBasicAuth over just passing those keys as 'key=value' parameters (or via the 'headers' parameter to .get / .post)? Reliable escaping or what?

pup said:
Since you're using the requests library, I thought I'd share some code that might be a better way of doing it and make things a bit easier:

(...)

Thanks for this, I'll use parts to help but won't copy it exactly (wouldn't want to plagiarise!).
Also, for "this folder", would os.getcwd() do the same?


Pup


savageorange said:
What's the advantage of HTTPBasicAuth over just passing those keys as 'key=value' parameters (or via the 'headers' parameter to .get / .post)? Reliable escaping or what?

It's mostly so they don't end up in log files, specifically for GET requests that need authorisation, as params={'login':x} would add ?login=x to the end of the url.

Code-wise, at least with the requests library, it means you don't need to add it in two different places: headers for POST and params for GET.

They're admittedly not big differences, but it's the more "best practice" way to do it, and the way that's encouraged for API authorisation.
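
Roughly the difference in code (just a sketch, with placeholder credentials):

import requests

apiUsername = 'example_user'     # placeholder credentials
apiKey = 'example_key'

# as query parameters: the credentials become part of the URL (and can end up in logs)
requests.get('https://e621.net/posts.json',
             params={'login': apiUsername, 'api_key': apiKey, 'tags': 'fav:x'})

# as basic auth: sent in the Authorization header instead, same call shape for GET and POST
requests.get('https://e621.net/posts.json',
             params={'tags': 'fav:x'},
             auth=(apiUsername, apiKey))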

Pup


jezzar said:
Thanks for this, I'll use parts to help but won't copy it exactly (wouldn't want to plagiarise!).
Also, for "this folder", would os.getcwd() do the same?

Feel free to use as much or as little as you like, I just thought it might help to see a different way of doing it.

I could be wrong, but with os.getcwd(), if you had the terminal open in ~/folderA/ and ran python3 ../folderB/test.py, I think it might return folderA, as that's where the process is called from, and not where the script is.
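
A quick way to see the difference (the paths in the comments are hypothetical):

import os

print(os.getcwd())                                    # where the process was started from, e.g. ~/folderA
print(os.path.dirname(os.path.realpath(__file__)))    # where the script file itself lives, e.g. ~/folderB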

Also, I feel I should mention the E6 Discord, there's a channel for tech stuff if you ever want more help, there's a link at the top of the page on the navbar.

Quick edit:
for post in postsResponse['posts']:
Should be:
for post in returnedJSON['posts']:

And:
params = {'tags':searchString, 'page':'b' + lowestID}
Should be:
params = {'tags':searchString, 'page':'b' + str(lowestID)}


pup said:
I could be wrong, but with os.getcwd(), if you had the terminal open in ~/folderA/ and ran python3 ../folderB/test.py, I think it might return folderA, as that's where the process is called from, and not where the script is.

Yep you are exactly correct. Just tried it.

Pup


jezzar said:
What does the "page" parameter actually do?

It's on e621:api, but page=2 would return the second page of your search, just like navigating between pages on E6, with limit being how many posts are on a page. However, for going through posts it's quite bad: if someone adds a new post then page 2 will have the last post from page 1 on it as well, and if a post gets deleted you'll miss a post. It also only goes up to 750 pages.

So instead of page=2 you can use page=b2 which swaps it to "every post before postID 2". Then page=a2 would give posts after postID 2.

So the code returns a list of posts, say the lowest ID is 2000. The next time it looks for posts it'll say "get every post that matches the search with an ID below 2000", if it's 75 posts per page, and no posts are deleted, the lowest ID will be 1925, and on the next loop it'll say "get every post that matches the search with an ID below 1925". It essentially loops on post ID, so you eventually go through every post in your search.
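
So the paging part of the loop boils down to something like this (a sketch, reusing postsURL, session, params, headers, the HTTPBasicAuth credentials, returnedPostLimit and rateLimitThread from the earlier snippet):

lowestID = None
while True:
    if lowestID is not None:
        params['page'] = 'b' + str(lowestID)          # "every post with an ID below lowestID"

    rateLimitThread()                                 # stay under the rate limit between requests
    resp = session.get(postsURL, headers=headers, params=params,
                       auth=HTTPBasicAuth(apiUsername, apiKey))
    posts = resp.json()['posts']

    for post in posts:
        # (download the post here)
        if lowestID is None or post['id'] < lowestID:
            lowestID = post['id']

    if len(posts) < returnedPostLimit:                # a short page means there's nothing older left
        break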

Right, I've finished rewriting it and have published it all to the github (including an exe version so you don't have to install requests).
Tell me if anything is still poorly written.

Pup


jezzar said:
Right, I've finished rewriting it and have published it all to the github (including an exe version so you don't have to install requests).
Tell me if anything is still poorly written.

I appreciate it, but I'd prefer it if I wasn't in your user-agent header, it's your program after all.
If you really want to include that I helped, just have it as a comment :)

Then, just going down the code and mentioning things I noticed, there are a few, but they're mostly all small fixes:

It says "press enter to continue", but then pressing enter moves to exit(), closing it.

Unless I'm missing something, apiKey = apiKeys[2] should be apiKey = apiKeys[1].

Some tags use a semicolon, so space-delimited would really be better; then it'd probably be easier to just put them in the params rather than add them to the URL manually.

Rather than the for loop to join tags, you can use X = " ".join(tags) to join them together with a space between each tag.

For req.status_code != 200 you should probably show the JSON as well, as it can contain info about the error.

With using pastURL and postURL: page=b<id> shouldn't return the same post twice; if it does, you might need to use "page=b" + str(lowestID - 1) or something.

Rather than parsing the MD5 and extension you can just use data['file']['md5'] + data['file']['ext'].
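
i.e. something like this (assuming, as I believe is the case, that 'ext' is the bare extension without the dot, so you add it yourself):

fileName = data['file']['md5'] + '.' + data['file']['ext']    # e.g. "<md5>.png"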

You shouldn't need the cwd variable if you've got currentFolder already declared with the same value.

And lastly, the last bit checking for a 200 status code doesn't exit on a non-200 code.

One thing that might be awkward to add, but useful, is to not re-download already downloaded posts. You'd need to get a list of filenames, remove the extensions, then do something like:
if data['file']['md5'] not in downloadedPosts:
for each post.
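
A rough sketch of that (it assumes os is imported, the decoded response is in returnedJSON, and the files sit in a downloads/ folder next to the script):

downloadsFolder = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'downloads')
downloadedPosts = {os.path.splitext(name)[0] for name in os.listdir(downloadsFolder)}   # filenames minus extensions = md5s

for data in returnedJSON['posts']:
    if data['file']['md5'] in downloadedPosts:
        continue          # already on disk, skip it
    # (download the post here)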

The only other thing I can think of to add at the moment is, instead of "\\downloads\\", to use os.path.sep + "downloads" + os.path.sep, so that it'll work on Linux as well as Windows.
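
Or, possibly simpler, let os.path.join handle the separators (fileName here is hypothetical):

downloadPath = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'downloads', fileName)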

And best of luck with it all.

pup said:
I appreciate it, but I'd prefer it if I wasn't in your user-agent header, it's your program after all. (...)

Firstly, apiKeys[1] (for some reason) returns a blank string?
Next, I tried using your params method first but it didn't use them as tags? It just returned the latest posts.
The pastURL & postURL thing is really just a failsafe. It should never happen, but if it does, it's fine.
The whole not-re-downloading thing shouldn't be too hard to implement.
Thanks for all your help!

Would you be able to add an option to title posts as "postid.md5"? Such as 20000.53e138a4e9bab643e6d33e99524af01e for the postid 20000? Thanks in advance.
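
For what it's worth, the naming itself would just be something like this (a sketch, assuming the per-post dict from the API is in a variable called post):

fileName = str(post['id']) + '.' + post['file']['md5']       # e.g. 20000.53e138a4e9bab643e6d33e99524af01e
# or, keeping the real extension on the end so the file still opens normally:
fileName = str(post['id']) + '.' + post['file']['md5'] + '.' + post['file']['ext']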
