Topic: DB Export Parsing Suite (for Python 3.10+)

Posted under e621 Tools and Applications

If you've ever wanted to do anything with the vast amount of data e621 holds, nothing compares to the daily database exports. Over the last year or so, I've slowly accumulated an extensive suite of homegrown software designed to ease parsing the idiosyncratic not-technically-CSV format the data is provided in, and today I voluntarily spent most of my Sunday rewriting most of what I had written and expanding its capabilities to handle every field available, even the boring ones. This is because I don't do drugs.

This supports every export currently available except the wiki. All functions are generators which take a file handle and yield one dictionary per row, with key names corresponding to the column headers on the first line of the file. The types of the values vary depending on the column. Some non-exhaustive notes:

  • Stuff like post IDs and fav counts are integers, or None where not applicable.
  • Datestamps use Python's datetime library, and also may be None for early entries.
  • Some fields have custom enums.
  • Tags are returned as a list of strings, and pool contents as a list of integers.
  • Boolean values (like is_deleted) may be True, False or None, because the early years of e621's database are full of Fun.

Some examples of usage follow in the replies.

e6csv.py
from enum import Enum, IntEnum
from datetime import datetime
import io

class Rating(IntEnum):
    SAFE = 0
    QUESTIONABLE = 1
    EXPLICIT = 2

ratingtable = {"s": Rating.SAFE, "q": Rating.QUESTIONABLE, "e": Rating.EXPLICIT}

class PoolCategory(IntEnum):
    SERIES = 0
    COLLECTION = 1

# ACTIVE is assigned 0 here to ease boolean comparisons, since in most cases
# it's the only one you care about
class RelationStatus(IntEnum):
    ACTIVE = 0
    PENDING = 1
    DELETED = 2
    APPROVED = 3
    QUEUED = 4
    PROCESSING = 5
    RETIRED = 6

class TagCategory(IntEnum):
    GENERAL = 0
    ARTIST = 1
    # 2 is unused
    COPYRIGHT = 3
    CHARACTER = 4
    SPECIES = 5
    INVALID = 6
    META = 7
    LORE = 8

Schema = Enum("Schema", "INT STR TIME MQUOTE SQUOTE RATING BOOL STRLIST " \
    + "INTLIST FLOAT PCAT RSTAT TCAT")

# Return the index of the next comma at or after offset o, or the index of
# the trailing newline if the line contains no further commas.
def commaorend(l, o):
    n = l.find(",", o)
    if n == -1:
        n = len(l)-1
    return n

def readcsv(f, schema):
    lineno = 1
    try:
        colnames = f.readline().strip().split(",")
        l = f.readline()

        # e621's string mangling algorithm works as follows:
        # If the first character is not a double quote, the string does not
        # contain any newlines, commas or double quotes, and is terminated by
        # the end of the field as normal.
        # If the first character is a double quote, the string continues until
        # the next unescaped double quote.
        # All characters are valid including commas and newlines, and double
        # quotes are escaped by following each one with a second double quote.
        # The string returned from this function replaces escaped double quotes
        # within the string with unescaped double quotes, and does not include
        # the enclosing double quotes (if any).
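        # For example (a made-up field, not from a real export), the raw field
        #     "he said ""hi"", then left"
        # is returned as:
        #     he said "hi", then left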
        def unescape(o, multiline):
            nonlocal l, lineno
            buf = io.StringIO()
            if l[o] != '"':
                n = commaorend(l, o)
                buf.write(l[o:n])
            else:
                while True:
                    o += 1
                    n = l.find('"', o)
                    while n == -1:
                        if not multiline:
                            raise Exception("Expected \"")
                        buf.write(l[o:])
                        l = f.readline()
                        lineno += 1
                        o = 0
                        n = l.find('"')
                    buf.write(l[o:n])
                    n += 1
                    if l[n] != '"':
                        break
                    buf.write('"')
                    o = n
            return (buf.getvalue(), n)

        while len(l) > 0:
            row = {}
            o = 0
            for col in zip(colnames, schema, strict=True):
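                # Advance past the comma that separated the previous field
                # (o tracks our cursor position within the current line)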
                if o != 0:
                    o += 1
                if o == len(l):
                    raise Exception("Reached end of line unexpectedly")
                match col[1]:
                    case Schema.INT:
                        # Integer (may be blank)
                        n = commaorend(l, o)
                        v = int(l[o:n]) if o != n else None
                    case Schema.STR:
                        # String (cannot contain commas, newlines or quotes)
                        n = commaorend(l, o)
                        v = l[o:n]
                    case Schema.TIME:
                        # Datestamp (these SUCK)
                        # The microsecond part may not be present or may have
                        # fewer than 6 decimal places, which causes the
                        # datetime library to throw a wobbly.
                        # May also be blank.
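                        # e.g. "2021-07-26 20:52:16.16" is padded to
                        # "2021-07-26 20:52:16.160000" before parsing.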
                        n = commaorend(l, o)
                        if n == o:
                            v = None
                        else:
                            datestr = l[o:n]
                            if n-o == 19:
                                datestr += ".000000"
                            else:
                                datestr += "0" * (26-(n-o))
                            v = datetime.fromisoformat(datestr)
                    case Schema.MQUOTE:
                        # Arbitrary data (may contain anything including
                        # newlines)
                        v, n = unescape(o, True)
                    case Schema.SQUOTE:
                        # Arbitrary line (may contain anything except newlines)
                        v, n = unescape(o, False)
                    case Schema.RATING:
                        # Rating
                        # Must be one of "s", "q" or "e", and cannot be blank
                        n = commaorend(l, o)
                        v = ratingtable[l[o:n]]
                    case Schema.BOOL:
                        # Boolean
                        # Must be "t", "f", or blank
                        n = commaorend(l, o)
                        boolstr = l[o:n]
                        match boolstr:
                            case "t":
                                v = True
                            case "f":
                                v = False
                            case "":
                                v = None
                            case _:
                                raise Exception("Invalid bool: "+boolstr)
                    case Schema.STRLIST:
                        # Space-separated string list
                        v, n = unescape(o, False)
                        v = v.split(" ") if n != o else []
                    case Schema.INTLIST:
                        # Comma-separated integer list in curly brackets
                        # Also enclosed in double quotes unless the list has 1
                        # or 0 entries (i.e. no commas).
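                        # e.g. "{123,456}" parses to [123, 456] and {} to [].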
                        if l[o] == '"':
                            o += 1
                        if l[o] != "{":
                            raise Exception("Expected {")
                        o += 1
                        n = l.find("}", o)
                        if n == -1:
                            raise Exception("Expected }")
                        v = [int(s) for s in l[o:n].split(",")] if n != o else []
                        n += 1
                        if l[n] == '"':
                            n += 1
                    case Schema.FLOAT:
                        # Float (may be blank)
                        n = commaorend(l, o)
                        v = float(l[o:n]) if o != n else None
                    case Schema.PCAT:
                        # Pool category (cannot be blank)
                        n = commaorend(l, o)
                        v = PoolCategory[l[o:n].upper()]
                    case Schema.RSTAT:
                        # Relation status (cannot be blank)
                        n = commaorend(l, o)
                        v = RelationStatus[l[o:n].upper()]
                    case Schema.TCAT:
                        # Tag category (numeric, cannot be blank)
                        n = commaorend(l, o)
                        v = TagCategory(int(l[o:n]))
                    case _:
                        raise Exception("Unknown schema: "+str(col[1]))
                row[col[0]] = v
                o = n
            yield row
            l = f.readline()
            lineno += 1
    except Exception as e:
        raise Exception("Error at line "+str(lineno)) from e

def readposts(f):
    return readcsv(f, [Schema.INT, Schema.INT, Schema.TIME, Schema.STR, \
        Schema.MQUOTE, Schema.RATING, Schema.INT, Schema.INT, Schema.STRLIST, \
        Schema.STRLIST, Schema.INT, Schema.STR, Schema.INT, Schema.INT, \
        Schema.INT, Schema.INT, Schema.INT, Schema.MQUOTE, Schema.FLOAT, \
        Schema.TIME, Schema.BOOL, Schema.BOOL, Schema.BOOL, Schema.INT, \
        Schema.INT, Schema.INT, Schema.BOOL, Schema.BOOL, Schema.BOOL])

def readpools(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.TIME, Schema.TIME, \
        Schema.INT, Schema.MQUOTE, Schema.BOOL, Schema.PCAT, Schema.INTLIST])

def readrelations(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.SQUOTE, Schema.TIME, \
        Schema.RSTAT])

def readtags(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.TCAT, Schema.INT])
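
As a quick smoke test, the minimal usage pattern looks like this (the file name is just an example; substitute whichever day's export you grabbed):

import e6csv

with open("tags-2024-06-01.csv") as f:
    for tag in e6csv.readtags(f):
        print(tag["name"], tag["post_count"])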


Given a posts CSV and a pools CSV (in that order), the following code will rank every one of Fuze's comics and other pools in descending order of average post score, in an attempt to determine once and for all what the all-time most popular Fuze comic is:

import sys
import os
import e6csv

print("Reading posts file...")

f = open(sys.argv[1], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
fuzeposts = {}

for post in e6csv.readposts(f):
    linecount += 1
    if linecount == 10000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if post['is_deleted'] or "fuze" not in post['tag_string']:
        continue
    fuzeposts[post['id']] = post['score']

f.close()
print("\nReading pools file...")

f = open(sys.argv[2], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
fuzepools = {}

for pool in e6csv.readpools(f):
    linecount += 1
    if linecount == 1000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    posts = pool['post_ids']
    if len(posts) == 0 or posts[0] not in fuzeposts:
        continue
    totalscore = sum([fuzeposts[post] for post in posts])
    fuzepools[pool['name']] = totalscore / len(posts)

print()

for pool in sorted(fuzepools.items(), key=lambda p: p[1], reverse=True):
    print(str(pool[1])+" "+pool[0])

Note: This only works because none of Fuze's posts have ever been added to a pool containing the work of a different artist. A more cautious approach may be needed for other artists.

Note 2: This didn't work when I ran it because it turned out pool #12687 had a deleted post in it. Naughty.

Note 3: It's the incest comic, BTW. Wish I could say I was surprised.
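
Given notes 1 and 2, a more defensive scoring helper might look something like this (just a sketch; the function name and the 0.8 threshold are mine, tune to taste):

def pool_average(pool, artistposts, threshold=0.8):
    # Skip posts we didn't index (deleted, or not by this artist), and skip
    # pools where the artist's share of the posts falls below the threshold.
    posts = pool['post_ids']
    scores = [artistposts[p] for p in posts if p in artistposts]
    if not posts or len(scores) / len(posts) < threshold:
        return None
    return sum(scores) / len(scores)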

Given a tags CSV and a posts CSV (in that order), the following code will produce an accurate post count of every tag in the invalid category (ignoring deleted posts) and name and shame all the tags for which the claimed tag count is the most inaccurate:

import sys
import os
import e6csv
from e6csv import TagCategory

class TagCount:
    pass

f = open(sys.argv[1], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
invtags = {}

for tag in e6csv.readtags(f):
    linecount += 1
    if linecount == 1000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if tag['category'] == TagCategory.INVALID:
        tc = TagCount()
        tc.claimed = tag['post_count']
        tc.actual = 0
        invtags[tag['name']] = tc

f.close()
print()

f = open(sys.argv[2], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
tags = {}

for post in e6csv.readposts(f):
    linecount += 1
    if linecount == 10000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if post['is_deleted']:
        continue
    for tag in post['tag_string']:
        if tc := invtags.get(tag):
            tc.actual += 1

f.close()
print()

for name, tc in sorted(invtags.items(), key=lambda i: abs(i[1].claimed - i[1].actual), reverse=True):
    if tc.claimed != tc.actual:
        print(str(abs(tc.claimed-tc.actual))+" "+name)

Given an implication CSV and an alias CSV (in that order), the following code will produce a list of every tag you could possibly write that would result in pokemon_(species) being automatically added to a post:

import sys
import os
import e6csv

f = open(sys.argv[1], "r")
tree = {}

for implication in e6csv.readrelations(f):
    # RelationStatus.ACTIVE is 0, so this skips every non-active implication
    if implication['status']:
        continue
    antecedent = implication['antecedent_name']
    consequent = implication['consequent_name']
    if implicators := tree.get(consequent):
        implicators.append(antecedent)
    else:
        tree[consequent] = [antecedent]

f.close()

def recurse(tags, implication):
    tags.add(implication)
    if implicators := tree.get(implication):
        for i in implicators:
            recurse(tags, i)

poketags = set()
recurse(poketags, "pokemon_(species)")

f = open(sys.argv[2], "r")

for alias in e6csv.readrelations(f):
    # likewise, skip non-active aliases
    if alias['status']:
        continue
    if alias['consequent_name'] in poketags:
        poketags.add(alias['antecedent_name'])

f.close()

for t in poketags:
    print(t)

Ayo, this is kinda based, probably. I just skimmed your posts and might use this, because I'm wrangling the posts CSV myself right now. Thanks.

The tags aren't in quotes, and I don't really care about the sources and description fields right now; I really just want a file where every individual line is one post's data, for easy grepping.

So earlier, I did this cursed business:

# --- thanks stack overflow (this processes lines that don't start with "^\d+,\d*,\d+-" (pcre) and tacks them onto the end of the previous; something like that)
awk '{printf (/^[0-9]+,[0-9]*,[0-9]+-/&&NR>1?RS:x)"%s ",$0} END {print ""}' ~/e6db/posts-2023-02-15.csv > ~/e6db/fixed-2023-02-15.csv

# --- surely nothing could go wrong here
sed 's/"[^"]*"//g' ~/e6db/fixed-2023-02-15.csv > ~/e6db/nostrings-2023-02-15.csv

# --- my metric for "it's probably a nice entry with trivial string contents" was whether or not there's 28 commas to delimit at.......
wc -l ~/e6db/fixed-2023-02-15.csv
awk '{x=$0}gsub(",","",x)==28' ~/e6db/nostrings-2023-02-15.csv | wc -l

# --- or if strings without commas are fine...
awk '{x=$0}gsub(",","",x)==28' ~/e6db/fixed-2023-02-15.csv > ~/e6db/almostnostrings-2023-02-15.csv
# --- for the rest, yank out all the strings once again so that (hopefully??) every line will always contain 28 commas
awk '{x=$0}gsub(",","",x)!=28' ~/e6db/fixed-2023-02-15.csv | sed 's/"[^"]*"//g' >> ~/e6db/almostnostrings-2023-02-15.csv
wc -l ~/e6db/posts-2023-02-15.csv ~/e6db/fixed-2023-02-15.csv ~/e6db/nostrings-2023-02-15.csv ~/e6db/almostnostrings-2023-02-15.csv
   10870056 ~/e6db/posts-2023-02-15.csv
    3865980 ~/e6db/fixed-2023-02-15.csv
    3865980 ~/e6db/nostrings-2023-02-15.csv
    3865980 ~/e6db/almostnostrings-2023-02-15.csv

Just from grepping the input file for roughly what entry starts look like, and for roughly what the ends of lines look like (three bools, mostly ig), I got to the point where those counts matched the ones in my "fixed" files, so I kinda consider the cursed business done and will use this as my CSV to do lookups on.

Maybe someone else will find my mildly cursed awk/sed magic useful at some point, idk!

[...] This is because I don't do drugs.

also lmao so true

Edit: Aaaand I've just realized that there's a few rare tags on e6 that have double quotes in them and so there's about 1200 or so posts where the tags will be quoted and get yanked in some of my clever filtering. Oh well.
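
If you'd rather dodge those edge cases entirely, Python's built-in csv module already understands the doubled-quote escaping; something like this untested sketch should produce the same one-post-per-line file without losing any fields:

import csv
import sys

# Rewrite the export so each post occupies exactly one line, by replacing
# the newlines embedded in quoted fields; csv handles the "" escaping.
with open(sys.argv[1], newline="") as fin, open(sys.argv[2], "w", newline="") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow([field.replace("\n", " ") for field in row])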


emserdalf said:
Edit: Aaaand I've just realized that there's a few rare tags on e6 that have double quotes in them and so there's about 1200 or so posts where the tags will be quoted and get yanked in some of my clever filtering. Oh well.

Was going to point that out, but you beat me to it. Also, wait until you find the tags with commas in them (although currently none have any posts).

Thank you for this; the DB dumps are a mess.

Used it to find long post chains:
import datetime
import os
import e6csv
import gzip
from typing import Dict, Tuple
import shutil


def download_export():
    print("Download starting")
    url = f"https://e621.net/db_export/posts-{datetime.datetime.today().strftime('%Y-%m-%d')}.csv.gz"
    # far faster than any Python-based method I could find
    os.system(f"curl {url} > db_export.csv.gz")
    print("Download complete, unzipping")
    with gzip.open("db_export.csv.gz") as f:
        with open("db_export.txt", "wb") as f_out:
            shutil.copyfileobj(f, f_out)
    print("Unzipping complete")


def parse_csv():
    posts: Dict[int, int] = dict()
    print("Parsing start")
    with open("db_export.txt", mode="rt", encoding="utf-8") as f:
        for post in e6csv.readposts(f):
            if post["parent_id"] is not None:
                posts[post["id"]] = post["parent_id"]
    print(f"Parsing complete. Posts with parents: {len(posts)}")
    return posts


def get_chains(posts: Dict[int, int]):
    print("Chains start")
    chains: Dict[int, Tuple[int, int]] = dict()
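    # For each post, walk up its parent links until reaching a root (a post
    # with no recorded parent), guarding against cycles with the visited set,
    # and keep only the longest chain found for each root.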
    for post_id, parent in posts.items():
        current = post_id
        chain = {post_id}
        while parent is not None and parent not in chain:
            current = parent
            chain.add(parent)
            parent = posts.get(parent, None)
        if (current not in chains or len(chain) > chains[current][0]) and len(chain) > 10:
            chains[current] = len(chain), post_id
    print("Chains end")
    return chains


def main():
    download_export()
    posts = parse_csv()
    chains = get_chains(posts)
    print("Writing output file")
    with open("output.txt", "w") as f:
        for length, chain in sorted(chains.values(), reverse=True):
            print(f"{chain} : {length}", file=f)
    print("Done")


if __name__ == '__main__':
    main()

wat8548 said:
If you've ever wanted to do anything with the vast amount of data e621 holds, nothing compares to the daily database exports. Over the last year or so, I've slowly accumulated an extensive suite of homegrown software designed to ease parsing the idiosyncratic not-technically-CSV format the data is provided in, and today I voluntarily spent most of my Sunday rewriting most of what I had written and expanding its capabilities to handle every field available, even the boring ones. This is because I don't do drugs.

The CSV format is pretty standard, though. Doubling quotation marks to escape them inside text is the conventional approach, and it's interpreted without any issues by SQLite's .import command (.import --csv posts-2023-07-08.csv posts), LibreOffice Calc (if you feel like running out of RAM), or even Python's built-in CSV reader:

import csv

def posts_csv_gen():
    with open('posts-2023-07-08.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

If you're willing to download libraries built for that kind of data manipulation, Pandas or Polars are both great choices (though they might feel intimidating at first):

import pandas as pd

# https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
e6_posts_types = {
    "id": int,
    "uploader_id": int,
    "created_at": str,
    "md5": str,
    "source": "string",  # nullable str
    "rating": "category",  # auto-detect enum
    "image_width": 'Int32',  # nullable int
    "image_height": 'Int32',  # nullable int
    "tag_string": str,
    "locked_tags": "string",  # nullable string
    "fav_count": int,
    "file_ext": "category",  # auto-detect enum
    "parent_id": "Int32",  # nullable int
    "change_seq": int,
    "approver_id": "Int32",  # nullable int
    "file_size": int,
    "comment_count": int,
    "description": "string",
    "duration": "Float32",  # nullable float
    "updated_at": str,
    "is_deleted": bool,
    "is_pending": bool,
    "is_flagged": bool,
    "score": int,
    "up_score": int,
    "down_score": int,
    "is_rating_locked": bool,
    "is_status_locked": bool,
    "is_note_locked": bool,
}

post_data = pd.read_csv(
    'posts-2023-07-08.csv.gz', compression='gzip',
    dtype=e6_posts_types, parse_dates=["created_at", "updated_at"],
    true_values=["t"], false_values=["f"], date_format="ISO8601"
)

print(post_data.nlargest(5, 'score'))

(Output, trimmed to a handful of the 29 columns so it fits on screen:)

              id rating  fav_count file_ext    duration  score  up_score  down_score
2840617  2848682      e      33232     webm  281.791992  18635     19202        -567
2338206  2346230      e      28027     webm  200.039993  15090     15330        -240
3997734  4005902      e      27958     webm  277.873993  14340     14460        -120
2601276  2609325      e      22699     webm  230.039993  12010     12255        -245
2306776  2314797      e      23425     webm  182.000000  11977     12179        -202
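
For completeness, a rough Polars version of the same top-5 query might look like this (a sketch I haven't run; it lets Polars infer the dtypes instead of spelling them all out):

import polars as pl

# Let Polars infer the column types; the "t"/"f" booleans just stay strings,
# which doesn't matter for a simple score ranking.
post_data = pl.read_csv("posts-2023-07-08.csv")
print(post_data.sort("score", descending=True).head(5))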

Well, it's been an embarrassingly long time, but I finally found the rather obvious bug in the rating code. OP updated, but if you already saved the file, just replace the one occurrence of "ratingstr" with "v".
