Topic: Tools for better data metrics

Posted under e621 Tools and Applications

Are there any apps out there for e621 datamining? I’m curious about things like the average user’s fav count, a given user’s most-favorited artist, that kind of thing.

The average fav count you can get from https://e621.net/stats: just divide total favorites by the number of users.
I don't think there are many statistics available about any given user, but they're not too hard to calculate.
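For illustration, with made-up numbers; plug in whatever the stats page shows at the time:

```python
# Hypothetical figures read off https://e621.net/stats -- substitute the real ones.
total_favorites = 250_000_000
registered_users = 1_000_000

print(f"average favs per user: {total_favorites / registered_users:.1f}")
```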

zerox3d said:
Are there any apps out there for e621 datamining? I’m curious about things like the average user’s fav count, a given user’s most-favorited artist, that kind of thing.

You should write a library. I recommend limiting it to one user at a time for the favorited artist calculation. ;)
There's already a lot of data for this sort of thing in the database export files. For analyzing your own account to give you some IRC stats-bot-type output, this is actually not a bad idea.
"My top five artists", "5 most used tags", "most common tags on favorites", "most common rare tags" (i.e. rank only the tags that appear, say, 20 times or fewer overall) <-- might actually be nice to have on Kitsunet or Digibase.

kora_viridian said:
You can DIY some of it with e621's dandy daily database dumps, although not everything you see on the interactive site is in the dump files.

You almost have to write your own software to deal with those dump files. A few of the files (aliases, implications) will fit in Excel, or in OpenOrifice if you swing that way. The tags file may still barely fit in Excel, or it may be a few rows too many by now. The posts file won't fit for sure; it has at least 3.5 million rows. Protip: use a CSV library; don't try to roll your own.
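If you go the Python route, the standard csv module copes fine with the posts file; the column names below are from memory of the dump's header row, so check them against the file you actually download (and decompress it first, or read it with gzip.open):

```python
import csv

# Lift the default per-field size cap; some dump fields (descriptions, sources)
# can be longer than csv's ~128 KB default.
csv.field_size_limit(10**7)

most_faved = []
with open("posts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Column names assumed from the posts dump header; adjust to match yours.
        most_faved.append((int(row["fav_count"] or 0), row["id"]))

most_faved.sort(reverse=True)
print(most_faved[:10])  # ten most-favorited post ids
```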

If you want to know something that you can only get from the interactive site, use the handy API; see the official directions or the unofficial ones. Note that e621 isn't kidding about the User-Agent header, and it helps a lot if you respect the rate limit as well.
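A minimal polite-client sketch with requests; /posts.json and the fav: search tag are documented, but treat the User-Agent string and page count here as placeholders to swap for your own project/contact info (and fav: only finds favorites the user hasn't hidden):

```python
import time
import requests

# Replace with your own project name and e621 username, per the API rules.
HEADERS = {"User-Agent": "my-stats-script/0.1 (by example_user)"}

def fetch_favorites(username, pages=3):
    """Pull a few pages of a user's public favorites via the search endpoint."""
    posts = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://e621.net/posts.json",
            headers=HEADERS,
            params={"tags": f"fav:{username}", "limit": 320, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("posts", [])
        if not batch:
            break
        posts.extend(batch)
        time.sleep(1)  # stay comfortably under the rate limit
    return posts

print(len(fetch_favorites("example_user")))
```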

Using a spreadsheet for a database. :D :d D:

kora_viridian said:
The other way around, "a database is just a really long spreadsheet, right?", is also popular. Only one table in the entire database, with bunches and bunches of columns. You'll see it more than once if you get paid to code, especially in corpo environments.

If you have data that's already CSV, TSV, fixed-columns, or similar, Excel (or OpenOrifice) isn't horrible, as long as you only use it to get an idea of the data you're dealing with. Everyone already has it and it takes a few seconds to import the CSV. You can then sort on the columns and see if you have any weird characters in the text columns, or if that column that *looks* all-numeric occasionally has letters in it, or if there are embedded commas/tabs that will require special handling, or similar. Then you can go write your import code, for sucking the CSV into MariaDB or SQLite or whatever, with greater confidence that you'll handle all the kinks correctly.
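The import side can be pretty small with Python's stdlib; the table layout here is trimmed down for the example rather than being the real dump schema:

```python
import csv
import sqlite3

csv.field_size_limit(10**7)  # same large-field caveat as with any of the dumps

con = sqlite3.connect("e621.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id INTEGER PRIMARY KEY,
        rating TEXT,
        fav_count INTEGER,
        tag_string TEXT
    )
""")

with open("posts.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    con.executemany(
        "INSERT OR REPLACE INTO posts VALUES (:id, :rating, :fav_count, :tag_string)",
        # Keep only the columns declared above; the real dump has many more.
        ({k: row[k] for k in ("id", "rating", "fav_count", "tag_string")}
         for row in reader),
    )
con.commit()

# After that it's just SQL, e.g. the ten most-favorited posts:
for row in con.execute("SELECT id, fav_count FROM posts ORDER BY fav_count DESC LIMIT 10"):
    print(row)
```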

Yeah, using it as a debug tool for smaller datasets isn't so bad. I was just laughing at a spreadsheet so big it blows straight past Excel's row limit, or runs into pointer limitations in whatever else you open it with. 3M+ rows is going to run like absolute dog****, right? Hidden cubic big-O complexity and all that.
