Topic: Ways to find tags often used together?

Posted under e621 Tools and Applications

Hello everyone,

I am currently trying to find a way to find posts that include the tags solo, male OR female, and the name of a character. I've got a list of character tags as a .txt file, and was wondering what the best way to find the number of posted images with this character and the tags is. I'm not looking to scrape the images themselves, just the number of images is enough.

To explain what I mean:

Let's say the .txt file only consisted of 3 characters, A, B and C.
I want to check how many images on e621 have the following tag combos:
A, solo, male
A, solo, female
B, solo, male
B, solo, female
C, solo, male
C, solo, female

I already grabbed the ~3 GB large file for all posts on the site, but that seems overkill. Also my attempt so far to reduce that file to only posts containing "solo" have been less than successful, with both Notepad++ and Excel refusing to work properly due to the large file size.

Is there a way to use the API to make these calls, preferably without me needing to enter every character by hand, and instead sourced from the .txt file? If so, a way to get started with using the API would be great, the Wiki page didn't help much besides letting me use my browser to get JSON files.

Any help would be appreciated, also sorry if my question seems incredibly dumb

There's unfortunately no API endpoint that provides a number of results for a search (issue #288) so the only option is to increment through every single page of the search listings (not advised) or use the db export.

topic #37429 has a tool to parse the CSV, though I can't personally endorse it because I haven't tried it.

Thanks for replying! Gonna check that one out.

This might not be e621 related, but once I extract all posts with the "solo" tag to a .txt, do you have any idea how I could go about comparing it with the character .txt?
Best case would be to use a script that extracts all posts with one of the combos described above, and outputs them to a new file. I could just count the amount of lines in these files to find out.

Otherwise I will still need to go through the characters one by one, and at that point I could probably just search for them on the site itself lol

Just wanted to thank you guys again for the help! I got DB Browser for SQLite, turned the .csv into a database, and have been working with it for a few days now due to some hickups like lost progress due to not saving changes to the database, trying to figure out how to use SQL in the first place, using LEFT OUTER JOIN to search through tag_string for the characters and all that (boy does that query take hours lol)

I got so far that I have a list now structured Character, female, male, ambiguous, with "solo" posts counted for each tab.
Have many characters with NULL in all columns, thanks to either misspellings (wolf_o'donnell was spelled wolf_odonnell, as an example) or the characters just not appearing solo at all. For these I am currently going through the list by hand, which I will come back to from time to time in order not to burn out lol.

At least I got to learn something from all this, though I am not sure if and when I will ever need SQL again lol

kora_viridian said:
Thanks for the update... the mad lad actually did it! :D

SQL gets used a lot in big corpo databases - stuff like customer data and billing. Normally only a few people can sit down in front of the database and just type in random SQL queries, but tons of automated SQL queries happen all the time. When you go to pay (say) your electric bill online, it's pretty likely that the web page you see has done something like SELECT due_date, amount_due FROM current_bills WHERE customer="superfid2006" to get the data.

I had a semester-long class in college about databases. You learned how to do a lot of it in a couple of weeks, because you wanted to know how much pr0n there is of different vidyagame characters. You may or may not be able to put that on your resume, but at least you learned something. :D

Kind of an aside: I've noticed that a lot of people write downloaders, taggers, and other tools to interact with e621's site and API. Part of that is because e621 makes the data available and documents it, and part of that is because people want to organize and optimize their pr0n viewing. It's also interesting because the size of e621's dataset is decidedly non-trivial; things that work fine on the typical comp-sci-class "sample database" that has 20 rows in it will take a week to run, or time out completely, on a database with over three and a half million rows in it. I think it's funny that a d-ld- company has inadvertently become a force for comp-sci education. :D

Thankfully a query takes only a day, not a week lol. I've made three seperate tables out of the .csv, all solo, and one for female, male and ambiguous (nevermind that I forgot intersex for example lmao), and then make a LEFT OUTER JOIN where each of these three is joined with the character list, using ON solo_female.ta_string like '%' || characterlist.names || '%'; same with the other three tables.
In testing with a smaller character list of 20, I found out that it completes much faster when using FROM solo_female, and joining the character list instead of the other way around.

Currently working through a second character list of around 50k entries, gonna join the resulting tables up with the other one and combine them in Excel (tried for a few hours to use SQLite, but only needed a few minutes with Excel so I'll probably do so again)

And about the characters that were not tagged 'solo' - they are usually entries with 50 posts at most, and considering I would have to go through around 12k entries by hand, I don't think I'm gonna bother. Maybe add a few examples from time to time, but that's a matter for later.

  • 1