Topic: E621.net Site Rip Torrent version soon

Posted under General

I am currently purging/backing up the whole website onto my server. I am around 50% done and will upload a torrent version of the whole website, compressed (x6 bit map). I already have around 2 million pictures at 60 GB and will seed at 100 Mb/s for a couple of weeks when it's done. Who is keen?

Updated by e621 name

ILOVEBLOOD said:
I am currently purging/backing up the whole website onto my server. I am around 50% done and will upload a torrent version of the whole website, compressed (x6 bit map). I already have around 2 million pictures at 60 GB and will seed at 100 Mb/s for a couple of weeks when it's done. Who is keen?

why?

Updated by anonymous

Every site dies eventually. You can really only trust data that you're storing yourself. Even data you store yourself is unreliable, but at least you have control over that unreliability.

What do you mean by "compressed (x6 bit map)"?

Updated by anonymous

Conker said:
why?

Because there is going to be a big removal from e621 soon and people will lose artwork they love, so I'm making a backup.

Updated by anonymous

Wyvrn said:
Every site dies eventually. You can really only trust data that you're storing yourself. Even data you store yourself is unreliable, but at least you have control over that unreliability.

What do you mean by "compressed (x6 bit map)"?

Because I have 128 GB of RAM, I can compress it 6 times per byte, making it 6-10 times smaller than what it would be.

Updated by anonymous

ILOVEBLOOD said:
Because I have 128 GB of RAM, I can compress it 6 times per byte, making it 6-10 times smaller than what it would be.

post #116508

Updated by anonymous

Images are already compressed; you can't losslessly compress them further by more than a few percent, no matter how much RAM you have.
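
If you doubt it, try it yourself: already-compressed data is statistically close to random, and random data doesn't compress. A quick Python illustration, with random bytes standing in for a JPEG's entropy-coded payload:

import os
import zlib

# Already-compressed data is close to random, and random data does not
# compress; random bytes here stand in for a JPEG's compressed payload.
payload = os.urandom(1_000_000)
packed = zlib.compress(payload, 9)
print(len(packed) / len(payload))  # ~1.0003: it actually got slightly bigger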

Updated by anonymous

I could still cut it down a lot, because no one is going to download 200 GB+ of furry porn in one sitting. Also, I'm ripping the server versions; most pictures will be 1 MB+ each.

Updated by anonymous

I personally wouldn't want to download a torrent like this unless I was getting the full-resolution source quality images, with metadata.

If you want to cut down the size of the download, maybe split it up into several torrents, ordered by date of upload (post ID). That would make it easy to update the rip occasionally by just posting another torrent with the most recently uploaded pictures.

Updated by anonymous

As I said in the last e621 torrent thread...

I don't think his PC has enough memory for all dat COCK
post #106282

Updated by anonymous

Well, I do run servers for work. Each tower will support up to around 500 GB if needed, but all the pictures on the website will only be around 240 GB, which isn't too much data. Also, you're a crab; your point is invalid.

Updated by anonymous

ILOVEBLOOD said:
Because there is going to be a big removal from e621 soon and people will lose artwork they love, so I'm making a backup.

THE GREAT PURGE IS COMING!!!!!111

Updated by anonymous

You mean, something like this?

http://imgur.com/Jtp3Yta

I coded this yesterday. A simple shell script which runs on a Busybox httpd server on my phone. Porn on the go!

I have a tiny PHP script which, based on a search ("fav:lizardite" for me), downloads every picture and its tags, and generates a 200x200px thumbnail for the search script.
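
If anyone wants to replicate this without PHP, here is a rough Python sketch of the same idea. The "/post/index.json" endpoint and its "id"/"file_url"/"tags" fields are my recollection of the e621 API of that era, so treat them as assumptions; thumbnail generation is left out:

#!/usr/bin/env python3
# Rough sketch of the search-based dump described above (the original is
# a PHP script). The /post/index.json endpoint and the "id", "file_url"
# and "tags" field names are assumptions, not confirmed.
import json
import os
import urllib.parse
import urllib.request

BASE = "https://e621.net/post/index.json"
SEARCH = "fav:lizardite"  # the example search from the post above
OUT = "dump"
UA = {"User-Agent": "search-dump-sketch/0.1"}  # identify the client politely

def fetch(url):
    return urllib.request.urlopen(urllib.request.Request(url, headers=UA))

os.makedirs(OUT, exist_ok=True)
page = 1
while True:
    query = urllib.parse.urlencode({"tags": SEARCH, "page": page})
    posts = json.load(fetch(BASE + "?" + query))
    if not posts:  # an empty page means we have run out of results
        break
    for post in posts:
        ext = os.path.splitext(post["file_url"])[1]
        with open(os.path.join(OUT, str(post["id"]) + ext), "wb") as fh:
            fh.write(fetch(post["file_url"]).read())
        with open(os.path.join(OUT, str(post["id"]) + ".tags.txt"), "w") as fh:
            fh.write(post["tags"].replace(" ", "\n"))  # one tag per line
    page += 1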

Updated by anonymous

All my program does is search for images on https://e621.net, go to the original source, download it, and move along. There will be no copies, and every photo will be as high-res as on the website.

Updated by anonymous

ILOVEBLOOD said:
Because there is going to be a big removal from e621 soon and people will lose artwork they love, so I'm making a backup.

Yeah, where's your proof, new guy? :/

Updated by anonymous

ILOVEBLOOD said:
Because there is going to be a big removal from e621 soon and people will lose artwork they love, so I'm making a backup.

what

Updated by anonymous

ILOVEBLOOD said:
Because there is going to be a big removal from e621 soon and people will lose artwork they love, so I'm making a backup.

What are you talking about?

There's a 342-image takedown waiting to happen, but I wouldn't call 0.1% of our content big.

Sad, but not big.

Updated by anonymous

Is 4chan coming to raid?

Cause I wanna livestream their antics

Updated by anonymous

So, I liked this idea, and I'm also dumping every picture.

I am, however, using a different approach: I am not downloading this entire website using a ready-made downloader like HTTrack, because it also downloads useless stuff.

I've coded a multithreaded bash script (mostly for practice, as I am quite bad at *nix shell) which downloads each post (using "/post/show/[NUM]/") to my HDD, along with its MD5 sum and a list of its tags. If a post is deleted, it saves a "deleted.nx" file so the script doesn't attempt to download it again.

The structure is also quite simple:

Dir "data":
   Dir "00000013":
      File "deleted.nx" <- empty file, this post existed but was removed
   Dir "00000014":
      File "image.jpg" <- picture
      File "tags.txt" <- a list of tags, one per line
      File "md5.txt" <- image checksum

Finally, this is a capture of the debug messages:

[03] [00004694] Local file: ./data/00004694/file.jpg
[08] [00004889] Image URL:
[04] [00004895] Image URL: https://static1.e621.net/data/69/97/69972a822527db6db8f74a9f30c9f985.jpg
[04] [00004895] Local file: ./data/00004895/file.jpg
[08] [00004889] Post deleted
[00] [00004881] Image URL: https://static1.e621.net/data/59/8b/598baf3a7c0d54ebc457604268d3d9d1.jpg
[00] [00004881] Local file: ./data/00004881/file.jpg
[01] [00004822] Success
[06] [00004937] Success
[07] [00004928] Image URL: https://static1.e621.net/data/e9/b7/e9b719b151039a4310c4febcca9755c5.jpg
[07] [00004928] Local file: ./data/00004928/file.jpg
[02] [00004823] Success
[08] [00004899] Image URL: https://static1.e621.net/data/27/11/2711d57afa6b668cff30e13ef4dd0572.gif
[08] [00004899] Local file: ./data/00004899/file.gif
[09] [00004860] Success
[05] [00004826] Success
[01] [00004832] Image URL:
[01] [00004832] Not created yet
[06] [00004947] Image URL: https://static1.e621.net/data/e5/1a/e51a5eb746debf0cab4e1be7b74613ca.jpg
[06] [00004947] Local file: ./data/00004947/file.jpg
[03] [00004694] Success

The first number is the thread ID, then the post ID, then the message.
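
For anyone who wants the gist without reading bash, here is a rough Python equivalent of the approach (a sketch, not my actual script). The image-URL regex mirrors the static1.e621.net links in the debug log above; everything else about the page markup is an assumption, and tag extraction is omitted for brevity:

#!/usr/bin/env python3
# Rough Python equivalent of the multithreaded per-post dumper described
# above. Assumes a 404 means a deleted post and that the full-size image
# URL appears in the page as a static1.e621.net link.
import concurrent.futures
import hashlib
import os
import re
import urllib.error
import urllib.request

def dump_post(post_id):
    out = os.path.join("data", "%08d" % post_id)
    os.makedirs(out, exist_ok=True)
    if os.path.exists(os.path.join(out, "deleted.nx")):
        return "skipped (marked deleted)"
    try:
        page = urllib.request.urlopen(
            "https://e621.net/post/show/%d/" % post_id
        ).read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # leave an empty marker so the next run skips this post
            open(os.path.join(out, "deleted.nx"), "w").close()
            return "post deleted"
        raise
    match = re.search(r"https://static1\.e621\.net/data/\S+?\.(?:jpg|png|gif)", page)
    if match is None:
        return "not created yet"
    data = urllib.request.urlopen(match.group(0)).read()
    ext = os.path.splitext(match.group(0))[1]
    with open(os.path.join(out, "file" + ext), "wb") as fh:
        fh.write(data)
    with open(os.path.join(out, "md5.txt"), "w") as fh:
        fh.write(hashlib.md5(data).hexdigest())  # checksum alongside the image
    return "success"

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    ids = range(1, 5001)
    for post_id, status in zip(ids, pool.map(dump_post, ids)):
        print("[%08d] %s" % (post_id, status))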

It's going slow, but I will attempt to upload a dump of the first 100000 pictures when it finishes.

Updated by anonymous

Just try to keep the requests at a reasonable rate, no more than an image every second or two.

Updated by anonymous

Slowly is the nicer way to do it. Even at 1 hit/second, you could dump the whole site in just a few days, so there's no rush.
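
Throttling costs almost nothing to implement, either. A minimal sketch for a single-threaded dumper (the polite_get helper name is made up):

import time
import urllib.request

# Never send requests faster than one per MIN_INTERVAL seconds.
MIN_INTERVAL = 1.0
_last_request = [0.0]

def polite_get(url):
    wait = _last_request[0] + MIN_INTERVAL - time.monotonic()
    if wait > 0:
        time.sleep(wait)  # pad out the gap since the previous request
    _last_request[0] = time.monotonic()
    return urllib.request.urlopen(url).read()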

Updated by anonymous

tony311 said:
Just try to keep the requests at a reasonable rate, no more than an image every second or two.

Oh god, I didn't even think about the strain these antics could put on the servers on top of a regular day. Is this why the servers overflowed yesterday?

Updated by anonymous

Sollux said:
Oh god, I didn't even think about the strain these antics could put on the servers on top of a regular day. Is this why the servers overflowed yesterday?

News - Jan 07, 2014
e621 is back up and running! The database ran out of memory so we quadrupled it. We apologize for any inconvenience. Have a nice day!

Updated by anonymous

Xch3l said:

Yeah, but earlier that morning I was getting 500 errors sporadically.

Updated by anonymous

Sollux said:
Yeah, but earlier that morning I was getting 500 errors sporadically.

Yeah, because of the RAM issue :)

Updated by anonymous

tony311 said:
Just try to keep the requests at a reasonable rate, no more than an image every second or two.

Don't worry, I couldn't hit that rate even if I wanted to. I'm downloading at less than 0.8 files per second.

Note that I am not only downloading the files, but also extracting tags and checking the MD5 of each picture.

Updated by anonymous

The next few pictures I make are all going to have the same MD5 hash, to mess with anyone who tries to upload them here :P

Updated by anonymous

Wyvrn said:
The next few pictures I make are all going to have the same MD5 hash, to mess with anyone who tries to upload them here :P

Good luck with that. Besides the fact that this is nigh impossible for anything below terabyte-sized files, you're simply incapable of uploading the second image; the server simply declines anything with a matching MD5 that already exists.

Updated by anonymous

Good luck drawing 340282366920938463463374607431768211456 (that's 2^128) pictures until an MD5 hash repeats.

Updated by anonymous

You guys haven't been keeping up to date with just how broken MD5 is. There's a Windows GUI program to make colliding chosen-prefix files now.

Behold, two furry porns with the same MD5

Currently if you try to upload an image to e621 with an MD5 identical to an existing post, it prevents the upload, takes you to the clashing post, and adds all the tags you entered for the image you were uploading to the clashing post. If the two images were different, the existing post will be mistagged. Not a catastrophic failure, but still a good example of why in this day and age nobody should be relying on MD5 to tell differing images apart.

Updated by anonymous

Wyvrn said:
You guys haven't been keeping up to date with just how broken MD5 is. There's a Windows GUI program to make colliding chosen-prefix files now.

Behold, two furry porns with the same MD5

Currently if you try to upload an image to e621 with an MD5 identical to an existing post, it prevents the upload, takes you to the clashing post, and adds all the tags you entered for the image you were uploading to the clashing post. If the two images were different, the existing post will be mistagged. Not a catastrophic failure, but still a good example of why in this day and age nobody should be relying on MD5 to tell differing images apart.

I knew it was broken to the extent that some people had managed to forge an RSA certificate with a chosen MD5 using a cluster of 200 PS3s, but I didn't know it was THAT broken that you could do it at home.

IMHO, e621 should store the pictures based on the post ID and use the MD5 only as a faster lookup method (i.e. looking for pictures whose hash matches the picture we're attempting to upload and then, if any, comparing the bytes of the matching pictures), or use another hash (AFAIK SHA-1 is fine).
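
Something like this, as a sketch (stored_files being a hypothetical md5-to-paths index):

import hashlib

# Use the MD5 only to find candidate duplicates, then compare actual
# bytes before treating an upload as a duplicate.
def is_duplicate(upload_bytes, stored_files):
    digest = hashlib.md5(upload_bytes).hexdigest()
    for path in stored_files.get(digest, []):  # hash-collision candidates
        with open(path, "rb") as fh:
            if fh.read() == upload_bytes:      # true byte-for-byte duplicate
                return True
    return False  # no match, or the hash collided but the bytes differ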

Updated by anonymous

Changing hashes just extends the time until that one gets broken. I really don't think this is anywhere near as big a deal as you guys are making it out to be in relation to e6. For general usage, sure, but not here.

Updated by anonymous

Yeah, that SSL CA attack was really spectacular; it was actually done by the same guy who wrote HashClash. The reason it took 200 PS3s was that they were only able to add 2048 bits to the file to create a hash collision that would fit into the RSA certificate format. The more 512-bit blocks you're allowed to add, the less computationally expensive it is.

There are plenty of hash functions with no known attacks, though; modern crypto relies on them. SHA-1 isn't as broken as MD5, but there are still known weaknesses; it would be better to use SHA-2.

Comparing the bytes of a file when hashes collide is a good idea, and perfect future-proofing against SHA-2 being broken. MD5 could still be exposed to the user as a convenient cross-site identifier that is almost always unique, but internally, if the site is relying on hashes being unique, it should be using a different function.

Updated by anonymous

Wyvrn said:
Yeah, that SSL CA attack was really spectacular; it was actually done by the same guy who wrote HashClash. The reason it took 200 PS3s was that they were only able to add 2048 bits to the file to create a hash collision that would fit into the RSA certificate format. The more 512-bit blocks you're allowed to add, the less computationally expensive it is.

There are plenty of hash functions with no known attacks, though; modern crypto relies on them. SHA-1 isn't as broken as MD5, but there are still known weaknesses; it would be better to use SHA-2.

Comparing the bytes of a file when hashes collide is a good idea, and perfect future-proofing against SHA-2 being broken. MD5 could still be exposed to the user as a convenient cross-site identifier that is almost always unique, but internally, if the site is relying on hashes being unique, it should be using a different function.

Or just store them on a per-ID basis and use the hash only to search for matches. That way, if a hash gets broken, it's only a matter of updating the database.

Updated by anonymous

wtf, why are there single quotes in tags? That just ain't right.

Additional information: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'n_dale_rescue_rangers')' at line 1

Updated by anonymous

But don't worry, I am properly escaping it now.

If anyone cares, I am using a relational database to save all the images. That makes me wonder: how are the images saved on the e621 server?

The schema goes like this:

imagelist
---imageID (=e621 image ID)
---image (binary blob)
taglist
---tagID (autoincrement)
---tag
tagmap
---imageID
---tagID

It's single-threaded and definitely less than 1 image per second.
This will take about a week, I guess.
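
In case it helps anyone reading, here's a sketch of that schema with parameterized inserts, using SQLite as a stand-in for MySQL. The "?" placeholders are what fix the single-quote breakage from my earlier post (the offending tag was presumably something like chip_'n_dale_rescue_rangers):

import sqlite3

# The three-table schema described above: images as blobs, a tag
# dictionary, and a many-to-many map between them.
db = sqlite3.connect("dump.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS imagelist (imageID INTEGER PRIMARY KEY, image BLOB);
CREATE TABLE IF NOT EXISTS taglist  (tagID INTEGER PRIMARY KEY AUTOINCREMENT,
                                     tag TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS tagmap   (imageID INTEGER, tagID INTEGER);
""")

def save_image(image_id, blob, tags):
    # parameterized queries, so quotes inside tags cannot break the SQL
    db.execute("INSERT OR REPLACE INTO imagelist VALUES (?, ?)",
               (image_id, blob))
    for tag in tags:
        db.execute("INSERT OR IGNORE INTO taglist (tag) VALUES (?)", (tag,))
        (tag_id,) = db.execute("SELECT tagID FROM taglist WHERE tag = ?",
                               (tag,)).fetchone()
        db.execute("INSERT INTO tagmap VALUES (?, ?)", (image_id, tag_id))
    db.commit()

# arbitrary example values
save_image(12345, b"...image bytes...", ["canine", "chip_'n_dale_rescue_rangers"])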

Updated by anonymous

Wait, you're storing the image files in the database?

Updated by anonymous

Wyvrn said:
Wait, you're storing the image files in the database?

The speed penalty is negligible and fragmentation is no issue, since at this point I am only using last-gen SSDs. That leaves me with the advantages of portability, easier replication/backup, and direct access to search and images.

Updated by anonymous

All I'm doing is using my 100 Mb/s connection to download off the server, using a custom version of Google Hacks which takes every photo off their cloud/local server and downloads it, meaning full quality and the full image. Then I'm packing it into 2 or 3 torrents and seeding until enough people have it. The current size estimate is around 100 GB per torrent, meaning around 300 GB total. I will try to remove some of the extremely weird photos (in my view) and any clean photos; keep in mind I won't be able to get to all of them, there are too many. I will keep people posted until it's finished, and then I'll be seeding at 20-80 Mb/s.

Updated by anonymous

Don't bother taking stuff out. The main appeal in downloading a torrent like this is if it's a full rip of e621, not just a curated subsection.

Updated by anonymous

I'll consider it, but there is a lot of content that no one will enjoy. I might make two versions, one with stuff cut out and one with a full site rip, but that is over 1000K photos. Enough porn for life, haha.

Updated by anonymous

Also, for people who don't know: one, I've used this site for 3 years and didn't need an account; two, my name "ILOVEBLOOD" is a gimmick I've used in all my MMORPG games for the last 13 years and keep for tradition's sake.

Updated by anonymous

ILOVEBLOOD said:
Also, for people who don't know: one, I've used this site for 3 years and didn't need an account; two, my name "ILOVEBLOOD" is a gimmick I've used in all my MMORPG games for the last 13 years and keep for tradition's sake.

If you split it up, you could always have one part be all explicit, one be all safe/questionable, and a third part be all of the thumbnails and metadata, if that's possible.

Updated by anonymous

I would, but: one, each torrent would be 300 GB; two, that's a lot of time sorting through over 1000K photos; and three, people would not just download clean versions of the website, they would want a full site rip.

Updated by anonymous

ILOVEBLOOD said:

I would, but: one, each torrent would be 300 GB; two, that's a lot of time sorting through over 1000K photos; and three, people would not just download clean versions of the website, they would want a full site rip.

Unless you save the tags in some sort of searchable way, a site rip is kind of pointless.

Updated by anonymous

So I have finally downloaded the first 20000 posts. The total size, including the hashes and tags, is 3731460 bytes.

I am going to release the pictures in packs of 50000 on ThePirateBay.

The dump structure is quite simple. I've modified it since my last post:

0000:
  >0045:
     >file.jpg (picture 45)
     >tags.txt (tags for picture 45, one per line)
  >0084:
     >deleted.nx (empty file, marking deleted and undumped post)
0001:
  >0083:
     >file.jpg (picture 10083)
     >tags.txt (tags for picture 10083, one per line)

EDIT: Updated to a new structure. tags.txt now also contains the MD5 hash and, from picture 23315 on, also the rating.

This is tags.txt for file 23315:

md5:115ce24762ae51dfc39952691303b2ad
rating:explicit
os
brandi
canine
fox
2007
against wall
bracelet
breasts
covering
covering self
female
fur
green eyes
hair
headband
jewelry
long hair
looking at viewer
nude
orange fur
piercing
pink hair
purple hair
pussy
sitting
solo
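
For anyone scripting against the dump, a small sketch of a reader for this layout: posts are grouped into directories of 10000 (post 10083 lives under "0001/0083"), and tags.txt mixes "key:value" lines (md5, rating) with plain tag lines.

import os

def post_dir(root, post_id):
    # e.g. post 10083 -> root/0001/0083, post 45 -> root/0000/0045
    return os.path.join(root, "%04d" % (post_id // 10000),
                        "%04d" % (post_id % 10000))

def read_tags(root, post_id):
    # split tags.txt into key:value metadata and plain tags
    meta, tags = {}, []
    with open(os.path.join(post_dir(root, post_id), "tags.txt")) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(("md5:", "rating:")):
                key, _, value = line.partition(":")
                meta[key] = value
            elif line:
                tags.append(line)
    return meta, tags

# e.g. read_tags("dump", 23315)
# -> ({'md5': '115ce...', 'rating': 'explicit'}, ['os', 'brandi', ...])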

Updated by anonymous

If you want to make it usable by other people, I'd recommend creating a GUI program with access to a local database.
You'd offer the core program and several database files separately. The DB files could be directly accessed by the program or imported into a central database via installation/import.
One wonderful technology for that would be a .NET GUI with Microsoft SQL Server .mdf files.
If you are afraid of binary-blob database drawbacks, you can create a simple "installer" or import routine to save the files somewhere locally.

I would do it, but I have no means of uploading multiple gigabytes, since I probably signed an agreement somewhere not to hand out furry porn from company servers...
I'd love to work together with someone who does, though.

Updated by anonymous

Der_Traubenfuchs said:
If you want to make it usable by other people, I'd recommend creating a GUI program with access to a local database.

I have already been planning this for a long time, but everyone else has been moving much more quickly than me. I'm planning on downloading and converting the site into a tag-based file system. At least with this fourth site rip, I hopefully won't have to waste bandwidth redownloading everything from e621 (I'm going to ask you admins a few things through PMs first). I'll have everything finished and nicely polished within a year. Probably most people won't want to wait that long...

Updated by anonymous
