Topic: [DEPLOYED CHANGE] Crawlers, Bots, Page Numbers and You.

Posted under e621 Tools and Applications

In the next release of e621, a cap on page numbers will be added. The current cap is page 750, but this number is not final and may change.

If you do not regularly encounter situations where you need to access content beyond page 750, this has no impact on you.

Attempting to access content with a page number above the cap will produce an error. The e621 UI has always swapped to a different pagination system at page 1000 (now being lowered to 750), but it was possible to bypass this swap by manually entering page numbers. Use of before_id will now be mandatory for accessing content beyond page 750.

As a note, before_id has an explicit sorting order applied to it and your requested sorting order will be ignored; the sorting order is post id descending.

If you're looking for all of the information, sorting order is no longer important and a much faster scrolling system can be used. The existing sorting orders negate the need to access high page numbers if you do not need all of the content (scraping).

Please update your applications to make use of before_id if you are doing scraping/scrolling type queries where you need all of the content regardless of how many pages are involved.

To properly use before_id: provide no before_id on the first request, which returns the newest content, then set the before_id parameter to the lowest numerical post id within the results for each subsequent request.
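For anyone updating a scraper, a minimal sketch of that loop in Python follows. It assumes the /post/index.json endpoint mentioned just below returns a JSON array of posts that each carry an "id" field; the tags parameter and the User-Agent string are illustrative assumptions, not a definitive implementation:

import requests

BASE = "https://e621.net/post/index.json"

def scroll_all(tags, limit=320):
    before_id = None
    while True:
        params = {"tags": tags, "limit": limit}
        if before_id is not None:  # omitted on the first request (newest content)
            params["before_id"] = before_id
        posts = requests.get(BASE, params=params,
                             headers={"User-Agent": "example-scraper/1.0"}).json()
        if not posts:
            break  # nothing left; we have scrolled past the oldest post
        yield from posts
        # the next request continues below the lowest id we just received
        before_id = min(post["id"] for post in posts)

Every request in this loop costs the server the same amount no matter how deep into the results it is, which is the point of the change.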

As a note for people using /post/show.(xml|json): it is encouraged that you do not use /post/show.(xml|json) unless you are requesting a small number of posts and explicitly know their ids. /post/index.(xml|json) provides the same information in bulk. Bots that iterate over a large number of post ids using /post/show.(xml|json) may incur additional rate limiting.
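As an illustration of the bulk route, one index call can stand in for hundreds of show calls. The query below is a sketch that leans on the id:<# and order:id search syntax brought up later in this thread:

import requests

# One request for the first 320 posts instead of 320 /post/show.json calls.
posts = requests.get("https://e621.net/post/index.json",
                     params={"tags": "id:<321 order:id", "limit": 320}).json()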

Updated by NotMeNotYou

I'm not sure if it's related to this, but my e6Extend favourites count has now capped at 750 posts even though the actual amount is over 2500.

Updated by anonymous

BlueDingo said:
I'm not sure if it's related to this, but my e6Extend favourites count has now capped at 750 posts even though the actual amount is over 2500.

o_O how do you have -13 favorites on your profile?

Updated by anonymous

treos said:
o_O how do you have -13 favorites on your profile?

He favorited posts that were later deleted?

Updated by anonymous

BlueDingo said:
I'm not sure if it's related to this, but my e6Extend favourites count has now capped at 750 posts even though the actual amount is over 2500.

I'm guessing it does a search for your favorites, sets the per-page count to 1, and reads the resulting page count to get your favorite count. While a nifty idea, I can't guarantee page counts to be accurate in that scenario because of how this works.

Updated by anonymous

I don't see the point in doing this. I despise sites that use page caps, but I let this site slide because I could manually get around it; now you've completely taken away the option. Not only that, but trying to find the before_id correlation to page number doesn't work.

For example: a picture found on page 6000+, id number 697019, when plugged into before_id takes me to the surrounding series of pictures, and I can go back into older pictures, but trying to go back to newer ones takes me back to page 750.

This is way too much of a hassle and I may have to give up on this site as well.

Updated by anonymous

yttocs5991 said:
This is way too much of a hassle and I may have to give up on this site as well.

You're seriously going to consider leaving because of a limitation that is easily circumventable?

Updated by anonymous

BlueDingo said:
You're seriously going to consider leaving because of a limitation that is easily circumventable?

For the past hour and a half I've been trying to work around this, but nothing I try is working on this site.

Updated by anonymous

Add id:<# to your search, replacing # with a post ID, to remove every image with an ID higher than that from the search. For example, id:<697019 hides everything newer than post 697019, letting you pick up where you left off.

Updated by anonymous

I don't want limitations in my search, and it still limits it to 750. Am I missing something here? What's the point in having a cap at all? I used to use this site for archiving furry art and the like, and Paheal for my rule 34.

Updated by anonymous

Neither do we, but having one is not the end of the world.

You can also increase your posts per page count to 320, which will increase the number of accessible posts to 240000 (320 per page × 750 pages) without needing circumvention. Most searches won't have that many posts in them.

Updated by anonymous

I'm not doing searches, I'm doing broad cataloging. From beginning to end. This still doesn't answer the question of why.

Updated by anonymous

yttocs5991 said:
I'm not doing searches, I'm doing broad cataloging. From beginning to end.

Do it in chunks. You'll only need 6 attempts to get everything. Why are you trying to catalog the whole site, anyway?

yttocs5991 said:
This still doesn't answer the question of why.

Ask KiraNoot.

Updated by anonymous

A few days ago abadbird said:
Bookmark/favorite something like order:id id:>1250000 whenever you finish a browsing session, replacing the id # with that of the post where you left off.

This way you don't have to deal with before_id.

order:id reverses the post order, so the post with the lowest ID becomes the first result. id:># is your cutoff point: set # to the last post you saw, and everything at or below it (older posts you don't want to see again) is filtered out.

yttocs5991 said:
I'm not doing searches, I'm doing broad cataloging. From beginning to end.

Now you need to search before beginning a session of cataloging, or deal with before_id. It's no big deal.

I'm guessing the why of it is performance reasons, trimming results so the servers can respond quicker for you and everyone else using the site at the same time.

KiraNoot said:
If you're looking for all of the information, sorting order is no longer important and a much faster scrolling system can be used. The existing sorting orders negate the need to access high page numbers if you do not need all of the content (scraping).

Though I don't really know what all that means.

Updated by anonymous

yttocs5991 said:
I'm not doing searches, I'm doing broad cataloging. From beginning to end. This still doesn't answer the question of why.

Because every few days some yahoo would get the smart idea to hit high page numbers incrementally as fast as they could and the site would stop working for everyone. I don't like soft outages, I like them even less when they repeatedly happen at 4am in my time zone and I have to get up and block somebody so the site continues to work.

I had an explanation of why I made the change in the opening post, but lo and behold, people took it as a challenge and caused several soft outages, had to get blocked, and I removed it from the post. We just can't have nice things.

The problem is this: in order to paginate results, the server has to take every post within the results, sort them, and then pluck the requested items for you. There are some optimizations that can be applied when this happens on lower page numbers; in fact, lots of optimizations. As the page numbers go up, it takes significantly more time to get the results. Using the stable scrolling system of before_id there is no slowdown involved in accessing content, and you will always see all of the content, regardless of whether the page numbers would have shifted or not. The downside, as you have discovered, is that there is no easy way to make a back button, so going back currently relies on your browser's history.
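To make the cost difference concrete, here is a toy sqlite3 demo in Python. It is not e621's actual schema or database engine, only a sketch of offset pagination versus a keyset seek:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO posts VALUES (?)", ((i,) for i in range(1, 100001)))

# Offset pagination: the engine orders the rows and walks past 56,250 of them
# before it can return "page 751". The deeper the page, the higher the cost.
page_751 = con.execute(
    "SELECT id FROM posts ORDER BY id DESC LIMIT 75 OFFSET 56250").fetchall()

# Keyset (before_id) pagination: the primary-key index seeks straight to the
# first id below the cursor, so the cost is the same at any depth. 43751 is
# the lowest id on "page 750" of this toy table, and the query returns exactly
# the same rows as the OFFSET query above.
keyset = con.execute(
    "SELECT id FROM posts WHERE id < ? ORDER BY id DESC LIMIT 75", (43751,)).fetchall()

assert page_751 == keyset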

What abadbird is suggesting is actually what before_id does internally; there is zero difference involved. And you can create an "after_id" version by swapping the sorting order and swapping the comparison.
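In terms of the toy demo above, that hypothetical "after_id" variant is the same query with the sort and the comparison flipped:

# Scroll toward newer posts: ASC instead of DESC, id > ? instead of id < ?.
newer = con.execute(
    "SELECT id FROM posts WHERE id > ? ORDER BY id ASC LIMIT 75", (43750,)).fetchall()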

Updated by anonymous

abadbird said:
Though I don't really know what all that means.

If you want all of the results from a given search, the order they are presented in doesn't matter to you. Sorting them is a complete waste of time. It's like asking somebody at a candy counter for 10 red pieces of candy, 10 green pieces of candy, and 10 orange pieces of candy, and then deciding, "Never mind, just give me the whole jar."

The part about negating the need for high page numbers is about people being curious what the lowest scoring post on e6 is. Some people use order:score and then go to the last page instead of using order:score_asc and getting it on the first page. The former is slow, the latter is fast.

Updated by anonymous

With a default result count of 75 per page and a maximum of 750 pages, one can get 56250 results...

I noticed that if you override the &limit in your query you can go down to a total of only 750 results, but also up to 240000 results with &limit=1000 (replaced with the max?). I'd suggest you limit the actual results instead of the pagination, as the pagination itself shouldn't cause as much load as fetching the results...

the algorithm for that should look something like this:

MAX_RESULT = 56250  # 750 pages * 75 results per page

def page_slice(page, limit):
    first_result = limit * page  # page is zero-based
    if first_result > MAX_RESULT:
        raise ValueError("page beyond cap")
    last_result = limit * (page + 1) - 1
    if last_result > MAX_RESULT:
        last_result = MAX_RESULT
    # both indices are inclusive, so the row count is last - first + 1
    return query_posts(offset=first_result, limit=last_result - first_result + 1)

P.S. Sorry if this should be posted in its own topic.

Updated by anonymous

All I wanted was to see the early posts... and hey:

"Access denied

You may not seek to pages beyond 750."

I'm NOT a bot... I've been here for like... 7+ years.

Is there a way to check the first 100-200 posts without having to open them one by one?

Updated by anonymous

NSFW said:

Is there a way to check the first 100-200 posts without having to open them one by one?

order:id would also work. Basically, any search that makes the posts you want show up first will work.

Updated by anonymous

Writing in a 3-year-old thread is weird, but I always forget how to check really old stuff, and I can't find a way to pin a thread for myself. It should be explained somewhere more accessible (maybe it is and I just missed it).

I am glad that the above poster replied just now, because I was going to report this as a bug, since I had no idea it was a feature.

IMPORTANT: the 750 page limit also applies to the tag list, but there is no way to search by "Oldest" there as far as I know. Could that be implemented?
