What’s wrong with the English Wikipedia?

The question comes back every so often, and legitimately so: why haven’t we updated the English Wikipedia ZIM file in more than a year?

First of all, we are really sorry about that. This is not on purpose, and we are committed to updating this file as much as any of the thousand others we publish. This is a high priority for everyone at Kiwix.

What happened?

We have tried many times, and we have failed. The main reason is that Wikipedia in English is at least twice as big as any other Wikipedia. What we’re talking about here is something like 100,000,000 separate entries (articles, pictures, redirects, …) to identify, index and compress. Each attempt takes about a month of computation, so time flies. This is compounded by the fact that the English Wikipedia is still growing rapidly: our current tech has a hard time catching up and then scaling properly.

This is not a new problem, and we made a lot of performance improvements to MWoffliner (our MediaWiki/Wikipedia scraper) during 2019. We have also fully automated ZIM creation via a newly created ZIM farm (and yes, you read that correctly: until then we had to manually start the update sequence for every single file in our library).

All of this has challenged us quite a bit, and with limited resources we had to make choices. But we are on it, and a lot of our time and effort is currently invested in solving this very specific problem. As a side note, while we try to fix this, we get to solve a lot of smaller issues along the way, which means minor but significant increments. For instance, it is now a lot easier to generate a ZIM file from a non-Wikimedia wiki. So there is that.

It is difficult to say when we will see the end of it: predictions are hard, particularly when they’re about the future. But if you want to help and have TypeScript knowledge, please have a look at https://github.com/openzim/mwoffliner/issues…. Caching is one of the next steps we want to implement to speed up scraping.

In the meantime, we have started to release a range of thematic selections (maths, physics, history, etc.), as well as something called TOP, i.e. Wikipedia’s 50,000 most in-demand and complete articles. We hope this might be an acceptable workaround for many users. Thanks for your patience, and stay tuned!
