Working with Large Datasets
Wikipedia is a beast. My mirror of the English edition is quite old and contains 2.7 million pages with nearly 4 million revisions. The database file is over 7GB. My little server manages to handle it, but not well at all, mostly because MediaWiki is a convoluted mess of wiki software. So, to make it more efficient, I’m currently dumping all the current pages (no revision data) into 16 files. Each record holds the MD5 of the page title followed by the contents of the table entry, serialized and base64 encoded. Once this process is done, another script will read those 16 files and push the data into a database with 256 tables. When a user requests a page, the title will be converted to lower case and MD5-hashed. The first two hex characters of the hash tell the script which table the page is in; it grabs the row, decodes it, and we’re done.
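To make the lookup concrete, here is a rough sketch of the idea in Python. It is not the actual script: the table names (pages_00 through pages_ff), the title_md5 and data columns, the helper names, the use of SQLite, and JSON as the serialization format are all assumptions made for illustration.

```python
import base64
import hashlib
import json
import sqlite3


def title_key(title: str) -> str:
    """Lower-case the title and return its MD5 as 32 hex characters."""
    return hashlib.md5(title.lower().encode("utf-8")).hexdigest()


def table_for(key: str) -> str:
    """The first two hex characters select one of 16 * 16 = 256 tables."""
    return "pages_" + key[:2]  # hypothetical naming scheme


def encode_entry(entry: dict) -> str:
    """Serialize the entry (JSON is an assumed stand-in) and base64 encode it."""
    return base64.b64encode(json.dumps(entry).encode("utf-8")).decode("ascii")


def decode_entry(blob: str) -> dict:
    """Reverse encode_entry: base64 decode, then deserialize."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))


def create_tables(db: sqlite3.Connection) -> None:
    """Create the 256 shard tables, pages_00 through pages_ff."""
    for i in range(256):
        db.execute(
            f"CREATE TABLE pages_{i:02x} (title_md5 TEXT PRIMARY KEY, data TEXT)"
        )


def store_page(db: sqlite3.Connection, title: str, entry: dict) -> None:
    """Insert or overwrite a page in the table chosen by its title hash."""
    key = title_key(title)
    db.execute(
        f"INSERT OR REPLACE INTO {table_for(key)} (title_md5, data) VALUES (?, ?)",
        (key, encode_entry(entry)),
    )


def fetch_page(db: sqlite3.Connection, title: str):
    """Look up a page by title; returns the decoded entry or None."""
    key = title_key(title)
    row = db.execute(
        f"SELECT data FROM {table_for(key)} WHERE title_md5 = ?", (key,)
    ).fetchone()
    return decode_entry(row[0]) if row else None


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_tables(db)
    store_page(db, "Main Page", {"title": "Main Page", "text": "Welcome"})
    # Lookups are case-insensitive because the title is lower-cased first.
    print(fetch_page(db, "main page"))
```

Since MD5 output is close to uniformly distributed, each of the 256 tables should end up with roughly the same number of pages, and a lookup only ever touches one small table.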
DMOZ is set up the same way, although there are no lonely directories in DMOZ. Wikipedia poses a slight challenge in that you’re not guaranteed to be able to find your way to any given article from any starting point. So there is going to be a huge index, and you’ll finally be able to see just what Wikipedia had at the time of the data dump.
The net result is a Wikipedia clone that can run on even low-end servers. It’ll also be much easier to transfer from one server to another. Unfortunately, this process is going to take at least a couple of days, but when it’s done it’ll be worth it. There’s nothing more annoying than waiting half a minute for a page to load. Except knowing that there’s a simple way to improve the performance.