The Directory

I’ve decided to host The Directory again. This time the “adult” section redirects to the home page. Once the database has finished populating, I’ll just delete the adult top link along with any top links that have no content.

Setting up the site was actually pretty painless. I just had to download the latest RDF files from dmoz.org and get the PHP script that parses them and populates the initial database. The script has bugs in it, which is lame considering the latest version is three years old. They prevent the script from running, but I was able to fix them easily enough.

The second problem is getting the actual link information. There is a table with basic information that is joined to another table on the URL. The problem is that the URL is a TEXT field, which can’t have a regular full-column index (in MySQL, for instance, a TEXT column can only be indexed on a prefix). This means it takes a very long time to get URL information. I ran some quick queries and determined that only about 1,000 of the 4.5 million links had a URL longer than 255 characters. So I deleted all those links, changed the field to a 255-character VARCHAR, and gave it an index.
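The cleanup can be sketched roughly like this. It's only an illustration and not the script I actually ran: it uses Python with an in-memory SQLite database as a stand-in for the real one, and the `links` table, its columns, and the `trim_and_index_urls` helper are all made-up names.

```python
import sqlite3

MAX_URL_LEN = 255  # target VARCHAR length from the post


def trim_and_index_urls(conn, max_len=MAX_URL_LEN):
    """Delete links whose URL is longer than max_len, then index the URL column.

    Returns the number of rows deleted. The table name `links` is hypothetical.
    """
    cur = conn.execute("DELETE FROM links WHERE LENGTH(url) > ?", (max_len,))
    deleted = cur.rowcount
    # SQLite stand-in for the real schema change. On MySQL the equivalent
    # would presumably be something like (assumption, untested here):
    #   ALTER TABLE links MODIFY url VARCHAR(255) NOT NULL, ADD INDEX (url);
    conn.execute("CREATE INDEX IF NOT EXISTS idx_links_url ON links (url)")
    conn.commit()
    return deleted


def demo():
    """Build a tiny sample table, count the long URLs, and clean them up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT, title TEXT)")
    rows = [
        ("http://example.com/a", "A"),
        ("http://example.com/" + "x" * 300, "too long"),
        ("http://example.com/b", "B"),
    ]
    conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
    long_count = conn.execute(
        "SELECT COUNT(*) FROM links WHERE LENGTH(url) > ?", (MAX_URL_LEN,)
    ).fetchone()[0]
    deleted = trim_and_index_urls(conn)
    remaining = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
    return long_count, deleted, remaining
```

Counting first is the important part: it tells you how much data you lose before you commit to the shorter column.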

The final step is to go through the structure and pull down all the information for each directory. That information is then serialized and stuffed into a simple table with the title MD5'd as the index. So when you visit a page, the directory you’re in is MD5'd and the information is pulled from the database in a single fast query. This avoids the expensive joins for all the URLs at page-view time, since that work has already been done.
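Here is a minimal sketch of that lookup table, again in Python with SQLite standing in for the real database. The `dir_cache` table, the function names, and the use of JSON for serialization are assumptions made for the example; the actual site may serialize differently.

```python
import hashlib
import json
import sqlite3


def cache_key(title):
    """Return the MD5 hex digest of a directory title, used as the row key."""
    return hashlib.md5(title.encode("utf-8")).hexdigest()


def make_cache():
    """Create an in-memory stand-in for the precomputed directory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dir_cache (id TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def store_directory(conn, title, info):
    """Serialize the already-joined directory info and store it under its key."""
    conn.execute(
        "INSERT OR REPLACE INTO dir_cache (id, data) VALUES (?, ?)",
        (cache_key(title), json.dumps(info)),
    )
    conn.commit()


def load_directory(conn, title):
    """Fetch a directory with one primary-key lookup; None if it isn't cached."""
    row = conn.execute(
        "SELECT data FROM dir_cache WHERE id = ?", (cache_key(title),)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Usage would look something like `store_directory(conn, "Top/Computers", {"links": [...]})` while populating, then `load_directory(conn, "Top/Computers")` on each page view. The fixed-length hash keeps the key short and uniform no matter how long the directory title is.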

What isn’t fast is populating that table, since nearly 750,000 directories have to be processed. The entire directory should be up in about 3-4 hours, which works out to roughly 50 to 70 directories per second.
