Wordpress is a very popular PHP-based blogging platform, which uses a MySQL database. It has a certain reputation for slowness, but still sees a lot of use.
Last year, someone looked into what queries Wordpress (1.5) uses in preparing pages; article here. I thought it would be interesting to see how things have changed, so I got the latest version of Wordpress (2.2) and installed it on my desktop, with MySQL set to log all queries.
Now, before I start I should mention that for a couple of years I used Wordpress for this blog. I was never enormously irritated with it, but I did start having considerable speed problems as the blog grew. I was able to deal with some of these by adding appropriate indexes to database tables; however, in the end, it became more trouble than it was worth. I switched to Google Blogger, and have had no trouble since.
Now, on with the query review. Please note that I am by no means a database expert, and that I may have made mistakes.
On displaying a page, Wordpress first finds the site URL in the options table. It then loads all records from the options table which have 'autoload' set to yes; note that the site URL row is one of these. Next, it loads 'widget_text' and 'widget_rss'; neither of these exist on my installation.
It gets the user's details; this is an improvement on 1.5, where, as noted in the above article, it loaded details for all users. This was particularly concerning as some larger blogs allow their commenters to log in as low privilege users to comment; the amount of data retrieved could, then, be considerable.
Oddly, there is no index on the 'user_login' column; this both allows, potentially, for duplicate users (although I'm sure that this is guarded against in the application itself) and means that if you really have a lot of users, it may have to do a fairly large table scan. This won't be a problem for most Wordpress blogs, though. It then gets data for that user from 'wp_usermeta'; this could have been done as a join, but is, at least, done on an indexed column.
Interestingly, it initially looked for a rewrite option in the options table at this point. When I enabled URL rewriting, however, this query vanished, presumably because it got it in its first read of the options table. This seems odd; if it didn't get it the first time 'round, why try again?
The next query finds the last ten posts, and downloads all data on them in the 'wp_posts' table. Fine, although there are no indexes on any of the constraints it uses (even the date!) so this really isn't ideal for a big blog. The one after that, though, is an oddity. It looks up a mapping between posts and categories. This is the sort of thing that I would really expect to have been done with a join; I can't imagine that avoiding a join saves them any runtime, and it's an extra query. The query after that gets all category data, and the query after that all metadata for the posts. All of these except for the category lookup could have been done in one query.
The next three queries retrieve settings for the (default) template. I really don't feel that this is the sort of thing that should be done in the database, but see comments below on configuration data.
Next, pages are retrieved; Wordpress has a separate, non-blog-post type of page. Oddly, all of these are retrieved, with full text, even though they're never going to be displayed on this page.
The next step is to find what months there are posts for, and how many. Again, un-indexed fields are used as constraints, and counts are done. I'd be inclined to store this data in a separate table and only update it when new posts are posted.
Next, categories are found (again, but in a specific order this time) and posts associated with each discovered; again, insufficient indexes. Following that, categories are retrieved again. That's the third time, mind you. It would seem to make more sense to just get them once and perform the ordering in the (PHP) client, but whoever wrote Wordpress doesn't seem to think like that.
Links for the sidebar are retrieved, with a constraint based on the time of day used to find recently added ones. This will, I'm fairly sure, prevent the results of the query being cached, and could be determined in the client code anyway. Oddly, a join is used here.
Mercifully, that is it. The index page is loaded and all is well in the world. "But wait!" you say. What about comments? What, indeed?
Well, to see comments, we'll have to click on a post. Pop! Queries fly high into the air!
First, we have more options and user stuff. Fine. Next, we get the appropriate post. All seems more or less sane so far. Again, less than ideal separate queries, but it was hardly likely that that'd be different here, links notwithstanding.
A little more weirdness; the previous and next post are found (a query each) and then retrieved in full (another query each). I can't fathom the thinking behind this.
Finally, we get to comments. Mercifully, except for the order by date clause, indexes are used here. Big blogs can have a lot of comments, and it would be mad not to have indexes on them.
Similar madness is involved in retrieving posts for RSS feed; a potential concern because popular sites can be hit by thousands of RSS feeds a day.
Interestingly, throughout, there are occasionally '1=1' clauses in the queries. This isn't really a problem in itself, but indicates something weird going on, possibly. I can't face reading the code to find out; I do so dislike PHP!
So, what is my conclusion? Clearly, things have gotten better since the above article was written, but they don't seem to be all the way there yet. There seems to be a general reluctance to use indexes, and complete willingness you use very large numbers of queries. While using large numbers of queries mightn't sound like a big problem, each does take time to parse, and if you're on a shared server you're likely to be contending with other people to submit queries; best get the job done in as few as possible. Some shared hosts actually limit customers to a certain number of queries per day. The multiple options queries, in particular (about eight or ten per page) are odd; they could easily be done in a single query per page.
Putting options in the database is itself a strange move; I suppose the reasoning behind it is to expose the user to as little messing with file permissions as possible, but it is certainly non-ideal from a performance point of view.
At this point you may wonder why I don't fix the problems I see. That is, after all, the dreary cry that people who criticise open-source software in any little detail are honoured with; "go and fix it yourself then". Well, first off, as I say, I am no database expert. I am also no PHP expert, and would prefer to keep it that way, if you don't mind too much.
But besides that, and here, I feel, may lie a large part of the roots of the problem, the Wordpress community (not, I would like to stress, the developers, rather the people who use it) seem to be, largely, extremely offended by criticism. If you look at the Wordpress forums, you will see that people are very much inclined to blame any speed problems encountered on hosts, configuration, or, in a few cases, the sheer alleged awfulness of MySQL!
I just can't understand how, with such a popular but clearly deficient product, these problems haven't been fixed.
If you are one of these people who is very attached to Wordpress, please refrain from hunting me down and killing me. Thanks.
Monday, July 30, 2007
Wordpress DB Query Review - Why, exactly, is it so bloody slow?
Labels:
blogging,
Technology,
wordpress
Subscribe to:
Post Comments (Atom)

7 comments:
Comment from Conor O'Neill:
Ah, so the normal WP forums are as bad? I made the mistake of criticising a recent change in WPMU (the removal of all support for user entered divs and spans in blog posts) and got roasted alive by their attack dogs.Had been considering moving off the platform anyway but that sealed the deal.
Comment from Conor O'Neill:
Doh, replace div/span with class/id. Slow start to day.
Comment from Robert Synnott:
Oh, yes, I didn't even touch on the plugin issue; I've heard of sites requiring over a hundred queries per page with commonly-used plugins.There is, of course, wp-cache, but that really fixes a problem (imperfectly) which shouldn't exist in the first place.
Comment from Michele:
I've seen lots of performance issues with WP both on shared and dedicated servers.Apart from the DB queries there is no caching by default and once you add a couple of plugins you end up with all sorts of interesting problemsI'm seriously considering moving to MovableType 4 once it's stable
Comment from Michele:
It should ship with *sane* plugins and warnings about performance.Movable Type is much easier on the server unless you're doing a rebuild, though they've also improved a lot of that for MT4
Comment from Donncha O Caoimh:
Watch out Robert, I'm looking for you! :)
No, seriously, that's useful info and there have been some serious database upgrades done for the 2.2 release. MySQL's query cache is used a lot more than it used to be. You might not be a db or php guru, but that hasn't stopped many people from getting involved in GPLed projects in the past.
If you use memcached it takes a load off your db server. That's a must-have for any large WPMU site. It was originally developed for use on Livejournal but works equally well on WordPress. A lot of the queries you see are from internal WP functions. As long as your theme or plugin uses those functions they will more than likely be cached in memcached, or WP's internal cache. The first hit on a page may take 20 queries for example, but after that the same page may only require 2 or 3 because of memcached.
Why didn't you ping me or one of the other WP developers with this post? I read about it on Michele's blog!
Comment from Robert Synnott:
Well, I'm hardly the only person to have made these sorts of criticisms, and I'm sure that the Wordpress developers are aware the problem exists. Also, I don't know any Wordpress developers. :)
Now, about using memcached. First, for the vast majority of Wordpress users, I really doubt memcached would be an option; they are presumably largely non-technical, and on shared hosting.
Second, a lot of the problems pointed out above shouldn't exist in the first place; it should be perfectly possible to rewrite the DB access stuff so that fewer overall queries are issued, and in some cases, a considerable speedup could be gained from simply adding a few well-placed indexes.
Using memcached is obviously a solution, but it's an awkward solution to a problem that to a large extent shouldn't exist in the first place.
Post a Comment