Archive for the ‘Geekery’ Category

Thoof Hardware choices

July 3rd, 2007

Yesterday, I posted about the architectural choices made at the software level, and the results of some of the benchmarking that was done prior to launch. A couple readers pointed out that I didn’t mention what sort of hardware we’re running.

The benchmarks from yesterday covered one of our web nodes, a Sun X2200 M2. This is Sun’s 1-4 way 1U Opteron server, which gives us good density and power efficiency. Our web nodes are 4-way 1.8 GHz affairs, with 4 GB of RAM and a simple RAID 1 for reliability. The load balancers are 2-way with 2 GB of RAM and the same processors. Each of our database servers is an IBM x3655 Opteron server, 2U, with a six-drive 15k SAS RAID 10. These are 4-way 2.6 GHz boxes with 8 GB of RAM.

You’ll notice the web nodes are fairly low speed per core. This makes the most sense to us, since it keeps us power efficient. We’re very confident about our ability to respond quickly enough per request, so we instead go wide with the four-processor solution. The databases do much of the heavy lifting in terms of our statistical algorithm, so they get the high-performance CPUs, again going wide to handle many concurrent requests.

Thanks for the questions!

Built to Scale: our Web Architecture

July 2nd, 2007

Hi, I’m Scott Miller, Chief Architect at Thoof. For the geeks out there, I wanted to write a short post about some of the software level decisions we made while developing Thoof. We think this stuff is interesting, so why not share it?

When we set out to design Thoof’s software architecture, we wanted to create a flexible platform for the website. Not just for delivering pages, but for implementing complex functionality quickly and easily, and with performance in mind.

Using the open source Wicket web framework got us a lot of the way there. It lets us modularize components of the site, and gets a lot of the basic plumbing of logic and presentation out of the way. Next, we needed a way to keep the site fast and responsive, independent of some of the more complex processing that underlies Thoof. To do this, we added an asynchronous events system.

The events system gives us two major things. First, the ability to decouple action from reaction. This allows each component of the site to be concerned only with its inputs (dependencies) and its outputs, and to stay blissfully ignorant of who needs its results or who is providing its inputs. In short, it cuts down on spaghetti code.

Second, events allow us to decouple the need for processing from the processing itself. Much of what goes on in a typical request to Thoof matters a great deal, but not to the request that’s occurring Right Now. By firing an event which is processed by a threaded background engine, we can race on through what actually needs to be done during the request. Even better, it means that if something is slowing down the system, it’s not slowing down page generation, just other background tasks.
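
To make the idea concrete, here is a minimal Python sketch of an asynchronous event bus. It is an illustration only, not Thoof's actual Java implementation: the `EventBus` class, the `story_viewed` event name, and the `record_view` handler are all hypothetical. The request handler fires an event and returns right away, while a single background thread does the bookkeeping.

```python
import logging
import queue
import threading


class EventBus:
    """Tiny asynchronous event bus: callers enqueue, a worker thread dispatches."""

    def __init__(self):
        self._handlers = {}  # event name -> list of handler callables
        self._queue = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def subscribe(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def fire(self, name, payload):
        # Returns immediately; the request thread never waits on handlers.
        self._queue.put((name, payload))

    def _run(self):
        while True:
            name, payload = self._queue.get()
            try:
                for handler in self._handlers.get(name, []):
                    try:
                        handler(payload)
                    except Exception:
                        # A failing handler must not stop the background engine.
                        logging.exception("handler failed for %s", name)
            finally:
                self._queue.task_done()

    def drain(self):
        # Block until every queued event has been handled (useful in tests).
        self._queue.join()


bus = EventBus()
view_counts = {}


def record_view(payload):
    # Hypothetical background task: matters a lot, but not to this request.
    story = payload["story_id"]
    view_counts[story] = view_counts.get(story, 0) + 1


bus.subscribe("story_viewed", record_view)


def handle_request(story_id):
    # The page is generated without waiting for the bookkeeping to finish.
    bus.fire("story_viewed", {"story_id": story_id})
    return f"<html>story {story_id}</html>"
```

Note that `handle_request` knows nothing about `record_view`; it only names the event. That is the decoupling of action from reaction described above.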

Controlling the performance of our complex database tasks led to another decision: to use the iBatis SQL abstraction layer. Unlike typical ORM frameworks like Hibernate, or even more “hands off” systems like those employed by Rails, iBatis lets us write the SQL queries. This is more work on the front end, but gives us much greater flexibility in the long run. If a query is performing slowly, we can edit the SQL ourselves, trying different ways of ‘phrasing’ the SQL to govern performance, without affecting the DAO-pattern-based model objects in the application itself. It also lets us take heavy advantage of PostgreSQL views and stored procedures, so that not only can we tweak SQL, we can often do it in the database instantaneously, without having to redeploy the application.
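
The view trick can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses the standard library's `sqlite3` in place of PostgreSQL and iBatis, and the `stories` and `votes` tables, the `story_scores` view, and the `StoryDao` class are all hypothetical. The point is that the DAO queries the view by name, so the SQL behind it can be re-phrased in the database without touching application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE votes (story_id INTEGER);
    INSERT INTO stories (id, title) VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO votes (story_id) VALUES (1), (2), (2), (2), (3), (3);

    -- First phrasing: a correlated subquery per story.
    CREATE VIEW story_scores AS
        SELECT s.id, s.title,
               (SELECT COUNT(*) FROM votes v WHERE v.story_id = s.id) AS score
        FROM stories s;
""")


class StoryDao:
    """Data-access object: knows only the view's name and columns."""

    def __init__(self, connection):
        self._conn = connection

    def top_titles(self, limit):
        rows = self._conn.execute(
            "SELECT title FROM story_scores ORDER BY score DESC, id LIMIT ?",
            (limit,),
        ).fetchall()
        return [title for (title,) in rows]


dao = StoryDao(conn)
before = dao.top_titles(2)

# Second phrasing: a join plus GROUP BY, swapped in at the database level.
# (SQLite needs DROP + CREATE; PostgreSQL offers CREATE OR REPLACE VIEW.)
conn.executescript("""
    DROP VIEW story_scores;
    CREATE VIEW story_scores AS
        SELECT s.id, s.title, COUNT(v.story_id) AS score
        FROM stories s LEFT JOIN votes v ON v.story_id = s.id
        GROUP BY s.id, s.title;
""")
after = dao.top_titles(2)
```

Both phrasings return the same rows, so `before` and `after` match; which one runs faster depends on the database and the data, which is exactly why being able to swap them freely is useful.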

How well does it work? We’d be remiss if we didn’t benchmark it. Under our worst case, a series of page views which we know stress both the web application logic and the database, we were able to push about 50 requests/sec from each web node. Even better, that value didn’t seem to change as concurrency varied during the test from 20 to 200 concurrent requests.

Harnessing Rich Metadata

June 20th, 2007

One of the challenges with Thoof is encouraging good tagging. A good set of tags on the story helps readers navigate the content better, and also feeds into our story selection algorithm to ensure interesting stories are found for each user.

One way we can do this is through our tag suggestion mechanism. This is the auto-completion drop-down you see when you add or edit tags on a story. It lets us not only suggest what you might be introducing as a tag, but also correct common ambiguities and misspellings. To try this out, type part of “Hillary Clinton”. The drop-down will suggest the senator’s full name, guessed using the metadata we’ve collected from tagging and elsewhere.
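
As a rough sketch of how such a suggestion lookup might work (this is not Thoof's actual mechanism), the Python below maps known spellings, including misspellings, to a canonical tag and matches on the typed prefix. The `CANONICAL` table and the `suggest` function are hypothetical, and the data is made up for illustration.

```python
# Hypothetical metadata: known spelling (lowercase) -> canonical tag.
CANONICAL = {
    "hillary clinton": "Hillary Clinton",
    "hilary clinton": "Hillary Clinton",          # common misspelling
    "hillary rodham clinton": "Hillary Clinton",  # alternate form
    "politics": "Politics",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}


def suggest(prefix, limit=5):
    """Return canonical tags with a known spelling starting with the prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    suggestions = []
    for spelling in sorted(CANONICAL):
        tag = CANONICAL[spelling]
        # Several spellings may point at one tag; suggest it only once.
        if spelling.startswith(needle) and tag not in suggestions:
            suggestions.append(tag)
    return suggestions[:limit]
```

Typing `hil` or the misspelled `hilary` both lead to the single canonical suggestion, which is how the drop-down can nudge everyone toward the same tag.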

We can use this information in other ways as well, for example to infer which topics a tag might also imply, such as that Hillary Clinton may imply Politics, and so forth.
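
One simple way to infer such implications is tag co-occurrence: if most stories carrying one tag also carry another, treat the second as implied. The Python sketch below assumes that approach purely for illustration; the post does not say how Thoof does it, and the `STORIES` data, the `implied_tags` function, and the 0.7 threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical tagged stories; each story is the set of tags on it.
STORIES = [
    {"Hillary Clinton", "Politics", "Elections"},
    {"Hillary Clinton", "Politics"},
    {"Hillary Clinton", "Politics", "Health Care"},
    {"Hillary Clinton", "Books"},
    {"PostgreSQL", "Databases"},
]


def implied_tags(tag, stories, min_confidence=0.7):
    """Tags that appear on at least min_confidence of the stories carrying `tag`."""
    with_tag = [tags for tags in stories if tag in tags]
    if not with_tag:
        return []
    counts = Counter(other for tags in with_tag for other in tags if other != tag)
    return sorted(
        other
        for other, n in counts.items()
        if n / len(with_tag) >= min_confidence
    )
```

In this toy data, Politics appears on three of the four Hillary Clinton stories, so it clears the threshold, while Books and Elections do not.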

Testing Thoof’s personalization algorithm

June 15th, 2007

Since we are rather analytical types, we wanted to see exactly how effective our personalization algorithm was. Last year Netflix made available the ratings data for their movies, so we decided to see how good Thoof’s personalization algorithm was at anticipating what movies people would like.

Like any good experiment, we wanted some controls, so we also tried this with some simpler algorithms to see how our approach compared.

For each algorithm, we would use it to select movies to show to a user, and see what proportion of those movies the user actually liked.

The first algorithm was a naive approach of randomly selecting movies. In this case, the average user liked 10% of the movies they saw.

Next we tried a “most popular” approach, taking the most popular movies, and showing them to everyone. Amazingly, the average user only liked 15% of the movies that were deemed globally popular! Bear in mind that this is essentially the story selection algorithm used by today’s biggest user-generated news websites.

Finally, we tested our algorithm, and were very pleased to find that users liked 40% of the movies it selected. A pretty clear improvement over the other approaches.
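
For readers who want to try a comparison like this themselves, here is a minimal Python sketch of the metric and the two baseline selectors. Everything in it is hypothetical: the `LIKES` data is a toy stand-in for the Netflix ratings, the function names are ours for illustration, and Thoof's own algorithm is not shown. A real experiment would also rank popularity on one slice of the data and measure likes on a held-out slice; this toy reuses the same data for both.

```python
import random

# Hypothetical toy data: user -> set of movies that user liked.
LIKES = {
    "ann": {"m1", "m2"},
    "bob": {"m2", "m3"},
    "cat": {"m2", "m4"},
}
CATALOG = ["m1", "m2", "m3", "m4", "m5", "m6"]


def hit_rate(recommend, likes, k):
    """Average over users of the fraction of k shown movies the user liked."""
    rates = []
    for user, liked in likes.items():
        shown = recommend(user, k)
        rates.append(sum(1 for movie in shown if movie in liked) / len(shown))
    return sum(rates) / len(rates)


def make_random(catalog, seed=0):
    # Baseline 1: pick k movies uniformly at random for every user.
    rng = random.Random(seed)
    return lambda user, k: rng.sample(catalog, k)


def make_most_popular(catalog, likes):
    # Baseline 2: show everyone the same globally most-liked movies.
    counts = {m: sum(m in liked for liked in likes.values()) for m in catalog}
    ranked = sorted(catalog, key=lambda m: (-counts[m], m))
    return lambda user, k: ranked[:k]


random_rate = hit_rate(make_random(CATALOG), LIKES, k=2)
popular_rate = hit_rate(make_most_popular(CATALOG, LIKES), LIKES, k=2)
```

A personalized selector plugs into `hit_rate` the same way: any function taking a user and `k` and returning `k` movies can be scored against the baselines.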

Thoof will succeed or fail based on a user’s first impression when they see stories on the front page. Our hope is that, with our personalization algorithm combined with collaborative editing, the quality and relevance of these stories will immediately set us apart from other news and information sources.

- Ian

copyright thoof 2007