As CTO of Pathfinder Media (a startup with no dev resources other than myself), I have the pleasure of designing, developing, administering, and scaling every single insignificant (or life-saving) feature/enhancement/bugfix ever to cross my desk. While it’s definitely awesome to be able to maintain that level of oversight, it can get … well, fairly involved.
Example: One of Pathfinder’s portfolio sites, CentSports.com, is seeing a massive redesign. One seemingly minor part of the redesign is allowing a user to upload an avatar. Yep, that’s it. Upload an avatar. I’m writing a blog post about uploading images, get over it.
Why does this matter? Basically, the sheer scale of things. Let’s say CentSports has 500,000 users, and each avatar averages about 20kB. That comes to just under 10GB worth of images, and even more if we decide to store each image at multiple sizes. Now, let’s say 150,000 of those users visit the site daily, each view their own profile plus four buddies’ profiles, and for some reason have a broken cache (unlikely, yes, but roll with it). That would end up being around 440GB of monthly bandwidth. That really isn’t a big deal, if you have the infrastructure to process the images, serve them out, back them up, etc. Which brings me to my next point:
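(For anyone who wants to check the napkin math, here is a quick sketch. The user counts and avatar size are the guesses from above; the 31-day month and the 1024-based GB are my assumptions to make the totals line up, so treat the output as a ballpark.)

```php
<?php
// Napkin math for the avatar numbers. All inputs are rough guesses.
$users          = 500000;  // total registered users
$avatar_kb      = 20;      // average avatar size, in kB
$daily_visitors = 150000;  // users who show up each day
$profiles_each  = 5;       // own profile + four buddies
$days           = 31;      // assumption: a long month

$kb_per_gb = 1024 * 1024;  // assumption: 1 GB = 1024 * 1024 kB

// Total storage for one avatar per user.
$storage_gb = $users * $avatar_kb / $kb_per_gb;

// Worst case: every profile view re-downloads the avatar (broken cache).
$bandwidth_gb = $daily_visitors * $profiles_each * $avatar_kb * $days / $kb_per_gb;

printf("Storage: %.1f GB\n", $storage_gb);             // Storage: 9.5 GB
printf("Monthly bandwidth: %.0f GB\n", $bandwidth_gb); // Monthly bandwidth: 443 GB
```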
Right now, CentSports has a few solid webservers. They fare decently during peak hours, but are only tuned for the type of traffic currently being served, with not too much breathing room. That being said, keep in mind that image resizing is not a cheap process. Using PHP+gd, if you upload a large (4000×6000) image and downsample it to something as small as 200×300, you could see your process’s RAM usage peak at 40-50MB (per image). That’s pretty steep. Plus, we’ve noticed that during most feature rollouts, the new features tend to take a solid beating immediately after unveiling. The thought of tens of thousands of stampeding image resize requests going through my regular web servers all at once is pretty scary (and something to keep in mind while designing a solution).
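To make the resize step concrete, here is a minimal sketch of a PHP+gd downsample. The function names (`fit_within`, `resize_jpeg`), the JPEG-only handling, and the quality setting of 85 are my own placeholders, not what runs on CentSports; the gd calls themselves (`imagecreatefromjpeg`, `imagecreatetruecolor`, `imagecopyresampled`, `imagejpeg`) are the standard ones.

```php
<?php
// Scale (w, h) down to fit inside a max_w x max_h box, keeping the aspect ratio.
// Images that already fit are left at their original size (no upscaling).
function fit_within(int $w, int $h, int $max_w, int $max_h): array
{
    $scale = min($max_w / $w, $max_h / $h, 1.0);
    return [
        max(1, (int) round($w * $scale)),
        max(1, (int) round($h * $scale)),
    ];
}

// Resize a JPEG on disk with gd and write the result to $dst_path.
// Returns true on success, false if the source could not be read or saved.
function resize_jpeg(string $src_path, string $dst_path, int $max_w, int $max_h): bool
{
    // This is the expensive part: the whole source image is decoded into RAM.
    $src = imagecreatefromjpeg($src_path);
    if ($src === false) {
        return false;
    }

    $src_w = imagesx($src);
    $src_h = imagesy($src);
    [$new_w, $new_h] = fit_within($src_w, $src_h, $max_w, $max_h);

    // Blank truecolor canvas at the target size, then a resampled copy into it.
    $dst = imagecreatetruecolor($new_w, $new_h);
    imagecopyresampled($dst, $src, 0, 0, 0, 0, $new_w, $new_h, $src_w, $src_h);

    // 85 is an assumed JPEG quality; tune to taste.
    return imagejpeg($dst, $dst_path, 85);
}
```

Calling `resize_jpeg('big.jpg', 'avatar.jpg', 200, 300)` on a 4000×6000 upload would produce a 200×300 file. The memory spike comes from the decoded source image, not the small output, which is why a burst of these is worth keeping off the main web servers.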
Cloud, Come Hither!
I’ve been looking for a good excuse to put Mosso’s Cloud Files CDN to the test for a few months now, and here happens to be a perfect (CPU-bound) task to take advantage of today’s latest cloud computing offering. So, today, in a matter of hours (I actually spent more time compiling than engineering), I whipped up a simple but potentially robust-as-hell solution to my problems. Here’s what you do:
- Get a Cloud Files account.
- Get a Cloud Server. Start small, for testing.
- Set up Apache + PHP (with curl and libgd) on your new Cloud Server.
- Create an upload form on your Cloud Server, and in the capture stage implement your resizing functions, then use the Cloud Files API to push the images into the Cloud Files CDN. Quick-and-dirty example:
<?php
...
// "New" filename and location of the image you just uploaded. Probably coming from $_FILES.
$filename = 'existing_file.jpg';
$file_location = 'images/'.$filename;
$auth = new CF_Authentication($cf_user, $cf_api); // set up the CF auth object with your username and API key
$auth->authenticate(); // authenticate against CF
$conn = new CF_Connection($auth); // connect to CF
$images = $conn->get_container('images'); // get the 'images' container
$put_image = $images->create_object($filename); // create the object in CF; its name will be $filename
$put_image->load_from_filename($file_location); // upload the contents of $file_location into the new object
echo $put_image->public_uri(); // URL of the new $filename object; worth storing for later
?>
- Make your app aware of the new image servers. The nice thing about this is that if you create multiple (read: as many as you want) server instances and have the same logic across all of them, you can easily track which servers are the busiest, and your app can direct traffic to each leaf on its own. Very cheap load balancing. Update: you can create new instances from backup images, which should make it much easier to replicate servers (be careful when using IP-based configs).
- Finally: go get a drink. Since both your instances and your data within the CDN are backed up and replicated, you have less maintenance to worry about. Note: I didn’t say no maintenance. At the end of the day, service providers can’t be trusted to be perfect, and it’s still your responsibility as Administrator to maintain off-site backups and remote+hot failover. But that’s a discussion for another day.
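For the "cheap load balancing" step, one way to sketch it: have the app keep a load figure per image server and send each upload to the quietest one. Everything here is hypothetical (the `pick_upload_server` helper, the hostnames, and the idea of counting in-flight resize jobs as the load metric); how you actually collect those numbers is up to you.

```php
<?php
// Pick the least busy upload host from a map of hostname => load figure.
// Returns null when no servers are known.
function pick_upload_server(array $load_by_host): ?string
{
    if (empty($load_by_host)) {
        return null;
    }
    asort($load_by_host);                  // lowest load first, keys preserved
    return array_key_first($load_by_host); // requires PHP 7.3+
}

// Hypothetical image servers and their current in-flight resize jobs.
$servers = [
    'img1.example.com' => 12,
    'img2.example.com' => 3,
    'img3.example.com' => 7,
];

echo pick_upload_server($servers), "\n"; // img2.example.com
```

The main app would then point the upload form’s action at whichever host comes back, so the resize stampede never touches the regular web servers.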
Off-topic rant: all of this is moot if you can’t get your UI to play nice. Man, how I hate cross-host scripting.