Tuesday, 31 July 2012

Be careful with memcached keys

Using invalid characters in a memcached key can cause disastrous problems on your server. But first, the details...
Memcached's text protocol uses the space character as its delimiter. For example, the following (simplified) line:

set mykey somevalue\r\n

will set the key "mykey" to the value "somevalue". (The real set command also carries flags, an expiry time and a byte count, but the key is always the first space-delimited token.)

If you know that the protocol uses the space character as a delimiter, it is obvious that keys should never contain spaces, as a space will confuse the memcached parser. However, if you use the PHP PECL extension to access the memcached server you don't necessarily know the protocol, since it is hidden from you, and nowhere in the PHP documentation is it mentioned that keys must not contain spaces.
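To illustrate, here is roughly what the server receives when a key contains a space (a hand-written example, not captured from a real client):

set spaced key somevalue\r\n

The parser now treats "spaced" as the key and "key" as the next protocol field, and everything after that is misinterpreted.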


So what is the big deal? Surely using an invalid key to get or set data will simply not work?


Sadly, this is not the case. What actually happens is that your PHP instance/script halts completely and holds on to its Apache thread forever, causing Apache to spawn new threads/workers until your server runs out of resources and completely falls over.


How we discovered the problem.

We have an application that was never supposed to send keys containing spaces in the first place. Sadly, out of the millions of requests received every few hours, some did contain spaces and managed to slip past our checks (mainly thanks to a crappy regex check). I noticed our server would simply start running out of resources while the number of Apache processes kept climbing and climbing.

Debugging the problem took two full days, as I didn't know what was causing the issue and at first couldn't replicate it on my test server. What I did know was that one of the many features we had added to the application used memcached for some additional caching we required, and in the back of my mind I had a feeling that memcached had something to do with the problem. I just couldn't replicate it, not knowing that spaces in keys caused the problem, or that there were keys with spaces in them at all.

Through a process of elimination I eventually found the problem, and below is an example of how you can replicate it yourself. Don't run this on a live server, but you can easily test it on your development machine. This simple PHP script will hang the PHP instance/Apache thread and is one of the very few things you can do to cause such a serious problem:
<?php
    $memcached = new Memcached();
    $memcached->addServer("localhost", 11211);
    echo "Setting key (and hanging instance)<br>";
    $memcached->set('spaced key', 'Some data', 60);
   
    echo "Your script will never reach this point and will never timeout either<br>";
    echo $memcached->get('spaced key');
?>

To recover from the problem simply restart your Apache server:
/etc/init.d/apache2 restart


How you can avoid the problem.

Avoiding the problem is very simple. One method is to convert all keys to valid keys before getting or setting values. As an example, you can use the function below:
function memcachedKey($key) {
    return preg_replace("/[^A-Za-z0-9]/", "", $key);
}


You can then use the function like this:
$memcached->set(memcachedKey('spaced key'), 'Some data', 60);
$memcached->get(memcachedKey('spaced key'));

Another method (which I personally don't like) is to MD5 your keys before setting and getting them. The reason I don't like it is that two different key names can, in principle, hash to the same value, so only use MD5 hashing when you know you won't be dealing with a huge number of different strings as keys.
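For completeness, the hashed approach would look something like this (reusing the $memcached instance from the earlier example):

$memcached->set(md5('spaced key'), 'Some data', 60);
echo $memcached->get(md5('spaced key'));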


Other things to check.

Make sure your key is never empty either, as an empty key produces the same problem.
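If you want to guard against both problems at once, a minimal sketch (the function name and exception choice are my own) could look like this:

function safeMemcachedKey($key) {
    // Strip everything that isn't a plain alphanumeric character
    $key = preg_replace("/[^A-Za-z0-9]/", "", $key);
    // Refuse to touch memcached at all if nothing usable is left
    if ($key === "") {
        throw new InvalidArgumentException("memcached key is empty after sanitising");
    }
    return $key;
}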


Conclusion

I sincerely hope that others who run into the same problem don't have to spend as much time fixing it, and that they come upon this blog post before pulling their hair out. Google, now go forth and do your job. :)

Wednesday, 28 March 2012

PHP arrays and Copy-on-write

What is Copy-on-write and why is it important to understand?

To start with this topic, let's see what happens when we assign one array to another and then change the first element of the first array:

$a = array("apples", "oranges", "peaches");
$b = $a;
$a[0] = "grapes";
print_r($a);
print_r($b);
 

Results:
Array ( [0] => grapes [1] => oranges [2] => peaches )
Array ( [0] => apples [1] => oranges [2] => peaches )
As with most simple variables, arrays are passed and assigned by value; in other words, when we did $b = $a we seemingly created a duplicate of the original array. We know this because we changed the first entry of the first array and the change wasn't reflected in the second array.

If you are used to other dynamic languages, however, you'd expect assigning an array to perform an assignment by reference, so that the array is not duplicated in memory. You would then expect changing one array to also change the other, since they are basically the same array. Duplicating arrays the way PHP appears to do here would concern anyone who cares about memory usage and passes a lot of data around in arrays. You would also rightfully be concerned about the speed impact of PHP duplicating arrays all the time.

But this... this is madness!?

Luckily for us, there is method behind all this madness. PHP uses what is called copy-on-write. This means the array is actually assigned by reference internally, and a copy of the array is only made if one of the arrays is changed later on. When we did $b = $a there was still only one copy of the array in memory, right up to the point where we changed one of them.
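You can see this for yourself with memory_get_usage(). The sketch below is only an illustration; the exact byte counts will differ between PHP versions and platforms:

<?php
    $a = range(1, 100000);
    echo memory_get_usage() . "\n";  // memory with one large array

    $b = $a;                         // no copy is made yet: $b shares $a's storage
    echo memory_get_usage() . "\n";  // roughly unchanged

    $b[0] = "changed";               // the write triggers the actual copy
    echo memory_get_usage() . "\n";  // usage roughly doubles
?>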

There is a question though... If you really don't want PHP to duplicate arrays, ever, should you always explicitly pass/assign arrays by reference, e.g.:
$b =& $a;


Well, yes, you could if you really wanted to, but be careful: passing an array to a function by reference, e.g. function test(&$parameter) {}, is often actually slower than just passing the array the usual way, e.g. function my_function($parameter) {}, since PHP needs to do extra work behind the scenes and can no longer rely on copy-on-write. You can get a rough feel for this with the sketch below.
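This is only a rough sketch, not a rigorous benchmark (the function names, array size and iteration count are arbitrary choices of my own), but it lets you measure the difference on your own machine:

<?php
    function byValue($parameter)      { return count($parameter); }  // by value: no copy unless modified
    function byReference(&$parameter) { return count($parameter); }  // by reference

    $data = range(1, 100000);

    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) { byValue($data); }
    echo "By value:     " . (microtime(true) - $start) . " seconds\n";

    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) { byReference($data); }
    echo "By reference: " . (microtime(true) - $start) . " seconds\n";
?>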

Some links regarding the topic:
Research paper on Copy-On-Write in PHP (PDF)
http://php.net/manual/en/functions.arguments.php
http://www.php.net/manual/en/features.gc.refcounting-basics.php
http://php.net/manual/en/internals2.variables.intro.php

Friday, 3 February 2012

Daily Link Round-up

In this two-part blog post, John McCutchan writes about implementing WebSockets in game servers as a means to provide remote, web-based administration of the game servers.

As the title says, Eliot talks about trying out Twisted Matrix's Websocket functionality. Twisted Matrix itself is a great Python networking framework. As a big fan myself I do suggest you give them a visit. While I am on Websockets, also check out Autobahn Websockets RPC/PubSub.

From the website: "When debugging a web page, the last thing one needs is to have the browser crash under the memory-hogging ability of a plug-in. All web developers have been there with Firebug and its propensity to make a web page either incredibly slow or take the browser down with it.
Firebug is still my web debugger of choice, but Firefox has taken steps towards closing the gap with its new Firefox 10 release. The video below shows off the new features:"

"Dive Into HTML5 seeks to elaborate on a hand-picked Selection of features from the HTML5 specification and other fine Standards. The final manuscript has been published on paper by O’Reilly, under the Google Press imprint. Buy the printed Work — artfully titled “HTML5: Up & Running” — and be the first in your Community to receive it. Your kind and sincere Feedback is always welcome. The Work shall remain online under the CC-BY-3.0 License."

MySQL's (crappy) Spatial Extensions



Introduction
In the world of mapping it is often necessary to determine whether a certain point on a map falls within pre-defined zones or areas, or the opposite: to determine which specific points fall within a particular zone or area. These zones are defined by shapes such as polygons, rectangles and circles/ovals, and are often referred to as geospatial data or objects. Luckily for database designers, many of the popular database engines now support the storage and indexing of these geospatial objects for quick searching.

Since geospatial technology is a rather vast and relatively complex subject in itself, I'm not going to discuss it in great depth here. I do need to mention, though, that most database engines support the OGC (Open Geospatial Consortium) specification for spatial data. You can read more about it here.

MySQL's Spatial Implementation and its Catches
Let me first say that although MySQL is an immensely popular database engine, it does have its drawbacks in some areas. Its geospatial implementation is unfortunately one of those areas. Let's have a quick overview of these issues:

Problem 1: Even in the latest and greatest MySQL 5.6, InnoDB has no support for spatial indexing. Although it supports the storage of spatial data and some functionality around it, any lookup is going to be terribly slow. This means that if you are an InnoDB user you'll have to fall back to MyISAM for fast spatial-index-based lookups.

Problem 2: Geospatial data cannot be dumped or exported. That's right, MySQL stores the geospatial data in a binary format that is completely ignored by mysqldump's --hex-blob and --compatible=target parameters. What you get instead is a whole lot of binary garbage in your dumped text file that you will be unable to use for imports in the future. What this comes down to is that you will have to ignore the table when you do a mysqldump (e.g. --ignore-table=yourdb.yourgeotable) and write your own tool to parse and dump the geospatial data. This issue is absolutely terrible in my books.
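In practice (the database, table and column names here are placeholders of my own) the workaround boils down to something like this:

$ mysqldump --ignore-table=yourdb.yourgeotable yourdb > yourdb.sql
$ mysql -e "SELECT id, AsText(geom) FROM yourgeotable" yourdb > yourgeotable.tsv

The first command dumps everything except the spatial table, and the second exports the geometry as human-readable WKT text that you can re-import with GeomFromText() later.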

Problem 3: Although it has been fixed since, MySQL 5.0.16 had a bug where using geospatial functions on InnoDB tables would literally crash the server instead of just generating SQL errors. If you are still dealing with an old, unpatched MySQL 5.0.16, be aware.

Problem 4: Since you have to use MyISAM for your geospatial indexes, you have no support for transactions.

Problem 5: The limited implementation of the OGC spatial functions can make it really hard or impossible to use MySQL for complex geospatial functionality.

Conclusion
Using MySQL's spatial implementation is OK if you are willing to use MyISAM for your table structure and if you intend to use fairly basic indexing of geospatial features. It is not absolutely worthless, but I really hope that Oracle will allow MySQL to catch up to other database engines in this regard in the near future.

Thursday, 2 February 2012


Daily Link Round-up
The Evil Unit Test (Blog Post)
In this blog post, Alberto Gutierrez complains about the fact that programmers tend to use unit tests religiously, without regard to their usefulness and practicality in every situation.




In this nearly two-hour presentation, Randal Schwartz gives an "Introduction to Git", presented on January 5th, 2012 at the monthly UUASC-LA meeting.



From the SourceForge page: "Alenka is a modern analytical database engine written to take advantage of vector based processing and high bandwidth of modern GPUs:
  • Vector-based processing: CUDA programming model allows a single operation to be applied to an entire set of data at once. 
  • Self optimizing compression: Ultra fast compression and decompression performed directly inside GPU.
  • Column-based storage: Minimize disk I/O by only accessing the relevant data.
  • Fast database loads: Data load times measured in minutes, not in hours.
  • Open source and free."
So far this is still in very early development and lacks some serious features when compared to current database solutions. But it is definitely showing potential in terms of performance and who knows, one day it might just lead to something big.

In the above linked .pdf file from 2003, an email discussion goes on about certain issues with the microsoft.com website. Things get interesting when Bill Gates starts ranting (from page 3) about the trouble users have to go through to download software from the site. Things have certainly changed a lot since then for Microsoft, but I have no doubt things got pushed along significantly when the big boss (who is a geek himself, after all) voiced his utter dissatisfaction in cases such as these (together with some competition from other tech companies). This is the very reason a large company such as Microsoft needs clear direction from the top boss. Of course the opposite is often true as well; a great many businesses have been pushed over cliff edges due to poor management decisions and lack of drive.

A quick look at ApacheBench

Introduction
Benchmarking tools usually perform stress tests on a piece of software to see how it behaves under pressure or heavy load. You can then optimize your source code or server configuration based on the results of the benchmarking tests. There are a number of HTTP server benchmarking tools available, including ApacheBench, Apache JMeter, curl-loader, OpenSTA, HttTest and httperf. Today I will be posting about ApacheBench, known simply as ab.

ApacheBench is a rather simple and basic tool but then again it is quite easy to use. I installed it on my Ubuntu setup using apt-get:
$ sudo apt-get install apache2-utils

Lets take a look at a basic test and its result:
$ ab -n 100 http://localhost/mysite/index.php

The above test makes the same request 100 times to the specified URL. How long the test takes obviously depends on the number of requests you specify, the rendering speed and output size of your site, and the speed of your server/PC and connection. The site I tested is a PHP-based site hosted on my own PC, and the results were as follows:

Server Software:        Apache/2.2.14
Server Hostname:        localhost
Server Port:            80

Document Path:          /mysite/index.php
Document Length:        6937 bytes

Concurrency Level:      1
Time taken for tests:   6.492 seconds
Complete requests:      100
Failed requests:        98
   (Connect: 0, Receive: 0, Length: 98, Exceptions: 0)
Write errors:           0
Total transferred:      737279 bytes
HTML transferred:       693479 bytes
Requests per second:    15.40 [#/sec] (mean)
Time per request:       64.915 [ms] (mean)
Time per request:       64.915 [ms] (mean, across all concurrent requests)
Transfer rate:          110.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    10   65 318.0     20    2895
Waiting:       10   65 318.0     20    2895
Total:         10   65 318.0     20    2895

Percentage of the requests served within a certain time (ms)
  50%     20
  66%     22
  75%     24
  80%     27
  90%     41
  95%     61
  98%   1407
  99%   2895
 100%   2895 (longest request)

The most useful results here are the "Time taken for tests", "Requests per second" and "Time per request" readings. I can, for example, see that the complete test run took 6.492 seconds and that, on average, each request took 64.915 ms (milliseconds).

However, notice that the output also shows that 98 of the 100 requests failed. Does this really mean that almost all our requests failed? Luckily for us it doesn't; the Failed requests figure is a little misleading at first. If you look closely it actually tells us that all 98 failures were Length errors: (Connect: 0, Receive: 0, Length: 98, Exceptions: 0)

All this means is that, compared to the first HTTP response, subsequent responses contained differently sized HTML documents. Since the site I was testing generates dynamic content this is bound to happen, so there is no reason to worry about the reading; the actual test results are still valid.

Concurrent Testing
Unless you only ever have one user visiting your site (kind of like my blog) it is quite meaningless to test your site this way. We need to add some concurrency to simulate multiple users accessing the site at the same time. To do this we use the -c flag to specify the number of concurrent connections, for example:
$ ab -c 10 -n 100 http://localhost/mysite/index.php

This still means we will only be performing 100 test requests, but we will be making 10 requests at a time instead of 1.

Using KeepAlive
There are some catches you will need to look out for when using ab; one of them is that the KeepAlive option is turned off by default. This means that every request sent to the server goes over a new connection, which is terribly slow and affects your test results. If your own site is configured to handle multiple requests over the same connection using KeepAlive, then it makes sense to turn KeepAlive on for your benchmarking tests as well. This can be done using the -k flag. Example:
$ ab -kc 10 -n 100 http://localhost/mysite/index.php
or
$ ab -k -c 10 -n 100 http://localhost/mysite/index.php

Other useful options
Let's take a look at some of the other useful options as well:
  • -A auth-username:password - Supply BASIC Authentication credentials to the server. The username and password are separated by a single : and sent on the wire base64 encoded. The string is sent regardless of whether the server needs it (i.e. has sent a 401 Authentication Required).
  • -e csv-file - Write a Comma separated value (CSV) file which contains for each percentage (from 1% to 100%) the time (in milliseconds) it took to serve that percentage of the requests.
  • -p POST-file - File containing data to POST.
  • -t timelimit - Maximum number of seconds to spend for benchmarking. This implies a -n 50000 internally. Use this to benchmark the server within a fixed total amount of time. Per default there is no timelimit.
  • -w - Print out results in HTML tables. Default table is two columns wide, with a white background. 
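Putting a few of these options together, a typical invocation might look something like this (the URL, credentials and output filename are placeholders of my own):

$ ab -k -c 10 -t 30 -A admin:secret -e results.csv http://localhost/mysite/index.php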
Feel free to look at the official ab man pages (manual) here.

Happy Benchmarking!