Improving performance by avoiding LIMIT in large data sets

This is a repost of http://www.king-foo.com/2015/09/improving-performance-by-avoiding-limit-in-large-data-sets/.

Whether you are using MySQL, MongoDB or another data store, you will have some kind of paging mechanism to scroll through your data set.

This mostly boils down to:

  • Order the data set in some way;
  • Skip the first X records (the offset);
  • Select the next Y records (the limit).

SELECT * FROM users LIMIT 14500,500;

You get the impression that only a slice of the data set is used/returned, but on the server side a lot of work has been done: the database still has to order, read and then throw away all the skipped rows.

This works quite well for the beginning of the data set, but performance drops quickly once you start skipping 100,000 rows to select the next 100 records. For paginating search results this is acceptable, because most users never get past page 6, but for iterating over a complete data set it is disastrous.

The solution

Instead of using paging, you can implement some kind of scrolling or cursoring.

Twitter, for example, uses a max_id parameter to enable this, and Elasticsearch has a scan and scroll mechanism.

Now you have two options to implement this for your local database:

  • Do the selection entirely with a WHERE clause;
  • Or combine a WHERE clause with a LIMIT.

WHERE clause

The easiest way to scroll through your data set is to fetch MAX(id) first, and then select fixed id ranges until you reach that maximum id.

SELECT MAX(id) FROM users;
SELECT * FROM users WHERE id >= 0 AND id < 500;
...
SELECT * FROM users WHERE id >= 14500 AND id < 15000;
...

Using this approach, you can only stop fetching once you reach the max id, because some of the pages can be empty.

This approach does not need an ORDER BY clause, because you explicitly specify the id range you want back.
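
In PHP, the resulting loop could look something like this (a minimal sketch, assuming a PDO connection in $pdo and an auto-incrementing integer primary key id):

<?php
// Sketch: iterate the users table in fixed id ranges (assumes a PDO connection in $pdo).
$maxId = (int) $pdo->query('SELECT MAX(id) FROM users')->fetchColumn();
$pageSize = 500;

for ($start = 0; $start <= $maxId; $start += $pageSize) {
    $stmt = $pdo->prepare('SELECT * FROM users WHERE id >= :start AND id < :end');
    $stmt->execute(['start' => $start, 'end' => $start + $pageSize]);

    foreach ($stmt as $user) {
        // process $user; note that some ranges may return no rows at all
    }
}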

WHERE clause combined with LIMIT

This approach can be used to display pagination on a website, because you cannot have empty pages.

SELECT COUNT(*) FROM users; -- Optional, but needed if you want to show how many pages there are
SELECT * FROM users ORDER BY id LIMIT 500; -- Last id here is 543
SELECT * FROM users WHERE id > 543 ORDER BY id LIMIT 500; -- Last id here is 2003
SELECT * FROM users WHERE id > 2003 ORDER BY id LIMIT 500;
...

The drawback here is that you have to sort the data set, but this is still much quicker than the OFFSET,LIMIT approach.
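
A minimal PHP sketch of this keyset approach (same assumptions: a PDO connection in $pdo and an integer primary key id):

<?php
// Sketch: keyset pagination, using the last seen id as the cursor.
$lastId = 0;
$pageSize = 500;

do {
    $stmt = $pdo->prepare('SELECT * FROM users WHERE id > :lastId ORDER BY id LIMIT ' . $pageSize);
    $stmt->execute(['lastId' => $lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $user) {
        // process $user
        $lastId = $user['id']; // remember the cursor for the next page
    }
} while (count($rows) === $pageSize);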

PHP TestFest 2017


It’s Wednesday 7/6/2017 and Ben Ramsey emails the UG-ADMIN mailing list with the subject “PHP TestFest 2017”. I instantly think back to the 2009 edition, which I attended (at Combell, organized by PHPBelgium).

I had a great time during that edition (I even managed to get some test code into the PHP source) and wondered if we could join this year’s edition with a couple of local user groups.

I co-organize PHP-WVL and currently work for Combell (based in Ghent), so after some chatting with the folks of Ghent PHP, we agreed to join forces and organize this year’s edition back at Combell.

As the date we’d picked came closer, I had to figure out how tests are written and merged into the PHP source these days, so I could “mentor” the visitors. I still remembered how to write phpt files, but had to figure out the rest of the flow.
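
As a refresher, a phpt file is just a plain text file with a few marked sections. A minimal, made-up example testing strrev() could look like this:

--TEST--
strrev() basic behaviour
--FILE--
<?php
var_dump(strrev('PHP TestFest'));
?>
--EXPECT--
string(12) "tseFtseT PHP"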

I started by listening to PHP Roundtable episode 65 in the car, to get up to speed. Next up: the PHP TestFest website and Sammy’s excellent video series.

For specific questions the mailing list is perfect; it helped me out with some questions about code coverage.

I figured out there is a fork of the PHP source where all pull requests with tests should be created: https://github.com/phpcommunity/phptestfest-php-src/pulls.

Figuring out what to test

To find code that needs testing, you have a couple of options:

When you look for uncovered code on the gcov site, there are some things you need to take into account:

  • It is possible that the gcov server does not have all PHP modules loaded (making it appear that lots of code fragments are uncovered), see the --SKIPIF-- section;
  • Some tests use “exec” or “popen”, making them “untrackable” for the coverage tool (the cli_server tests appear to be uncovered for this reason);
  • Not all testfest tests have been merged.

To solve the last problem, I created a fork of the phptestfest fork, where I merged all the approved pull requests into a branch. This way, I can render code coverage for a specific path, including all the phptestfest tests from other users.

Our contribution

After 2 weeks, we managed to submit 13 pull requests, most of which were approved!

#phptestfest

Resources

git-flow, DTAP & semver

Using git-flow in a DTAP environment.

I’m a big fan of Git and I tend to use git-flow in most of my projects. A lot of people don’t like it because it can be very tedious, but even if you only use the develop and master branches, you’re already benefiting from it.

If you have multiple environments to deploy to, you can leverage the git-flow logic to automatically deploy to these environments (in our case DTAP: development, testing, acceptance and production). Using a CI/CD server (like Gitlab CI), we can auto-deploy to specific environments when code gets pushed to the server.

Definitions

Let’s have a look at some definitions first.

git-flow

http://nvie.com/img/git-model@2x.png

  • develop
    • The main development branch
    • Ideally this branch only contains merge commits
  • master
    • This branch represents code in production
    • This branch must only contain merge commits
  • feature/*
    • New development is done in a feature branch
    • This branch is started from develop and will be merged back into develop when it’s done
    • By using a feature branch, you can leverage merge requests
  • release/*
    • This branch sits between develop and master (getting ready for a new production release)
    • This branch is short-lived and totally optional
  • hotfix/*
    • When something is wrong in production, a fix can be produced via a hotfix branch
    • This branch is started from master and will be merged into both master and develop

DTAP

  • Development: local development
  • Testing: beta server
  • Acceptance: alpha server
  • Production: live

Semantic versioning

Semantic versioning is a way of applying versions to your software so it’s clear what impact a new release may have. As described on the homepage (semver.org):

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

When applied to git-flow, versions are added in the form of git tags, and (for now) only the master branch receives tags.

When you merge develop or a release branch into master, you add a MAJOR.MINOR tag to it (only increment the MAJOR number if the change is big enough). When you merge a hotfix branch into master, you increase the PATCH version.

In practice

Now we can configure our CI/CD system to start when we push code to a specific branch. In the Gitlab CI case, it is possible to limit jobs to branches with “only”.

  • develop: auto deploy to Testing
  • release/*: auto deploy to Acceptance
  • master: auto deploy to Production

Tag based CI/CD

You could even limit job execution by using tags. Instead of starting jobs when you push code to a specific branch, you could let developers push to specific branches and kickstart a job by tagging it accordingly (which can be used to throttle builds and/or limit target environments).

Semantic versioning allows additional labels for pre-release and build metadata which we can leverage (1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta < 1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0).

If you have extra environments (e.g. Education, Backup, Staging, etc.), you can now use even more pre-release tags:

Bonus points

Since we’re using semantic versioning and a build system, we can automatically append build metadata to our version numbers. In the case of Gitlab CI, you can write a file containing a version number in the form “1.0.0-rc.2+15” by combining the git tag and the CI_JOB_ID variable.
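
As an illustration, a tiny (hypothetical) PHP build step could combine the most recent git tag with that job id and write it to a VERSION file:

<?php
// Hypothetical build script: version = latest git tag + build metadata.
$tag   = trim(shell_exec('git describe --tags --abbrev=0')); // e.g. "1.0.0-rc.2"
$jobId = getenv('CI_JOB_ID') ?: '0';                         // set by Gitlab CI

file_put_contents(__DIR__ . '/VERSION', $tag . '+' . $jobId); // e.g. "1.0.0-rc.2+15"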

Sort Composer packages for your git merging pleasure

A lot of git merge conflicts occur when multiple developers add lines to the end of a file/list. This also happens in composer.json, AppKernel.php, translation files, config/application.config.php, CSS files, CHANGELOG, etc…

diff --cc composer.json
index 62e875e,0c526d8..0000000
--- a/composer.json
+++ b/composer.json
@@@ -10,6 -10,6 +10,10 @@@
      "require": {
          "ramsey/uuid": "^3.0",
          "roave/security-advisories": "dev-master",
++<<<<<<< HEAD
 +        "beberlei/assert": "^2.3@dev"
++=======
+         "miljar/php-exif": "dev-master"
++>>>>>>> exif
      }
  }

One workaround is that every developer adds new lines or blocks at random places in a list or file, but a better solution is to sort these lines (especially for lists like composer.json, translation files, etc.). This way new lines still end up in “random” places instead of everyone appending to the bottom, and it stays clear to everyone how things are ordered.

diff --git a/changelog.rst b/changelog.rst
index d9db648..886b168 100644
--- a/changelog.rst
+++ b/changelog.rst
@@ -19,7 +19,9 @@ June, 2015
 New Documentation
 ~~~~~~~~~~~~~~~~~

+* `#5423 <https://github.com/symfony/symfony-docs/pull/5423>`_ [Security] add & update doc entries on AbstractVoter implementation (Inoryy, javiereguiluz)
 * `#5401 <https://github.com/symfony/symfony-docs/pull/5401>`_ Added some more docs about the remember me feature (WouterJ)
+* `#5384 <https://github.com/symfony/symfony-docs/pull/5384>`_ Added information about the new date handling in the comparison constraints and Range (webmozart, javiereguiluz)
 * `#5382 <https://github.com/symfony/symfony-docs/pull/5382>`_ Added support for standard Forwarded header (tony-co, javiereguiluz)
 * `#5361 <https://github.com/symfony/symfony-docs/pull/5361>`_ Document security.switch_user event (Rvanlaak)

Full diff on Github

Some tools and hacks that assist you with this:

Something that does not work for composer.json, but that I highly recommend, is to add trailing commas in PHP arrays and to not align code (see the blog post by Made With Love about “How clean are your diffs?”). This makes merging easier too!
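
For example, with a trailing comma after the last element, adding a new entry later only adds one line to the diff instead of also touching the line above it (illustrative snippet, not from a real project):

<?php
// Illustrative only: the trailing comma means a future addition won't modify
// the last existing line, and nothing is column-aligned.
$enabledModules = [
    'catalog',
    'checkout',
    'newsletter', // trailing comma after the last element is allowed in PHP arrays
];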

Happy merging!

Controlling a receipt printer with ZeroMQ and PHP

When you create a POS system as a web app, you have the problem that not all the devices are controllable from your server.

In my case I have the following hardware connected to the workstation:

The input hardware is easy, because a barcode scanner is seen as a keyboard, and both devices can communicate with the server through the web browser.

The tricky part is when a sale is completed, and the customer wants a receipt.

Since the printer is connected to the (dummy) workstation and not to the (remote) server, we cannot print the receipt directly.

The solution I used for this problem is ZeroMQ.

On the server I have a PHP process running which binds to 2 ZeroMQ sockets. One (ZMQ::SOCKET_PULL) is waiting for incoming print requests from the web app, the other (ZMQ::SOCKET_PUB) publishes each print request to all subscribing workstations.
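
The dispatcher itself can stay very small. A sketch of what it could look like (this is not the original code; it assumes the ports used by the two scripts below):

<?php
/**
 * Sketch of the print-dispatcher running on the server.
 * It pulls print requests from the web app and publishes them to all workstations.
 */
$context = new ZMQContext();

$pull = $context->getSocket(ZMQ::SOCKET_PULL);
$pull->bind('tcp://*:5567'); // the web app PUSHes print requests here

$pub = $context->getSocket(ZMQ::SOCKET_PUB);
$pub->bind('tcp://*:5566');  // the workstations SUBscribe here

while (true) {
    $printJob = $pull->recv(); // blocks until a print request arrives
    $pub->send($printJob);     // fan it out to every subscribed workstation
}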

On the workstation (in my case a Windows laptop with an HP receipt printer connected to it) I have another PHP process running which is waiting for print jobs (on a ZMQ::SOCKET_SUB socket).

<?php
/**
 * Receive a print request and print.
 */
$context = new ZMQContext();
$sub = $context->getSocket(ZMQ::SOCKET_SUB);
$sub->setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, '');
$sub->connect('tcp://server:5566');

while (true) {
    $printJob = $sub->recv();
   
    $html = file_get_contents($printJob);
    $file = sys_get_temp_dir() . '/printJob_' . sha1($html) . '.html';
    file_put_contents($file, $html);

    // http://windowsitpro.com/systems-management/how-can-i-print-usb-printer-command-prompt
    exec('print /d:LPT2: ' . $file);
}

I can add the final piece of the puzzle to the web app by opening a ZMQ::SOCKET_PUSH socket and sending the URL of the receipt page.

<?php
/**
 * Send a print request to the print-dispatcher.
 */
$context = new ZMQContext();
$push = $context->getSocket(ZMQ::SOCKET_PUSH);
$push->connect('tcp://server:5567');

$push->send('http://server/sale/receipt/' . $receiptId);

Now I have created a lightweight system for controlling the receipt printer, instead of using cronjobs (or scheduled tasks) to check for print jobs.

Some remarks:

  • I could remove the print-dispatcher part, and let the web app connect to the print receiver directly, but I prefer to have a stable part (the binding sockets) on the known server (so both connecting sockets know the host to connect to).
  • The HTML page could be transported over ZeroMQ, but I like the extra request so the web app is sure the receipt is printed.

Using multiple databases in phpunit/dbunit with composer

I’m using multiple databases in most of my projects, so having access to multiple databases in my test suite is a must.

phpunit/dbunit is excellent, but you are stuck with one database. The guys at Etsy created very good extensions to fix this problem (MultipleDatabase), but it took me a while to figure out how to use them.

Because PHPUnit is now available via Composer, you can fetch all dependencies with a single command.

The composer.json file to include all dependencies looks like this:

{
    "repositories": [
        {
            "type": "package",
            "package": {
                "name": "hamcrest/hamcrest",
                "version": "1.1.0",
                "dist": {
                    "type": "zip",
                    "url": "https://hamcrest.googlecode.com/files/hamcrest-php-1.1.0.zip"
                },
                "include-path": ["Hamcrest-1.1.0/"],
                "autoload": {
                    "psr-0": { "Hamcrest_": "Hamcrest-1.1.0/" },
                    "files": ["Hamcrest-1.1.0/Hamcrest/Hamcrest.php"]
                }
            }
        },
        {
            "type": "package",
            "package": {
                "name": "etsy/phpunit-extensions",
                "version": "0.6.0",
                "dist": {
                    "type": "zip",
                    "url": "https://github.com/etsy/phpunit-extensions/archive/v0.6.0.zip"
                },
                "autoload": {
                    "psr-0": { "PHPUnit_": "" }
                }
            }
        }
    ],
    "require": {
        "php": ">=5.3.2"
    },
    "require-dev": {
        "hamcrest/hamcrest": "1.1.0",
        "mockery/mockery": ">=0.7.2",
        "etsy/phpunit-extensions": "0.6.0",
        "phpunit/phpunit": "3.7.*",
        "phpunit/dbunit": "1.2.*"
    },
    "include-path": ["vendor/mockery/mockery/library"]
}

Install the dependencies by (installing and) running composer:

cd $PROJECT_ROOT
curl -s https://getcomposer.org/installer | php
php composer.phar install --dev
vendor/bin/phpunit --version

Now you can create a “parent” class to register the databases (mine is called DatabaseTest). Make sure you create a getDatabaseConfigs method (which is required and should return an array of PHPUnit_Extensions_MultipleDatabase_Database objects). For the fixtures, I use XML datasets, which look like this.
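
To give an idea of the shape of such a parent class, here is a rough sketch. Assumptions: the base class name follows the extension’s naming convention, the getDatabaseConfigs body and the $pdos bookkeeping are placeholders, and the exact builder API for the Database objects is in the etsy/phpunit-extensions README.

<?php
abstract class DatabaseTest extends PHPUnit_Extensions_MultipleDatabase_TestCase
{
    /** @var PDO[] one connection per database name (filled in by your own bootstrap code) */
    protected $pdos = array();

    /**
     * Required by the extension: return an array of
     * PHPUnit_Extensions_MultipleDatabase_Database objects,
     * one per database/fixture combination.
     */
    protected function getDatabaseConfigs()
    {
        // Build one Database object per database here, each pointing to a
        // connection and an XML dataset (fixture) file.
        return array(/* ... */);
    }

    /**
     * Helper so the regular dbunit assertions (getRowCount(), ...) keep working.
     */
    protected function getConnection($name)
    {
        return new PHPUnit_Extensions_Database_DB_DefaultDatabaseConnection(
            $this->pdos[$name],
            $name
        );
    }
}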

I’ve added a getConnection method, so I can use the same assertions as the normal dbunit testcase (see Database Assertions API):

$this->assertEquals(0, $this->getConnection('dbname')->getRowCount('user'), "Pre-Condition");
$user = new User();
$user->save();
$this->assertEquals(1, $this->getConnection('dbname')->getRowCount('user'), "Inserting failed");

The magic that happens now is quite cool. PHPUnit will read these database configs and use them to make sure all databases and tables are in a known state before every test, and it does this in the following order:

  1. Connect to all databases
  2. TRUNCATE all tables supplied in the database fixture file
  3. Insert all rows supplied in the fixture file
  4. Execute the test
  5. TRUNCATE all tables supplied in the database fixture file

Now you’ll be able to run tests for code making changes in your database without affecting other tests.

Some improvements:

  • Add composer support to Hamcrest (once Hamcrest PHP has moved to GitHub).
  • Add composer support to the Etsy extensions (I’ve submitted a PR, but it won’t work until Hamcrest PHP is on GitHub).
  • See if the getConnection() method can be done in a better way.

SOAP-ERROR: Parsing WSDL

If you’re writing or consuming web services with PHP SOAP, it’s possible you’ll run into the SOAP-ERROR: Parsing WSDL problem at some point.

The complete error string is:
SOAP-ERROR: Parsing WSDL: Couldn't load from 'http://host/service?wsdl' : <specific error>

If you copy and paste the URL (including the ?wsdl parameter) into the browser and you see the WSDL file, the problem lies in the fact that PHP itself cannot reach the host.

When you connect to http://host/service, PHP fetches the XML from the WSDL page via fopen('http://host/service?wsdl') so it can use it to handle the request. In some cases that request is not routed correctly, resulting in the SOAP-ERROR.
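
In other words, the WSDL download happens as a side effect of constructing the client (the operation name below is just a made-up example):

<?php
// Constructing the client in WSDL mode is what triggers the download (and the error).
$client = new SoapClient('http://host/service?wsdl');
$result = $client->someOperation(); // hypothetical operation defined in the WSDL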

Some solutions:

  • Add the hostname to the hosts file of the server (127.0.0.1 hostname).
  • Add the hostname or IP address to the correct VirtualHost (ServerAlias hostname).

You can test the code by adding a file on the server:

<?php
echo htmlentities(file_get_contents('http://host/service?wsdl'));

That way, you know if the server can reach and read the XML file.

Keeping track of your config files

I have known and used Subversion properties (and keyword substitution) for quite a while now, but I never used all of them and mostly stuck with the Id keyword.

This results in a substituted string like this:

$Id: ProductController.php 227 2010-04-28 08:25:32Z jachim $

Because my colleague Arno and I do a lot of server maintenance and configuration, we ended up maintaining a lot of configuration files in a dedicated repository. The big problem here is that you need discipline to check the changes into Subversion and export them into place (manually or with a hook).

To help us point out which files are in the repository and where, we’ve added 2 keywords to every file (keyword substitution has to be enabled via the svn:keywords property):

// $HeadURL$
// $Id$

These get nicely transformed into useful information on the server’s filesystem:

// $HeadURL: http://server/trunk/dnsserver/var/named/chroot/etc/zones/dmz.zones $
// $Id: dmz.zones 1889 2010-05-31 12:26:20Z jachim $

3 years ago in Cambridge

It has been around three years now since I started my internship at Studio 24.

Studio 24 is a Cambridge (UK) based web agency with over 10 years of experience, happy to share their knowledge with students willing to learn.

In the last year of my MCT training at Howest (University College West Flanders), I decided to do my internship abroad, so I ended up in beautiful Cambridge for over 3 months.

With the support of my house mates Leena (SE), Chris (AU), Matthew (UK) & Alex (UK) and my bosses/colleagues Simon, Jonathan, Richard, Edward, David, Dave & Gaëlle, I found my way through Cambridge and the professional web world.

A lot of what I learned in those 3 months I still use and advocate/implement at my current company. Some examples of technologies, procedures, software, etc. I discovered or worked with:

Using all these things and the variety of customers Studio 24 has resulted in an astonishing list of projects I worked on. So many that I could only use a third of them in my presentation for the jury back at home.

Here’s an extract:

I had a great time, and encourage every student to have a look at the Lifelong Learning Programme of the EU, which helped me pay for my accommodation.

Looking back, I’m glad everyone around me ‘pushed’ me into the Cambridge adventure, and I would do it again if I had the chance!

Deleting Subversion repository files (for real)

Keeping files and directories in the repository is one of the key principles of Subversion, so once you’ve committed something, it’s there forever. You can delete files, but they still exist somewhere in the repository, so you can go back in time.

But there is always that time when you’ve (accidentally) committed a password file, a directory full of hi-res images, or some other content you don’t want other people to see and want to get rid of. That’s where the hard part starts…

After searching the internet and checking the Subversion FAQ it looks quite hard, but with some guidance, you’ll find out it’s not.

Finding the problems

First you have to do a (complete) checkout of the repository you want to clean:

svn co http://svn.apache.org/repos/asf/ asf

Now you can start to locate the problems and delete the files/directories (not svn delete!):

rm -Rf subversion/trunk/tools/buildbot;
rm -Rf subversion/trunk/README;
rm -Rf subversion/trunk/build;

When you’re done deleting files and directories, you can generate a list of ‘missing’ files.

Checking your files:

svn status
!      subversion/trunk/tools/buildbot
!      subversion/trunk/README
!      subversion/trunk/build

Generating that list (outside the working copy):

svn status | sed s/"!      "// > ../filter.txt

Fixing the problems

Now that you have a nice list of files to delete (make sure it includes the parent directories, right up to the root), you should log in on the server hosting the repository.

We first want to make sure there is a backup:

svnadmin dump /var/svn/asf > ~/backup_svn/asf.dump

Now we can use that backup file as the input for the svndumpfilter command. In combination with the filter list we’ve generated on the client, we can create a filtered dump:

svndumpfilter exclude `cat filter.txt` < ~/backup_svn/asf.dump > asf_filtered.dump

To load that file back in the repository, we should ‘delete’ the original repository. (The httpd commands are just to make sure no one commits while processing the changes).

/etc/init.d/httpd stop;
mv /var/svn/asf ~/backup_svn/asf;
svnadmin create --fs-type fsfs /var/svn/asf;
svnadmin load /var/svn/asf < asf_filtered.dump;
/etc/init.d/httpd start;

Please note that directories and command line options can be different, but the outcome should be the same.

Now we have the same repository, without the (accidentally) committed files/directories!

New problems

After the filtering, it is possible that complete revisions are empty. It is possible to skip those empty revisions (svndumpfilter has a --drop-empty-revs option for this), but then all revisions are renumbered, and that could be problematic for other software (e.g. Trac).