Lazy evaluation with PHP Generators


After learning different concepts from functional languages, I came to realize that some of them can be applied to PHP code as well. I lastly wrote a small library helping to process CSV data structures. I wrote functions to process collections of data to return Generators. PHP Generators are a kind of iterators, which generate values only as they are needed (i.e. the iterator generates values lazily). All of the functions are designed in a way that they can be combined to a "pipeline". The pipeline itself is a function which returns a Generator. This concept of combining functions to build more complex functions is also called function composition. This is a core concept of any functional programming language, most often there is a very elegant syntax for it. An example in Haskell:

numOfWords = length . words

Here, length is a function which takes a List and returns its size; and words is a function which splits a string into a List of words. I just defined a new function, numOfWords, which composes this two functions into a function which counts the number of words in a string.

While there is no such elegant syntax for this in PHP, the concept can still be applied. Under the hood, the . operator in Haskell is a function itself, which takes two functions as input, and returns a new function. PHP allows us to build such a function ourselves:

function compose($fnA, $fnB): \Closure
{
  return function ($input) use ($fnA, $fnB) {
    return $fnB( $fnA($input) );
  }
}

This does the same as the . function is haskell, except that it applies the functions from left to right instead of right to left. I think it's easier to understand this way in PHP code.

So, to summarize this introduction, we will talk about this two concepts in the following article:

  • lazy evaluation (using PHP Generators)
  • composing functions

If this concepts are new to you, at this point you're probably asking what advantages this concepts can bring to you. By explaining the code I wrote for my CSV Tools, I hope you see where this concepts can be used in the real world to improve PHP code.

What is "functional programming", and what is Haskell?

One or two years ago, when I heard the term "functional programming", I was thinking about something like PHP without objects and classes. I fairly underestimated this paradigm. I don't want to explain it in detail here, but I think the main differences compared to OOP are:

  • Data and operations are strictly separated. FP (functional programming) also has some kind of association between operations/functions and data structures (defined by the function's signatures), but data and functions are not so tightly coupled as in Object oriented classes with its properties and methods.
  • Due to the lack of "living" objects, state can not exist the same way as it does in objects. There are no variables holding values. (Of course you can name your values, and have functions which accept parameters, but you can not have variables in the sense of a reference to a mutable value).
  • The core of a program should be built using "pure" functions, which means they are not allowed to communicate to the outside world. They only transform an input to an output.

If you don't have any experience in FP, you probably find this things very restricting. On the other hand, we all learned that (the right) constraints greatly help building maintainable, secure and performant applications, and the constraints of FP are no different. And it turns out, it is possible to build applications with this constraints! But it is a very different paradigm of programming. That's why it is often not easy for experienced OO Programmers to learn a real functional programming language. But it is also the reason why it is very interesting, and valuable to think about this concepts, to see if -- and how -- we can learn from them.

Haskell is a very well known functional programming language, which I'm currently learning. Throughout the article, I will use a few code examples in Haskell, which I think are not too difficult to understand (with some explanation). If you have questions about them, just ask, and I'm happy to explain them in some more detail.

PHP Generators: A brief introduction

Now, back to PHP. There are two things you need to know when using generators in PHP:

  • Generator is a type, but not really a class. It can not be instantiated with the new keyword.
  • Objects of type Generator are created by functions which use the yield keyword.

Example:

function randomInteger($from, $to, $numOfValues = 10): \Generator
{
  while ($numOfValues > 0) {
    echo "here you get the next integer value..";
    yield rand($from, $to);
    $numOfValues--;
  }
}

// does not generate any output...
$myRandGenerator = randomInteger(100, 199);

Here we first defined a function randomInteger which creates a generator object. When calling this function in the last line, we do not yet execute the body of this function. That is, rand is not called in the above code! Neither do we see any output echoed. But, we now have a variable myRandGenerator which references a Generator object. As this type implements the Iterator interface, we can get the current value with the current method, and forward the iterator with the next method:

// This executes the body of the `randomInteger` function until it knows
// the first value to "yield".
// It prints "here you get the next integer value..." to the console
$val = $myRandGenerator->current();

// prints an integer value in the range of 100..199
echo $val;

// prints the same as $val, but does not "echo" anything
// from the body of the generator function
echo $myRandGenerator->current();

// forwards the pointer, continues to evaluate the generator function
// until it knows the next value to "yield". Therefore, it
// again prints the "here you get ..." message.
$myRandGenerator->next();
// prints another integer
echo $myRandGenerator->current();

When I first used this feature, it took me some time to get used to this execution model. If generators are new to you, you would probably want to try a little bit with this things to get used to it. For this article though, it's sufficient to you understand the theory. So let's continue.

How Generators allow "lazy evaluation"

The main advantage of generators is, that they allow to process lists of values using multiple functions without calculating the whole list upfront. We're all used to this kind of programming:

// returns an array with the file's lines
$words = file('/usr/share/dict/words');

function countWordsWithLetter($letter, array $words)
{
  $cnt = 0;
  foreach ($words as $word) {
    // is $letter part of the $word?
    if (false !== strpos($word, $letter)) {
      $cnt++;
    }
  }
  return $cnt;
}

echo countWordsWithLetter('f', $words);

The statements are executed sequentially. First, file reads the given file (a file listing all the words of a system's dictionary) into an array. Each line of the file is an array item. Then we pass this array to the counting function (countWordsWithLetter). The problem is, that we need to read the whole file into memory to be able to pass the array of words to the counting function. We can do better with the help of a generator:

// this function returns a generator which produces a new
// word from the dictionary file on each iteration
function readFileLines($path): \Generator
  $fileHandle = fopen($path, 'rb');
  while (false !== ($ln = fgets($fileHandle))) {
    yield $ln;
  }
}

// this is now our generator
$words = readFileLines('/usr/share/dict/words');

function countWordsWithLetter($letter, \Generator $words)
{
  $cnt = 0;
  foreach ($words as $word) {
    // is $letter part of the $word?
    if (false !== strpos($word, $letter)) {
      $cnt++;
    }
  }
  return $cnt;
}

echo countWordsWithLetter('f', $words);

As a Generator implements the Iterator interface, we can iterate over it with foreach, the same way as we did with arrays. The main difference is how we read the file. We read it line by line. The execution sequence is now:

  • When calling readFileLines we get back an object of type Generator. This is done without executing the body of the function readFileLines.
  • We call the function countWordsWithLetter, passing the character f and the Generator $words as arguments.
  • In the function's body, there is a foreach, so it needs to evaluate a first value from the generator, to run the first iteration.
  • In order to do this, the readFileLines function body is executed.
  • Now is the time, the file /usr/share/dict/words is opened. Note that if we would not have called foreach on the generator object $words, the file would never be opened/read!
  • In the readFileLines function, We read the first line of the file with fgets. Then we "yield" that line. This is all we need to do for now, so the evaluation of this function stops here, and we go back to the foreach loop of the countWordsWithLetter function.
  • If the letter "f" happens to be part of the read $word, we increase the counter $cnt.
  • Then we go the the next iteration, so we need to fetch another item from our generator $words.
  • This means we continue with executing the readFileLines method, until we have another word. This is done by iterating in the while loop, and reading the next line with fgets.
  • The program continues with the foreach loop of countWordsWithLetter, and from now on keeps switching between the two loops of the functions readFileLines and countWordsWithLetter to read and process the file line by line.
  • Important: After each iteration we don't need the previous line anymore, so we do not need to keep the whole file in memory!
  • When done reading the file, fgets returns false. This closes the generator, and calling the iterator method valid on it will return false (this is done internally by the foreach loop, so it knows when to stop).
  • The foreach now stops iterating, and the final $cnt value is returned and printed with echo.

I hope this gave you an idea what lazy means in this context: The values are only read from the (file) input when needed. "Needed" means that we force evaluation by accessing the contents of a generator. The opposite, by the way, is called eager and means that we "eagerly" read the whole file upfront (as in the previous version, where we read the file into an array using the file function).

Composing functions which operate on Generators

The above example was relatively simple, in that it only used the generator to calculate a final value (the count value). In my use case, I wanted to flexibly process data sets read from CSV files. I wanted to do -- and combine -- operations like:

  • filtering
  • mapping values
  • selecting fields
  • run aggregate functions over all rows (e.g. counting, or summing the values of a given column).

In order to do that, we need a common way to abstract all of this operations. We can then compose them to a pipeline. Let's define a signature for all our data transforming functions:

function myOperation(\Generator $inputData): \Generator

Each operation takes a Generator, does some processing, and returns a Generator. Now we define a function compose which combines those operations to a "pipeline":

function compose(...$ops): \Closure
{
  return function (\Generator $data) use ($ops): \Generator {
    foreach ($ops as $operator) {
      $data = $operator($data);
    }
    return $data;
  };
}

The compose function takes an array of operation functions as inputs. All of this functions should have a signature as defined above: Take a Generator as input and return another Generator. We then return a function (Closure) which itself adheres to this operation interface: It takes a Generator as input and returns a Generator. The transformation it does is to just apply each of the input operations to the data set, one after the other. This is implemented here by iterating over the $ops array, so it combines the operations from left to right. We use the ... operator in the function signature, which just means that we can pass as many arguments as we want, and they are accessible as an array in the function body.

Let's see an example of how the compose function can be used:

$opA = function(\Generator $data): \Generator
{
  $cnt = 0;
  foreach ($data as $row) {
    $cnt++;
    echo "apply operation A to row $cnt";
    $newRow = doSomething($row);
    yield $newRow;
  }
};

$opB = function(\Generator $data): \Generator
{
  $cnt = 0;
  foreach ($data as $row) {
    $cnt++;
    echo "apply operation B to row $cnt";
    $newRow = doSomething($row);
    yield $newRow;
  }
};

// the same pattern here
$opC = ... 

$pipeline = compose($opA, $opB, $opC);

We now have a $pipeline which is a function to transform an input Generator to an output Generator. It does this by applying the operations $opA, $opB, and $opC, one after the other to each row of our input generator. If we would run this, the output would be:

apply operation A to row 1
apply operation B to row 1
apply operation C to row 1
apply operation A to row 2
apply operation B to row 2
apply operation C to row 2
...

But that's a very limited signature for operations...

For this to work, we need to have this rather strict constraint on the signature of operations. Say we want to have a "select fields" operation, which selects some fields / columns of the input rows of our CSV file.

// this is an example data set (here written as array instead
// of a generator)
$data = [
  [
    'id' => 111,
    'name' => 'lise',
  ],
  [
    'id' => 112,
    'name' => 'claudio'
  ],
];

We want to be able to just select the name field and remove the ids. The first implementation would look something like the following:

/**
 * Filters the rows yielded by the $data generator
 * and removes all "columns" which are not contained
 * in $fields
 */
$select = function (array $fields, \Generator $data): \Generator
{
  foreach ($data as $row) {
    // a $row is a map of fieldname => value
    $newRow = [];
    foreach ($row as $f => $value) {
      if (in_array($f, $fields)) {
        $newRow[$f] = $value;
      }
    }
    yield $newRow;
  }
};

We have a problem here: this operation does not match the expected signature.

... Closures to the rescue!

We can write a function which kind of converts the actual signature to the expected one:

function buildSelectOp(array $fields): \Closure
{
  // we return a function that matches the expected
  // signature (Generator -> Generator).
  // The additional parameter $fields is attached to the
  // closure using the `use` keyword.
  return function (\Generator $data) use ($fields) {
    foreach ($data as $row) {
      $newRow = [];
      foreach ($row as $f => $value) {
        if (in_array($f, $fields)) {
          $newRow[$f] = $value;
        }
      }
      yield $newRow;
    }
  };
}

// here we create an operation using the 'buildSelectOp' function.
$selectOp = buildSelectOp(['name']);

// ... then use it as a normal operation to build a pipeline
$pipeline = compose($opA, $opB, $selectOp);

Problem solved! We've splitted the two input parameters of the original $select function to two function calls: The $fields parameter is passed to the buildSelectOp function, which returns a closure function (here stored in the $selectOp variable). Later, when the whole pipeline is evaluated, the second parameter (the generator $data), is passed to this closure function to calculate the final (or next) generator. Because this is difficult to understand the first time, here's another view of our new buildSelectOp function:

$data = someInputGenerator();
$newGenerator = (buildSelectOp(['name']))($data);

// is the same as:
$data = someInputGenerator();
$newGenerator = $select(['name'], $data);

With buildSelectOp we can partially apply our arguments! First the fields (which are known upfront) and then -- later, when the whole pipeline is evaluated -- the generator which yields the data to operate on. This is again a very common pattern in functional programming. In Haskell for example, every function takes exactly one parameter and returns exactly one result value. Functions which look like they would take two parameters, are actually something like a "Closure returning" function, similar to buildSelectOp. A simple example is:

-- this takes two arguments, x, and y, and returns the sum
add x y = x + y

This looks like there are two arguments (I've even written this in my comment...!). In reality though, this is just syntactic sugar, and add is a function which takes one argument, x, and returns a function (a "Closure"), which also takes one argument, y, and returns the sum of x and y. While this sounds unnecessarily complicated at first, it turns out to be very useful when it comes to composing functions in a flexible way! We don't have such a nice syntax to do this in php, but with Closures we have the possibility to do the same thing.

Further examples

If you want some more examples to get used to this style of programming, please read the code of the CSV tools library, which I wrote recently. The use case I had was that I wanted to do some analysis of CSV data, either by a simple script or even interactively (I use the psysh REPL for interactive PHP programming).

When and why to use Generators and function composition

Lazy evaluation is very useful if you operate on possibly large sets of data, and you don't need to have them all in memory. You can save a lot of memory this way.

Function composition is useful if you are transforming data in a series of steps. The type of functions that make sense to "compose" are often those you would pass as callback to functions like array_map, array_reduce, array_filter, and so on. Additionally, function composition offers the possibility to build more complex operations in terms of this "simple" transformation steps. (This functions do not necessarily have to operate on Generator data structures, it's also possible to compose functions that operate on other Iterators or arrays).

When starting with my CSV tools, I had a clear use case in mind: processing CSV data. Later, I realized that I'm building a much more general library to define (possibly complex) data transformation logic. This is because the "high level" operations for processing data are mainly mapping, reducing (or "folding"), and filtering. What distinguishes applications are the definition of the actual filter, mapping, reducing operations, which are passed as callbacks by the client code.

That's what comes to my mind right now, but in theory, you can do almost everything with this tools. I want to encourage you to think yourself, when you're coding the next time, if this concepts could be useful for what you're building. You may also want to have a look at Pramda, a library which also implements some useful functions to build processing pipelines.

Conclusion

While there are certain limitations, we can use many of the concepts from functional programming in PHP. I hope I could show that lazy evaluation and function composition are really useful tools when applied to the right problems.

Author: Claudio Kressibucher
Tags: PHP, Functional Programming