A short explanation of generators with Go channels - Mickaël Vieira - Software Engineer

In computer science, a generator is a function, sometimes called a subroutine, that produces a sequence of values to iterate over. Those values can be the result of a computation. The iteration ends either when the generator stops producing values or when the caller terminates it explicitly, for instance with the break keyword.

In JavaScript, we can declare a generator with the function keyword followed by an asterisk, and produce each value with the yield keyword.

// main.js

function* gen() {
  yield 1;
  yield 2;
  yield 3;
  yield 4;
  yield 5;
}

function main() {
  const g = gen();

  // We can use next() to get the next value,
  // it returns an object with the following shape:

  // interface IteratorResult<T> {
  //   done: boolean;
  //   value: T;
  // }

  console.log(g.next().value);
  console.log(g.next().value);
  console.log(g.next().value);
  console.log(g.next().value);
  console.log(g.next().value);
}

main();

We’ll get the following output.

$ node main.js
1
2
3
4
5

A generator object is both an iterator and an iterable; therefore, we can use a for...of loop to get the same output.

function main() {
  for (const i of gen()) {
    console.log(i);
  }
}

Go does not have generator functions with a yield keyword as a language construct (the range-over-function iterators added in Go 1.23 are a separate mechanism). It is nonetheless possible to mimic a similar behaviour with a channel and a goroutine. Going back to the JavaScript example, we can write a regular function that returns a receive-only channel, sends the integers over it from a goroutine, and gives us the same result.

// main.go

package main

import "fmt"

func gen() <-chan int {
	c := make(chan int)

	go func() {
		c <- 1
		c <- 2
		c <- 3
		c <- 4
		c <- 5

		close(c)
	}()

	return c
}

func main() {
	g := gen()

	fmt.Println(<-g)
	fmt.Println(<-g)
	fmt.Println(<-g)
	fmt.Println(<-g)
	fmt.Println(<-g)
}

Running this code, we get:

$ go run main.go
1
2
3
4
5

In the same way, we can iterate over the channel with range; the loop ends once the channel has been closed and drained.

func main() {
	for i := range gen() {
		fmt.Println(i)
	}
}
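
The values do not have to be hard-coded; they can be computed as we go. As a minimal sketch, assuming a hypothetical fib function that is not part of the original example, the generator below computes the first n Fibonacci numbers and sends each one as soon as it is ready.

```go
package main

import "fmt"

// fib is a hypothetical generator: it sends the first n Fibonacci numbers
// over the returned channel and then closes it.
func fib(n int) <-chan int {
	c := make(chan int)

	go func() {
		defer close(c) // closing the channel ends the caller's range loop

		a, b := 0, 1
		for i := 0; i < n; i++ {
			c <- a // blocks until the caller receives the value
			a, b = b, a+b
		}
	}()

	return c
}

func main() {
	for v := range fib(10) {
		fmt.Println(v)
	}
}
```

Using defer close(c) keeps the channel closed on every exit path of the goroutine, which matters once the body grows beyond a few lines.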

OK, that’s cool, but isn’t it simpler to return an array?

When a function returns an array containing all the values, those values are all held in memory at once; a generator, on the other hand, returns the values one at a time, so the caller only holds the values it is currently working on. In cases where we want to avoid filling up the memory, generators are much better suited. A generator also gives us more control over the iteration: the caller can start processing before all the values exist and can decide to stop early.
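
Stopping early deserves some care with channels. Breaking out of a range loop does not stop the goroutine feeding the channel; it stays blocked on its next send. Here is a minimal sketch of one way to handle this, assuming we pass a context.Context to the generator. The names count and firstN are hypothetical and only serve the illustration.

```go
package main

import (
	"context"
	"fmt"
)

// count is a hypothetical endless generator: it sends 1, 2, 3, ...
// until ctx is cancelled, then closes the channel.
func count(ctx context.Context) <-chan int {
	c := make(chan int)

	go func() {
		defer close(c)

		for i := 1; ; i++ {
			select {
			case c <- i: // the caller received the value
			case <-ctx.Done(): // the caller is no longer interested
				return
			}
		}
	}()

	return c
}

// firstN takes the first n values from the generator and then stops it.
func firstN(n int) []int {
	if n <= 0 {
		return nil
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // lets the generator's goroutine return

	var out []int
	for v := range count(ctx) {
		out = append(out, v)
		if len(out) == n {
			break
		}
	}

	return out
}

func main() {
	fmt.Println(firstN(3)) // prints [1 2 3]
}
```

The select statement is what makes the generator cancellable: the goroutine waits on either a successful send or the cancellation signal, whichever comes first.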

But let’s see how we can use a generator with a concrete example. Imagine we want to process some data stored in CSV files. To look up the files, we can write a function that recursively reads a directory, finds the CSV files, and returns a slice containing their lines.

func walk(dir string) []*line {
	files, err := os.ReadDir(dir) // replaces ioutil.ReadDir, deprecated since Go 1.16
	if err != nil {
		panic(err)
	}

	var lines []*line

	for _, f := range files {
		path := fmt.Sprintf("%s/%s", dir, f.Name())
		if f.IsDir() {
			r := walk(path) // recursively read the directories
			lines = append(lines, r...)
		} else if strings.HasSuffix(f.Name(), ".csv") {
			r := getLines(path) // get the file's lines
			lines = append(lines, r...)
		}
	}

	return lines
}

The getLines function opens the file and returns its content.

func getLines(path string) []*line {
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var lines []*line
	reader := csv.NewReader(f)

	for {
		r, err := reader.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Transform and append the line to the slice
		lines = append(lines, transform(r))
	}

	return lines
}

To keep it simple, we define a struct line to represent a line in a file; the transform function merely turns the slice of strings returned by the CSV reader into a struct. It assumes every record has at least three columns and would panic on a shorter one.

type line struct {
	col1 string
	col2 string
	col3 string
}

func transform(l []string) *line {
	return &line{
		col1: l[0],
		col2: l[1],
		col3: l[2],
	}
}

We can then use our walk function to get all the lines.

func doSomethingWithTheLine(l *line) error {
	fmt.Println(l)

	return nil
}

func main() {
	// Load the lines into a slice and loop through it
	for _, line := range walk("./path/to/csv/files") {
		if err := doSomethingWithTheLine(line); err != nil {
			panic(err)
		}
	}
}

This code works. However, since the function loads all the lines of every file into memory before returning, it can use a lot of memory. Depending on the use case, having to wait for the function to return before we can start processing the values can also be awkward.

Knowing how to create a generator with Go channels, we have a nice opportunity to modify the walk function and optimise the code execution. Instead of returning a slice of lines, we can return a channel, read the directory within a goroutine, and send the lines over the channel.

func walk(dir string) <-chan *line {
	// Create a channel to send the lines over
	out := make(chan *line)

	go func() {
		files, err := os.ReadDir(dir) // replaces ioutil.ReadDir, deprecated since Go 1.16
		if err != nil {
			panic(err)
		}

		for _, f := range files {
			path := fmt.Sprintf("%s/%s", dir, f.Name())
			if f.IsDir() {
				// walk through the child directory
				// and send the lines over the parent channel
				for p := range walk(path) {
					out <- p
				}
			} else if strings.HasSuffix(f.Name(), ".csv") {
				// read the file's content
				for _, r := range getLines(path) {
					// send the lines over the output channel
					out <- r
				}
			}
		}

		// all the files have been read
		// we can close the channel
		close(out)
	}()

	return out
}

Ranging over a channel yields only the values, without an index, so we need to slightly change the main function.

func main() {
	for line := range walk("./path/to/csv/files") {
		if err := doSomethingWithTheLine(line); err != nil {
			panic(err)
		}
	}
}

With this change, the function emits the lines as soon as they are discovered, so they can be processed immediately. Memory use should also stay at a sensible level, since only the lines of the file currently being read are held rather than those of every file (getLines still reads each file in full).
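
We could push the same idea one level down and turn the file reader into a generator as well, so that a large file is never loaded in full. The following is only a sketch under a few assumptions: streamRecords, streamString and the result type are hypothetical names, the function takes an io.Reader instead of a path so that it runs without any file, and errors travel over the channel instead of causing a panic.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// result carries either one CSV record or the error that stopped the reader.
type result struct {
	record []string
	err    error
}

// streamRecords reads r one CSV record at a time and sends each record over
// the returned channel, so the producer only holds the current record.
func streamRecords(r io.Reader) <-chan result {
	out := make(chan result)

	go func() {
		defer close(out)

		reader := csv.NewReader(r)
		for {
			rec, err := reader.Read()
			if err == io.EOF {
				return // no more records
			}
			if err != nil {
				out <- result{err: err} // report the error and stop
				return
			}
			out <- result{record: rec}
		}
	}()

	return out
}

// streamString is a small convenience wrapper so the demo needs no file.
func streamString(data string) <-chan result {
	return streamRecords(strings.NewReader(data))
}

func main() {
	for res := range streamString("a,b,c\nd,e,f\n") {
		if res.err != nil {
			panic(res.err)
		}
		fmt.Println(res.record)
	}
}
```

Sending the error as a value leaves the decision to the caller, who can log it, skip the file, or stop altogether.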

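However, there are subtle differences to bear in mind. A built-in generator computes each value on demand, only when the caller asks for it, whereas a goroutine starts running as soon as it is launched and computes its next value before blocking on the send; with an unbuffered channel it can therefore be one value ahead of the receiver. Furthermore, Go channels can be created with a buffer size; at runtime, the goroutine keeps sending values until the buffer is full, so up to that many values are kept in memory before anything is received. Finally, if the caller stops receiving before the channel is closed, the goroutine stays blocked on its send unless we give it a way to stop.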
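
To make the buffering point concrete, here is a small sketch. The function genBuffered and its filled channel are hypothetical and only exist so that we can observe the buffer reliably: the buffer is sized to hold every value, and filled is closed once the goroutine has sent them all.

```go
package main

import "fmt"

// genBuffered is a hypothetical generator that sends 1..n over a channel
// with room for n values. It closes filled once every value has been sent.
func genBuffered(n int) (<-chan int, <-chan struct{}) {
	c := make(chan int, n)
	filled := make(chan struct{})

	go func() {
		for i := 1; i <= n; i++ {
			c <- i // never blocks: the buffer has room for all n values
		}
		close(c)
		close(filled)
	}()

	return c, filled
}

func main() {
	c, filled := genBuffered(5)

	// Wait for the producer to finish; nothing has been received yet.
	<-filled
	fmt.Println("buffered before first receive:", len(c)) // 5

	// The buffered values can still be received after the channel is closed.
	for v := range c {
		fmt.Println(v)
	}
}
```

All five values sit in memory before the first receive, which is the opposite of the on-demand behaviour of a built-in generator. A smaller buffer bounds how far ahead the goroutine can run.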

In any case, using channels to generate data is a powerful pattern to help us control the code execution.