London

16 High Holborn
London
WC1V 6BX
01235 821 160
View map

Edinburgh

1 Lochrin Square
92 Fountainbridge
Edinburgh
EH3 9QA
01235 821 160
View map

Dublin Office

77 Lower Camden Street
St Kevin’s
Dublin 2
Ireland
D 02 XE 80

View map

Administrative Office

BH Office
Church Street
Ardington
Wantage
Oxfordshire
OX12 8QA
01235 821 160
View map

Send us a message
CLOSE X
Contact Us
07.11.2021

Dynamic Array Formulas and Performing a ‘Left Join’ in Excel

Performing a left join is a simple operation for a database. It can also be done using the Power Query add-in in Excel. But can you do a left join on two tables of unspecified sizes with an Excel formula? At APR we ask questions like this one from time to time to see where it may lead us. In this case we learned something about dynamic array formulas. The following describes what we learned and a solution to the challenge.

Those who have delved into some of the more complex functionality of Excel are likely to have at some point in time encountered array formulas. Also known as “Ctrl-Shift-Enter” or “CSE” formulas (from the keyboard shortcut required to input them), they were used to perform calculations on one or more items in an array.

Legacy array formulas – as they are now known – were indicated by curly brackets, and required the entire output range to be selected upon entry (with the formula itself in the first cell of the range). For example, the formula {=A1:A10*B1:B10}, if entered in cells C1:C10, would return A1 * B1 in cell C1, A2 * B2 in cell C2, and so forth until A10 * B10 in cell C10. Array formulas which could be contained within a single cell were also possible – for example, the SUMPRODUCT function.

The sizing of these legacy array formulas was notoriously fiddly. If the output range was too small, the output would be truncated (with no indication that this had occurred), whilst if the output range was too large, then #N/A errors would result, and it was not possible to shrink an array formula once it had been entered without deleting the formula and starting over. The pitfalls of inappropriately sized legacy array formulas are illustrated in the screenshot below:

However, Microsoft addressed these issues in Excel 365 with the introduction of dynamic array formulas. Unlike multi-cell legacy array formulas, these formulas are able to be placed in a single cell, and will return as output a dynamic array which dynamically resizes (or ‘spills’) over multiple cells to fill the required amount of space. If the required amount of space is not available, a #SPILL! error will occur.

A dynamic array formula will automatically be generated by Excel 365 whenever a calculation takes place that involves one or more arrays, as seen above – the blue border indicates the range that the dynamic array output will cover.

Excel 365 also introduces a number of functions specifically intended for generating or working with dynamic arrays, including (but not limited to) SEQUENCE, SORT and FILTER.

The B3# syntax in the screenshot above refers to the dynamic array that starts in cell B3.

(Excel 365 also introduces the LET function; this does not relate specifically to array formulas, but is another very useful function which allows users to assign names to calculation results. This can significantly improve the readability of an otherwise intractable formula, and will be used in the ‘left join’ example given later in this article).

It’s not difficult to see how the introduction of dynamically resizing arrays removes some fairly significant limitations in the use of array formulas; an output from a cashflow model may output a  varying number of rows each time, but a dynamic array formula can handle this output smoothly without requiring significant spreadsheet reorganization or tedious resizing.

However, it’s not the case that dynamic array formulas are without their limitations; for example, linked dynamic array formulas between workbooks will return a #REF! error when refreshed if the source workbook is not open.

With these new tools in our arsenal (and some creativity), we are able to perform operations in Excel that may previously have been impossible without the use of VBA macros or other external functionality.

Performing a Left Join Using Dynamic Arrays

As a specific example, let us look at the concept of performing a ‘left join’ in Excel, and how this can be achieved using dynamic array formulas. Joins – a concept originating from relational databases – combine columns from multiple tables into a new table, ‘joining’ on a shared field. A left join, specifically, contains all rows from the left table, and any rows from the right table which matched an entry in the left table.

Consider the following two tables – one of books, with fields BookID, Title, and AuthorID, and one of authors, with fields AuthorID and Name (the duplication of ‘War and Peace’ in the Books table is intentional).

In SQL, the syntax for a left join would be as follows:

SELECT *
FROM Books
LEFT OUTER JOIN Authors ON Books.AuthorID = Authors.AuthorID

In Excel, to achieve something similar, we start with a couple of SEQUENCE functions, and then progress from there.

First, we consider the formula:
=SEQUENCE(ROWS(Books)*ROWS(Authors) + ROWS(Books))
(We’re using the named ranges Books and Authors to refer to the Books and Authors tables).

This will produce a dynamic array of length (10*8) + 10 = 90 – every possible pairing of rows from the two tables, and then an additional row for each of the books considered individually. In our LET function, we will refer to this array as x.

By performing certain operations on this array x, we can iterate through the various permutations of combinations of rows from the two tables.

The formula:
=LET(x, SEQUENCE(ROWS(Books)*ROWS(Authors) + ROWS(Books)), MOD(x – 1, ROWS(Books))+1)
will cycle through the integers 1 to 10, 8 + 1 = 9 times – corresponding with the 10 rows of the Books table, and the 8 rows of the Authors table.

The formula:
=LET(x, SEQUENCE(ROWS(Books) * ROWS(Authors) + ROWS(Books)), IF(x <= ROWS(Books) * ROWS(Authors), FLOOR((x-0.0001) / ROWS(Books),1)+1, NA()))
will repeat each of the integers 1 to 8 10 times consecutively, followed by 10 rows of ‘#N/A’.

These can then be combined with the INDEX formula to select the corresponding rows from the Books and Authors tables respectively.

=LET(x, SEQUENCE(ROWS(Books)*ROWS(Authors) + ROWS(Books)), INDEX(Books,MOD(x – 1, ROWS(Books))+1,SEQUENCE(1,COLUMNS(Books))))

=LET(x, SEQUENCE(ROWS(Books)*ROWS(Authors) + ROWS(Books)), INDEX(Authors, IF(x <= ROWS(Books)*ROWS(Authors), FLOOR((x-0.0001)/ROWS(Books),1)+1, NA()),SEQUENCE(1,COLUMNS(Authors))))

We can then nest these formulas inside our existing LET function, giving them the aliases ‘BookRows’ and ‘AuthorRows’ respectively. The following formula will return TRUE for the pairs of rows where there is a match on the AuthorID column between the Books and Authors tables; it will additionally return TRUE for the books within the Books table which do not have any match in the Authors table. (With a bit of additional thought, we could extend this formula to cover other types of inner and outer join).

IFERROR(INDEX(BookRows,,3) = INDEX(AuthorRows,,1),ISERROR(MATCH(INDEX(BookRows,,3),Authors[AuthorID],0)))

Almost there! The final steps are:

=LET(x, SEQUENCE(ROWS(Books)*ROWS(Authors) + ROWS(Books)),
y, SEQUENCE(1,COLUMNS(Books) + COLUMNS(Authors)),
BookRows,
INDEX(Books,MOD(x – 1, ROWS(Books))+1,SEQUENCE(1,COLUMNS(Books))),
AuthorRows,
INDEX(Authors, IF(x <= ROWS(Books)*ROWS(Authors), FLOOR((x-0.0001)/ROWS(Books),1)+1, NA()),SEQUENCE(1,COLUMNS(Authors))),
MatchesArray,
IFERROR(INDEX(BookRows,,3) =
INDEX(AuthorRows,,1),ISERROR(MATCH(INDEX(BookRows,,3),Authors[AuthorID],0))),
SORT(FILTER(IF(y <= COLUMNS(Books), INDEX(BookRows, x, y), INDEX(AuthorRows,x, y-COLUMNS(Books))),MatchesArray)))

This final result is identical to what we would expect if a left join had been performed on the two tables in SQL, and note how the left join has not lost the duplicate War and Peace row. If additional rows are added to the underlying tables, we would expect this dynamic array formula to automatically recalculate and resize as required (although some additional checks would need to be added to make this formula resilient to changes in column structure). Furthermore, with some additional tweaking, this formula could be extended to incorporate the other types of join possible in SQL – right outer, full outer, and inner – and this could be an interesting exercise for those looking to explore dynamic array formulas further.

It seems unlikely that Excel is likely to supplant SQL any time soon when it comes to performing joins on tables of data – this somewhat convoluted nested Excel function of 500 odd characters pales in comparison to the sub-100 of native SQL. (Of course, the wordiness of this function could be ameliorated with the use of LAMBDA functions, as Jack covered in an article in April of this year). However, the fact that this operation is possible in Excel, in a single cell, without resorting to VBA automation, external add-ins, or manual data manipulation, feels remarkable – and hopefully even more creative applications of dynamic array formulas lie in the future.

Jacob Warbrick

November 2021