Optimizing summary tables and lookups?

With the new timer feature, I began looking at bottlenecks in a fairly big database (in particular a table with 147k records, on which I have many summary tables): the biggest bottlenecks come from columns named #summary#…, #lookup#… and #lookup##summary#…. Hence my question: is there a way to optimize those “columns”?

In the same vein, I was wondering whether it would be feasible to build a Docker image based on PyPy, or to optimize the core with Pyjion.

Many thanks again for all the great work!


About lookups, I have a few specific questions:

  • will using several “hierarchical” criteria accelerate or slow down lookups? For example, is it better to use MyTable.lookupRecords(category=$category, subcategory=$subcategory) or MyTable.lookupRecords(subcategory=$subcategory)?
  • if using several independent criteria, does the order matter? For example, let’s say I have a table with 100,000 animals, which can belong to 4 different species and come in 10 different colors: would Animals.lookupRecords(species="dog", color="black") and Animals.lookupRecords(color="black", species="dog") differ performance-wise?

Thanks in advance for any clue!

A first result from my experiments: just deleting the group column from the summary tables that didn’t use it (because their formulas only used lookups from other summary tables) gave an average gain of 2 seconds per summary table.

Let me try to answer a few of your questions.

Firstly, no, there aren’t any straightforward ways to optimize the helper columns. There may well be significant optimizations by rewriting parts of the data engine, but this is the most complicated part of it (specifically, lookups are hard for dependency tracking – knowing what formulas need to be updated when some cell changes).

We had some experience with PyPy – an older version of the data engine ran in a PyPy-based sandbox (which had a clever way to do sandboxing, but PyPy’s sandboxing features are no longer maintained; see the PyPy documentation). I don’t believe anyone has tried it since we moved to Python 3 and the gVisor-based sandbox.

There is no difference between T.lookup(A=..., B=...) and T.lookup(B=..., A=...) – these calls translate to identical helper columns.
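
As a tiny illustration, reusing the Animals example from the question above, the two formulas below are equivalent; the keyword-argument order only changes how the call reads, not which helper column is used or how fast it runs:

# Both calls translate to the same helper column and return the same records;
# Animals, species and color come from the example earlier in the thread.
Animals.lookupRecords(species="dog", color="black")
Animals.lookupRecords(color="black", species="dog")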

For the other optimization questions, there are no easy answers. The formula timer is new to us too, so any empirical learnings are welcome 🙂

Interesting observation about the group column! Usually summary tables act on the underlying data, and the group column is needed and useful. But when the goal is a summary of a summary (for which it would be neat to have better support), then group may not be needed and is just wasteful.


Deleting the group column turned out to be a bad idea: some summary tables no longer updated when data changed in the main table. Should that be seen as a bug, or is it the expected behaviour?

The group column is normally expected to be there, so it’s not too surprising that something breaks in its absence (I was surprised to find out that it can even be deleted 🙂). But there is no obvious reason why a summary would stop updating. In a simple test with a deleted group column, the summary does update (it gets new rows, and formulas recalculate). Do you have a simple example that reproduces this issue?

Sure. This episode taught me that group is really special (it can’t be recreated manually once deleted). A quick test gave me the same result as yours on a simple case (the summary table keeps working without group); I’m going to try to find the exact conditions under which deleting it changes the behaviour.

There may well be significant optimizations by rewriting parts of the data engine, but this is the most complicated part of it (specifically, lookups are hard for dependency tracking – knowing what formulas need to be updated when some cell changes).

I had an idea about that, but before suggesting it on the issue tracker, I’d like to make sure it isn’t unrealistic. The big table I have in mind (150k records now) illustrates it quite well: each record is a score, with references to the period, trimester, course and evaluation to which it belongs. On that table, I have multiple summary tables, which calculate averages, rankings and so on for each period, trimester and course.

One thing I know, and that Grist cannot “guess” by itself, is that scores never change after the end of the trimester. So it would be nice to have a way to say “keep rows meeting such-and-such condition completely frozen, never recalculate them”, or even “keep rows whose id is below a given value completely frozen”. This would hold even at first load of the document, or when reinitializing the engine. That way, when adding new scores, the lookups and summaries would get updated, but the engine wouldn’t have to “look at” old rows.

The fact is that this performance issue is becoming more important for us: we now have two years of data in this document, and keeping everything working as it keeps growing… makes me look for solutions before we begin the third year.

That’s a very good point. Having a lot of unchanging past data is definitely a common reason to have too much data, and since it’s unchanging, it ought to be possible to optimize.

The details are difficult, though; it’s unclear how to do the optimization. Even if we mark the “old” rows with a special flag (like “archived”), it’s hard to see how to organize the code so that it recalculates when necessary and avoids calculations when unnecessary.

The poor man’s approach is to separate old data into its own table. Two things may help:

  • You can Detach a summary table, and convert formulas to data. Then it’ll stop updating.
  • You can convert the old data table to an On-Demand Table to avoid loading it into the data engine at all. This is far from a finished feature, but enables keeping and looking at some data that’s invisible to formulas.

I think the main difficulty is to actually allow some calculation from archival data (e.g. when archiving it in the first place, or archiving more data), and to keep it in the same table (e.g. for browsing / filtering / linked views). I haven’t yet seen a proposal for making this efficient and convenient. But I agree that it seems possible and desirable.

I’m aware of detached summaries and on-demand tables. On-demand won’t fit our needs, as we need to be able to use scores in some formulas.

You evidently have in mind things I don’t know about the data engine, but to sum up what I understand so far, I’d say there are three aspects:

  1. having the data of (fields of) a record available for calculations in other tables (either normal, or summary ones);
  2. having the data of (fields of) a record automatically updated;
  3. for summary tables, having the aggregated data updated.

Making a table on-demand disables both 1 and 2.

Detaching a summary table disables 3.

Converting formulas to data disables 2 for all records of a table, and isn’t possible for summary tables.

What I’d find awesome would be to disable 2 on some records depending on conditions, while keeping both 1 and 3. This would allow for things like “Don’t update periodic averages older than 1 trimester, but do update average of averages on other summary tables”.
It would be even more awesome if those conditions could be defined in the same vein as access rules, at field or record level.

Actually, I think what you are suggesting may indeed be possible.

In fact, today I tried something that kind of works. I have three columns: A, B, and Paused. B has this formula:

# When Paused is checked, return the previously stored value of B
# without creating any dependencies.
if $Paused:
  return PEEK($B)
UPPER($A)  # otherwise, recalculate normally

The if $Paused branch is what pauses the calculation. When $Paused is true, the formula returns the previous value of B without creating any dependencies. Now changes to $A no longer affect it! But you can make it recalculate by unticking $Paused.

For proper lookups that are optimized in a similar way, I still expect substantial work would have to happen. But I wonder if this idea can already be put to interesting uses.
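
For the “frozen after the end of the trimester” case discussed earlier, the same pattern could presumably be driven by a date comparison instead of a manual toggle. A minimal sketch, keeping the toy A/B columns and assuming a hypothetical Trimester_End date column:

# Hypothetical variant of the PEEK() trick: freeze the value once the
# trimester is over, instead of relying on a manual Paused toggle.
# Trimester_End is an assumed Date column; TODAY() is Grist's built-in.
if $Trimester_End and $Trimester_End < TODAY():
  return PEEK($B)   # keep the stored value; no dependencies are created
UPPER($A)           # recalculate normally while the trimester is ongoing

As with the Paused version, once a row takes the frozen branch it no longer depends on $A, so later changes to $A don’t trigger recalculation of that row.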


It’s certainly something I’ll look at! Until now, PEEK was something I regarded with a kind of respectful fear, but now I think I’ll go and master it!

Isn’t it a problem that all data is always loaded in Grist?

Most other systems I use with tons of data load only a period when you enter a table, and then have loading filters where you select what data you want to load. The calculations are then performed on that loaded data…

@jperon, that is an appropriate attitude to look at PEEK with (respectful fear)! 🙂

@Rogerio_Penna, loading all data is indeed a problem when using Grist with tons of data. Grist was created with “spreadsheet-sized” data in mind (and that’s still how we recommend using it, for under 100k rows), and for that use case, loading all data has worked and is simpler. But we absolutely do want to support larger data, and for that we’ll have to avoid loading all of it (into the data engine and into the browser) or running all calculations at once. It’s a big project, tracked here: Support larger databases · Issue #43 · gristlabs/grist-core · GitHub

Just applying the PEEK trick to the summary tables of the biggest table gave us a gain of about 15 seconds on our database.

While brainstorming, I was wondering: I understand that multithreading would be quite complicated for calculations because of dependency problems, but wouldn’t aggregations be easier to distribute across several cores? This could be a way to drastically reduce those #summary# and #lookup# bottlenecks.
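
To make the intuition concrete, here is a toy sketch in plain Python (nothing to do with Grist internals or its actual code) showing why aggregations distribute well: partial aggregates can be computed independently on chunks of rows and merged afterwards, since sums are associative.

# Toy illustration only: a parallel group-by sum using multiprocessing.
from collections import Counter
from multiprocessing import Pool

def partial_sums(rows):
    """Aggregate one chunk of (species, score) pairs into per-species totals."""
    totals = Counter()
    for species, score in rows:
        totals[species] += score
    return totals

def parallel_group_sum(rows, workers=4):
    # One chunk per worker; chunks can be processed independently.
    chunks = [rows[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sums, chunks)
    # Merging is cheap: sums are associative, so the order doesn't matter.
    merged = Counter()
    for part in partials:
        merged.update(part)
    return merged

if __name__ == "__main__":
    data = [("dog", 2), ("cat", 3), ("dog", 5)] * 25000   # 75,000 toy rows
    print(parallel_group_sum(data))

The hard part in the data engine is of course the dependency tracking mentioned above, not the arithmetic itself; this only shows that the aggregation step is, in principle, parallelizable.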

Thank you @dmitry-grist for all the consideration and time you give to this thread!


Thanks for the insights. It is clear that optimizing these aspects is complex. The PEEK trick is fascinating, and I’ll definitely explore it further. Regarding old data, separating it into a different table might help but isn’t perfect for our needs. Could multi-threading be a potential solution for handling aggregations and reducing bottlenecks?