Getting from long data to summary tables

Jordan_Stoeger · March 2, 2022, 7:12pm

Hello! I am working on data related to people who have outcomes for multiple events. Bear with me, this is a complicated data model and I am also new to relational databases (though far from new to large unruly data sets).

Background

There are multiple types of events. Within events there are various outcomes that map back to a binary variable (success/not success).

The core data is in a table with a line for each person-outcome.
I have another table that links outcomes to a type code.
And for each outcome type, I have a column that links the outcome to a numeric variable.

For example:

Task Type 1 can have a success or multiple codes for failure. So for the Type 1 Binary Codes column, I have

outcome 1-DQ is associated with Task Type 1 and has a numerical code of 0
outcome 1-LOSE is associated with Task Type 1 and also has a numerical code of 0
outcome 1-WIN is associated with Task Type 1 and has a numerical code of 1
outcome 2-WIN is not associated with Task Type 1, so has a blank value

Current Status/Problem

I have a table that sums the number of binary outcomes of a given type for each person (filtering to remove any blank values). BUT I have up to two rows for each person. For Example, the table for Task Type 1

person1 | binary 0 | count
person1 | binary 1| count

Desired Solution

What I want is a table with 1 row for each person and a column for each binary outcome type, with the count as the value:

person 1| binary 0 count | binary 1 count
person 2| binary 0 count | binary 1 count

I know I could do this in R with a quick pivot_wider from the tidyr package, but I’d prefer, if possible, to keep it all within Grist so that my values update when I add/change data.

Thanks

Thanks for reading. I’m happy to answer any questions to help you gurus guide me to a solution.

Jordan_Stoeger · March 3, 2022, 4:27am

Follow up question… Mostly unrelated What is gristHelper_Display? Sometimes when I am making references I get this in my option list (and sometimes it’s what I expect when I use $existing_column.reference_column where existing is the name of raw data and reference is a reference I want to tap from some other table.

natalie-grist · March 3, 2022, 2:32pm

Hi Jordan!

I wrote up an example using lookupRecords formulas. Check it out at the link below:

https://public.getgrist.com/qZzjwpeVKmtn/Community-708/m/fork

The first table is Task Type 1 that lists Names and outcomes (choice of either “binary 0” or “binary 1”).

The second table lists Names and then the counts for each Outcome type (“binary 0” or binary 1").

The formula used is:

len(Task_Type_1.lookupRecords(Name=$Name, Outcome="binary 0"))

The lookupRecords formula finds a list of records that match the arguments we give it. It follows the format of [Table_Name].lookupRecords([A]=$[B]) where [Table_Name] is the name of the table you want to lookup data in. [A] is the column in the table being looked up (named at the beginning of the formula) and [B] is the column in the current table / the table you are entering the formula in.

Here, we gave it two arguments; Name=$Name, Outcome="binary 0". So, the value in the Name column of the Task Type 1 table needs to match the value in the Name column of this table (Task Type 1 Totals). Then, the value in the Outcome column of the Task Type 1 table needs to be “binary 0”. It finds a list of records where both arguments are true. For the Binary 1 count column, the only thing that changes in the formula is Outcome = "binary 1".

len() counts the number of items in the list found by the lookupRecords formula.

What is gristHelper_Display ?

When you are creating that reference, you’re specifying which column to show instead of the default RowID. Behind the scenes, Grist creates a hidden formula column called gristHelper_Display with a formula like $reference_column.<name of chosen column> which is used to display the value from the column you have selected.

In the screenshot below, column A is a reference column to the Name column in the Task Type 1 Totals table. In column B, I added the $gristHelper_Display. You see that they are the same. The formula behind $gristHelper_Display is equivalent to $A.Name

Let me know if you have any other questions

Jordan_Stoeger · March 3, 2022, 3:44pm

Wow! Thank you so much for taking the time to put this together. I do have more questions.

The first thing I noted when I went into your example is that the “binary 1” and “binary 0” values in Table 1 are in little grey boxes. I’ve accidentally clicked Reference List as a data type, so I’ve seen this but never used it. Can you explain why you did it that way?

Is the code string you’ve shared python? (I’m assuming it is) – I’ve got a very small amount of python knowledge, but it looks very familiar to R, which I am quite comfortable in. This looks like a next-level VLOOKUP in excel, which is an old friend I’ve been trying to ditch for years (we broke up because I love left_join so so much).

I just want to reiterate back to you what I think the code is doing to see if I’m on the right track. len is looking for the length of what’s returned. lookupRecords requires a column in the new table where this code will go called $Name and the code string looks for records in the table Task_Type_1 where Name equal to the value in the particular row of the new table and counts the values equal to the text string "binary 1" or "binary 0" and reports out the length (len) of the records that match the query of the particular name and the particular outcome.

This is making sense – I hope I’m on the right track. I got it to calculate within my own workspace, so I feel like I might be onto something! (Note: I did not have to turn my Outcomes into a Reference List type, so I am even more curious why your little grey boxes are present).

Having successfully cleared this hurdle (thanks in large part to you!), the next thing I’d like to do is get a sum of the total number of events each person participated in. Even after I change the count columns in Table 2 to data type numeric, I can’t use sum($Binary_0_Count, $Binary_1_Count) in a new column to compute that (I get #TypeError, ‘float’ object is not iterable). (I also tried =$X+$Y but that had the same issue). Further, I’d like to know their “success record” so in addition to seeing the total number of events, I’d like to take their successes or $Binary_1_Count and divide it by the total number of events.

Now, going back to gristHelper_Display sometimes if I don’t use the gristHelper I get a data error, and sometimes I expect to be able to reference something but it doesn’t appear to be there. I think this is because my references are sometimes too far detached from the data I’m pointing to.

For example, within the table Task_Type_1 there’s a column that links the name and outcome to a specific event (with an event ID) and then there’s another table (Events) linking the event ID to a year.

I can use =$event_ID.Year to get the year. Yet another table (Years) that gives metadata about specific years. One piece of info is a Boolean/Toggle variable. Let’s say I want to pull in that Boolean variable. Since Year is already the type of data that has an = sign at the beginning (what I think of as a reference, but I’m not sure that’s the right terminology) , I can’t query it for the Boolean variable. That is when I type =$Year. there are no options listed (even though I know there are many fields connected to the Year code via the Year table).

I’d like to be able to say =$Year.Boolean and then (I think) I would be able to add an argument to the len column that says Boolean = True or something to that effect… but since the two variables, Boolean and event, have an intermediary piece of data connecting them (Year), that doesn’t seem possible (though after your amazing answer here, I’m almost certain it is!)

Alternatively, I could manually enter a list of years (ones where my Boolean variable is true) and that would also work, but I’m not sure how I might do that (in R, it would be using %in% with c(X, Y, Z) within the argument).

Thank you, again, for your help! I’m having wayyy too much fun learning here. Grist as a connection between SQL/relational databases and spreadsheets has unlocked something remarkable in my brain. Grist is so, so powerful and it’s been such a pleasure to fumble around in it.

Jordan_Stoeger · March 3, 2022, 3:47pm

reading back through this, I’m thinking I can probably use this lookupRecords (or something similar) to connect my =Year column to the Boolean variable, but I have to get back to work, so I will excitedly await your assistance.

natalie-grist · March 3, 2022, 4:59pm

I made the Outcome column a Choice column. Since there is a specific set of values you can choose from, this is a good way to go since it eliminates any spelling mistakes leading to multiple versions of the same value. This is also one of the many reasons references are so great. Using Names as an example, it’s very easy to misspell someone’s name. If you have a list of Names then reference that list, you eliminate spelling errors (and also can reference other information from that record).

Yes, it is Python! The entirety of Python’s standard library is available to you for use in formulas. lookupRecords and lookupOne (only looks up a single record) are Grist’s version of VLOOKUP.

EXACTLY!

Again, using Choice type column here. Choice Lists also exist which allow you to select multiple options from a list.

The Color column of the Color table of the Inventory Manager template is a Choice column. We can only select one choice from the list. This column is seen in multiple tables so we actually reference the original Color column of the Color table like you see in the Color column of the Product Variant table. Just like with any other reference column, you see all the options available. If you added another color to the choice column, it would appear here. So, if you needed to use the same ‘binary 1’ and ‘binary 0’ options in other tables, you could create a table to store the choices then reference that column! This saves you the trouble of having to add a new choice in multiple places.

Check the column type. If you change the column type to Numerical, you can add $Binary_0_Count+$Binary_1_Count without error. It’s possible the column type is set to Text - or if it’s Any, it might not recognize that it’s a numerical value and be looking at it as text instead. On the example document, I added two columns for Total Events. The first uses this formula:

len(Task_Type_1.lookupRecords(Name=$Name, Outcome="binary 1")) + len(Task_Type_1.lookupRecords(Name=$Name, Outcome="binary 0"))

This just combines the formulas used in the Binary 0 Count and Binary 1 Count columns. The second Total Events column uses this formula: $Binary_0_Count+$Binary_1_Count. Right now, it gives a #TypeError. Change the Binary 1 count column from text to Any or Numeric and it will get rid of the error

An equal sign shows that the column is a formula column . There’s a link icon that shows it is a reference column. My Year column in the Task Type 1 table is both a formula column and a reference column so it shows both icons.

In our example, I created two new tables, Years and Events. In the Task Type 1 table, I have a reference column for Event ID that references the Event ID column of the Events table. The Year column is a formula column that pulls the Year for the record referenced via Event ID. This is also a reference column to the Year column of the Years table. So, it’s pulling the year associated with the event but also points to that Year’s record in the Years table. Next, we have the Year Boolean column. Another formula column that pulls the value from the Boolean column of the Years table for the record selected in the Years column.

The Event ID column of the Task Type 1 table points to the entire record for the ID shown. The ID is just the specific value we chose to see. Same with the Year column. It points to the entire record for the year shown. We find the year via formula but we pull the record via reference.

I added two columns to the Task Type 1 Totals table specifically for this. The formula for Binary 0 Bool = True is:

len(Task_Type_1.lookupRecords(Name=$Name, Outcome="binary 0", Year_Boolean="True"))

Then, for Binary 1 Bool = True, just update the Outcome to “binary 1”.

This is what we love to hear We are so happy that you are enjoying Grist! It really is a lot of fun.

Jordan_Stoeger · March 3, 2022, 9:16pm

I made the Outcome column a Choice column. Since there is a specific set of values you can choose from, this is a good way to go since it eliminates any spelling mistakes leading to multiple versions of the same value.

Ok. The way I have my tables configured is that any time I reference a person, I actually reference a Person_ID and then use the Reference column type and then pull the Name using $Person_ID.Name.

Am I correct in thinking that this is a lot like a choice list, except that instead of setting up a choice list, the choice list is dictated by my Person_ID column and the value can only be one that already exists in the original list of IDs. Perhaps this is what you’re referring to just before the emoji?

I’ve got a success rate for people by both task types. WOO! I do have a few rows where there is a divide by 0 error (the total number of events of that type the person participated in is 0). Is there a way to get Grist to accept that as a 0 value, despite it not being quite correct mathematically? I’d prefer that over having to remove their record from the table just in the off chance that they participate in an event at a later date.

An equal sign shows that the column is a formula column . There’s a link icon that shows it is a reference column. My Year column in the Task Type 1 table is both a formula column and a reference column so it shows both icons.

Ok, so… Even if the value that is being referenced (the original location of the record that has a = sign in the new location) is a reference, I still need to tell Grist I want this version/location of the record to also be a reference.

When I try to do it to a column that already has =, It’s not playing nice, but if I make it a reference first and then implement the equation, it’s working!

Jordan_Stoeger · March 3, 2022, 9:28pm

Hello again!

If I wanted to exclude something from my lookup, I expected Column!=“Value” to work, but I’m getting a Syntax Error.

natalie-grist · March 3, 2022, 10:26pm

Correct - when you create a reference column, your only selection options are what exists in the column that is referenced. For instance, Person_ID only contains the IDs that exist in the reference table. Add an ID to the original table and it will appear as an option in the reference column.

Absolutely. We can add to our formula so it returns nothing when there are no events for a person. My example column is Binary 1 / Total Events. The formula used here is:

if $Total_Events == 0:
  return None
$Binary_1_Count/$Total_Events

The first two lines get rid of the error. The error still technically occurs but instead of it returning the error, it just returns a blank value. When setting up formulas, always add this part after confirming your formula works. Since it hides errors, there could be an issue that you wouldn’t see.

Hm, I’m not sure what the issue could be here. I’m glad it’s working where you create the reference first then the formula though. In the example doc, I created the formula first for the Year column of the Task Type 1 table then changed the column type to reference and it worked fine.

This was a new one for me, had to get help from our pros I created a column called Binary 1 Filtered in the Task Type 1 Totals table. The formula used here is:

records = Task_Type_1.lookupRecords(Name=$Name, Outcome="binary 1")
year = Years.lookupOne(Year=2020)
filtered = [r for r in records if r.Year != year]
return len(filtered)

First, we get our list of records with all of the conditions we do want - just as we did before.
Next, just look up the record for 2020 in the Years table.
Then, we filter out all results in our initial lookup that contain the 2020 record.
Finally, we count the number of records.

The formula for our filtered variable is a collection-based “for” loop. We have to iterate through our list of records and check the Year of each record. If the Year is not equal to our year variable (2020), then we keep it in the list.

You can read about Python “for” loops here: Python "for" Loops (Definite Iteration) – Real Python

Jordan_Stoeger · March 3, 2022, 10:37pm

Wow, thank you! I’m glad to hear some of what I’m trying to accomplish challenged you, too

My main take away is that I need to learn more Python. I’m quite comfortable with for loops in R, so I’m sure there’s a lot of transfer. R is the only programming language I have and I’m finding learning the second one is actually more difficult, since before I was a blank slate. Now I have all kinds of ideas that are (almost inevitably) wrong!

I’d be happy to mark this particular thread as closed/with solution – you’ve gone above and beyond in helping me sort through all of this! In the future, would it be OK for me to respond here again should I have other questions? I’d hate to clog up the Help forum with super-specific questions.

Thank you, again, for all your help.

natalie-grist · March 3, 2022, 11:24pm

If it’s specific to this thread, feel free to post here! Then I can pull up my example doc and keep building on it If it’s a question still specific to your data but not specific to this thread, start a new post. Also - if it’s specific to data that you don’t want to post in the Community, feel free to email us at support@getgrist.com and we can help you there as well.

And for learning Python, check out https://futurecoder.io/ . One of our developers created it!