Infrastructure monitoring with Grist

I had a strange idea: why not use Grist as an infrastructure monitoring tool?

For an old (in-house, defunct) monitoring system, I had written some binaries that each do one specific job, like query a webpage, ping a system, connect to a TCP port, etc.

All those binaries have one thing in common: they return 0 on success, 1 on a warning, and 2 on an error. They also write their error messages to stdout.

For example, to test a TCP port:

./testtcp  --url ''  --port '3389'

Or test if a specific string is on a webpage:

./web  --containsStr 'Characterization of Selective Autophagy'  --url ''  --timeout '10'  --code '200'

Now I’ve written a short “daemon” that connects to Grist via its API, fetches all the checks and their settings, calls the binaries, and sends the results back to Grist. And voilà, an actually nice system.

I don’t know yet if I want to explore this idea further, but so far it actually works quite well.

Edit: The good thing about those small binaries is that you can easily write new ones, or wrap an existing binary in a shell script so that it returns those exit codes, and then just call them via the “daemon”, so it’s easily extendable.
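As a rough sketch of such a wrapper, assuming a disk-usage check built on df: the function names classify_usage and check_disk and the thresholds are made up for illustration and are not one of the original binaries. The only thing taken from the post is the convention itself (0 = OK, 1 = warning, 2 = error, message on stdout).

```shell
# Hypothetical wrapper following the 0/1/2 exit-code convention.

classify_usage() {
    # $1 = used percent, $2 = warning threshold, $3 = critical threshold
    used=$1; warn=$2; crit=$3
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL: disk usage at ${used}%"
        return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING: disk usage at ${used}%"
        return 1
    fi
    echo "OK: disk usage at ${used}%"
    return 0
}

check_disk() {
    # $1 = mount point, $2 = warning threshold, $3 = critical threshold
    # df -P prints one header line, then a line whose 5th column is "NN%".
    pct=$(df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }')
    classify_usage "$pct" "$2" "$3"
}
```

A check script built this way would end with a call such as check_disk / 80 90 as its last command, so that the function's return value becomes the script's exit code.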

What do you think?


Also the daemon is only 40 lines of Nim code.

import gristapi, json, os, strutils, tables, times
import strformat, osproc, threadpool

# Connection settings for the Grist document that holds the checks and results.
var grist = newGristApi(
  docId = "<myDocId>",
  apiKey = "<myApiKey>",
  server = "<myGristServer>"
)

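# Run one check binary and store its exit code and output in the "Checkresults" table.
proc doCheck(name, binary, params: string) {.thread.} =
    let cmd = fmt"./{binary} {params}"
    let (output, exitCode) = execCmdEx(command = cmd)
    case exitCode
    of 0:
      echo fmt"[GOOD] {name}"
    else:
      # Any non-zero exit code (1 = warning, 2 = error) is reported as bad here.
      echo fmt"[BAD!] {name}"
    discard grist.addRecords("Checkresults", @[%* {
        "check": name,
        "exitCode": exitCode,
        "stdout": output,
        "date": $now(),
        "cmd": cmd
    }])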

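# Fetch all enabled checks, then build each check's command line from its parameters.
let checks = grist.fetchTableAsTable("Checks", filter = %* {"enabled": [true]})
for id, check in checks.pairs:
  let name = check["slug"].getStr()
  let checkparams = grist.fetchTableAsTable("Checkparams", filter = %* {"check": [id]})
  let binary = check["binary"].getStr()
  var params = ""
  for checkparam in checkparams.values:
    let key = checkparam["key"].getStr()
    let value = checkparam["value"].getStr()
    params &= fmt" --{key} '{value}' "
  # Run each check on the thread pool so a slow check does not block the others.
  spawn doCheck(name, binary, params)

# Wait for all spawned checks to finish before the program exits.
sync()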

This is very cool! I like this approach.

I might suggest a couple tweaks if you take it further:

  1. Automatically clean up results older than a certain threshold (Grist is limited in how much data it handles well, so indefinite growth will lead to slowness and eventual trouble if left unchecked).

  2. You could construct the command line in Grist from the params, then you wouldn’t need an extra fetch for each check. You can also collect results for a single bulk update. Then the entire cycle, as far as interactions with Grist, could be a single fetch of the enabled checks, and a single post of all the results of these checks.

  2. DONE, it’s WAY faster now. But since it’s threaded code, it seems it’s now too fast for Grist:
Error: unhandled exception: {"error":"Too many backlogged requests for document <myDocId> - try again later?"} [ValueError]

Do you know of a knob I can turn to allow more backlogged requests?
Edit: For now I catch the error, sleep a little, and try again 🙂
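
The catch, sleep, and try again approach can be sketched as a generic shell helper. This is only an illustration of the idea, not the daemon's actual Nim code: the name retry, the doubling delay, and the attempt limit are all assumptions.

```shell
# Hypothetical helper: run a command, retrying with a growing delay on failure.
# usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while :; do
        status=0
        "$@" || status=$?          # capture the exit code without aborting
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            return "$status"       # give up and pass the last failure on
        fi
        sleep "$delay"
        delay=$((delay * 2))       # back off a little more each time
        attempt=$((attempt + 1))
    done
}
```

Any command that exits non-zero when the request is rejected can be wrapped this way, for example a curl call with --fail that posts the results to Grist.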

For 1., I could do the cleanup in the daemon, but can you think of a way to do this inside Grist?

Edit2: I’ve now done it in the daemon.
Unfortunately, I had to fetch all the “Checkresults” and then delete them in bulk.
It would be nice if there were a delete API that accepts a filter.
But I think the current filter implementation also does not support “>”, “<”, etc., right?

I found this thread after searching for “Too many backlogged requests for document”, an error I am also encountering. I did some benchmarks and digging to troubleshoot the issue, and this is what I found:

  • The maximum of 10 backlogged requests is hardcoded in Grist’s source code and also documented, so increasing it requires patching the code.
  • When self-hosting, there is no daily rate limit or per-second rate limit.
  • Throughput: with a concurrency of 10, I am able to make 70 GET requests per second and 40 POST requests per second.
  • To rate-limit properly, we use the p-queue npm library to put requests in a queue that allows at most 10 to be processed concurrently.
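
p-queue is a JavaScript library. For scripts that are not written in JavaScript, a comparable cap on in-flight work can be approximated with xargs -P, which is a swapped-in technique and not what was used above. The helper name run_limited is hypothetical, and -P is an extension found in GNU and BSD xargs.

```shell
# Hypothetical helper: run COMMAND once per input line, with at most
# MAX_PARALLEL copies running at the same time.
# usage: run_limited MAX_PARALLEL command [args...] < items
# Assumption: each input item is a single token without whitespace or quotes
# (for example, the path of a JSON payload file to send).
run_limited() {
    limit=$1; shift
    xargs -P "$limit" -n 1 "$@"
}
```

With a limit of 10, this keeps the number of simultaneous requests at the backlog limit mentioned above, provided nothing else is talking to the same document.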

Posting it here in case it’s useful to others facing the same error.

