A ten-minute voice memo is usually seven minutes of talking. This finds the dead air — thinking pauses, shuffling, the walk to the door — and cuts it, leaving natural breaths alone.
The cutter measures energy in 50 ms blocks and compares each block against your threshold. Any run of blocks that stays below the threshold for longer than your minimum is collapsed to a short beat (a fifth of a second), so speech doesn't slam together unnaturally. Breaths and word-gaps are shorter than any minimum offered here, so they survive, which is why the result sounds edited rather than robotic. If your recording is noisy enough that "quiet" isn't quiet, run it through the noise remover first; the cutter sees silence much more clearly afterwards.
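For the curious, the block-and-threshold idea can be sketched in a few lines of Python. This is a rough illustration, not the tool's actual code: the function name `cut_silence`, its parameter names, the default threshold and minimum, the use of RMS as the "energy" measure, and the choice to keep the first fifth of a second of each long pause are all assumptions made for the example.

```python
import math


def cut_silence(samples, sample_rate, threshold_db=-40.0,
                min_silence_s=1.0, block_s=0.05, beat_s=0.2):
    """Shorten long quiet runs in mono float samples (range -1.0..1.0).

    Hypothetical sketch: names and defaults are illustrative only.
    """
    block = max(1, round(sample_rate * block_s))    # samples per 50 ms block
    beat = round(sample_rate * beat_s)              # samples kept per long pause
    min_samples = round(sample_rate * min_silence_s)
    threshold = 10 ** (threshold_db / 20)           # dBFS -> linear RMS level

    # Mark each block as quiet (True) or not (False) by its RMS level.
    quiet = []
    for start in range(0, len(samples), block):
        chunk = samples[start:start + block]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        quiet.append(rms < threshold)

    out = []
    i = 0
    while i < len(quiet):
        # Find the end of the current run of same-state blocks.
        j = i
        while j < len(quiet) and quiet[j] == quiet[i]:
            j += 1
        run = samples[i * block:j * block]
        # Only quiet runs longer than the minimum are collapsed;
        # shorter gaps (breaths, word-gaps) pass through untouched.
        if quiet[i] and len(run) > min_samples:
            run = run[:beat]
        out.extend(run)
        i = j
    return out
```

With these assumed defaults, one second of tone, two seconds of silence and another second of tone at a 1000 Hz sample rate comes out as 2200 samples: the two-second pause shrinks to a fifth of a second, while a 0.3-second gap in the same position would be left exactly as it was.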