{hhr, 200203}
After lots of debugging and tuning, I am finally accepting that the sorting (by spectral centroid) works correctly; perceptually, however, it is too weak a key: while you can perfectly sort a set of pitches, with complex sounds it is far from apparent that the "brightness" is indeed sorted.
{author: hhr, date: 200131, kind: note, keywords: [sorting, spectrum]}
I worked on the idea of a one-dimensional sort, using the spectral centroid as key, with the input being the signal preprocessed through the spectral flatness / contrast selection described previously.
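As a plain-Scala illustration of that sorting key (not the actual FScape implementation; `Segment` and its fields are hypothetical), the centroid is simply the magnitude-weighted mean frequency of a segment's spectrum:

// Hypothetical illustration of sorting segments by spectral centroid;
// not the FScape code, `Segment` and its fields are made up.
case class Segment(id: Int, binFreqs: Vector[Double], mags: Vector[Double])

// centroid = sum(f_i * m_i) / sum(m_i): magnitude-weighted mean frequency
def centroid(s: Segment): Double = {
  val num = (s.binFreqs zip s.mags).map { case (f, m) => f * m }.sum
  val den = s.mags.sum
  if (den > 0) num / den else 0.0
}

def sortByBrightness(xs: Vector[Segment]): Vector[Segment] =
  xs.sortBy(centroid)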
{hhr, 200201}
Still struggling with the multi-pass segmentation I need for this. I am now trying to split the process into separate stages, each of which can do its own segmentation.
{hhr, 200211}
I have now started another approach for 1-dimensional clustering, hopefully leading to clearer orderings. This is still in progress, though.
{hhr, 200212}
Was looking at implementing the Christofides algorithm, based on a complete graph of pairwise Pearson cross-correlations (of the frequency-domain magnitudes); but unfortunately I noticed that these correlations do not obey the triangle inequality required for this type of TSP algorithm, so I have to find another one.
Actually, with the right mapping, the edge weights do obey the triangle inequality in something like 98% of the cases, so that algorithm would probably work. But in the meantime I had looked at the Lin-Kernighan algorithm, and found an implementation that I translated to Scala. I still have to integrate it with FScape, but here is a first rendering, using a traveling-salesman solution to sort the segments. (I used a minimum-phase transform to sharpen the temporal envelopes.) What I like is that, of course, you do not get an "overall" ascent or descent, but you get groupings of similar resonances, as if walking up and down an endless staircase.
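As an aside, one mapping that always yields a metric is d = sqrt(2 * (1 - r)), i.e. the Euclidean distance between the standardised magnitude vectors. The following plain-Scala sketch only illustrates the idea of building the distance matrix and walking it; it uses a greedy nearest-neighbour tour as a cheap stand-in for Lin-Kernighan, and all names are hypothetical.

// Sketch: distance matrix from pairwise Pearson correlations, plus a greedy
// nearest-neighbour tour as a stand-in for a proper TSP solver (the actual
// rendering uses Lin-Kernighan). Names are hypothetical.
def pearson(a: Array[Double], b: Array[Double]): Double = {
  val n   = a.length
  val ma  = a.sum / n
  val mb  = b.sum / n
  val cov = (a zip b).map { case (x, y) => (x - ma) * (y - mb) }.sum
  val sa  = math.sqrt(a.map(x => (x - ma) * (x - ma)).sum)
  val sb  = math.sqrt(b.map(y => (y - mb) * (y - mb)).sum)
  cov / (sa * sb)
}

// d(i, j) = sqrt(2 * (1 - r)); always satisfies the triangle inequality
def distances(mags: Vector[Array[Double]]): Array[Array[Double]] =
  Array.tabulate(mags.size, mags.size) { (i, j) =>
    math.sqrt((2.0 * (1.0 - pearson(mags(i), mags(j)))).max(0.0))
  }

def nearestNeighbourTour(d: Array[Array[Double]]): Vector[Int] = {
  val n       = d.length
  val visited = Array.fill(n)(false)
  var cur     = 0
  visited(0)  = true
  var tour    = Vector(0)
  while (tour.size < n) {
    val next = (0 until n).filter(j => !visited(j)).minBy(j => d(cur)(j))
    visited(next) = true
    tour :+= next
    cur = next
  }
  tour
}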
{hhr, 200221}
FScape integration is there now, and I made a workspace for Mellite that renders the above sound file.
Corresponding Mellite workspace
note: needs Mellite 2.43.1 or higher due to bug fixes
Corresponding Mellite workspace
note: needs Mellite 2.43.2 or higher
{hhr, 200212}
It's the segmentation of the microphone input, obtained by thresholding against unusual spectral content, as stated in the 'foreground/background' comment here. If I remember correctly, I add an additional threshold for spectral flatness.
// select frames whose spectral flatness is below its threshold and whose
// arithmetic-mean magnitude lies between the two energy thresholds
val _sel = (flatness  < threshFlatBelow ) &
           (arithMean < threshArithBelow) &
           (arithMean > threshArithAbove)
From the selected bits, I calculate filters that emphasise the resonant properties.
// pad each magnitude value with a zero imaginary part
val fftFilt = ResizeWindow(inMag, 1, stop = +1) // real-to-complex
// transform the selected magnitude spectra back to the time domain, yielding impulse responses
val filt0Raw = Real1IFFT(fftFilt, fftSize, mode = 1)
val filt0RawOff = filt0Raw
val fftSizeOut = fftSize
val filt00: GE = filt0RawOff
// scale each impulse response by a sawtooth-derived ramp in the range 0...1
val filt0 = filt00 * ((LFSaw(fftSize.reciprocal) * -0.5) + 0.5)
// split the filter stream into two interleaved kernel sequences (A/B)
val filt0A = BufferMemory(filt0, fftSizeOut)
val filt0B = filt0.drop(fftSizeOut)
val step2 = stepSize * 2
val filtA = ResizeWindow(filt0A, fftSizeOut * 2, stop = -fftSizeOut) // even
val filtB = ResizeWindow(filt0B, fftSizeOut * 2, stop = -fftSizeOut) // odd
// cut the input signal into the previously determined segment spans
val inCv0 = mkIn()
val segmSpans = segmSpans0 * stepSize
val inCv = Slices(inCv0, segmSpans)
// alternate the kernel updates of the two convolutions, offset by one step
val updA = Metro(step2, stepSize + 1)
val updB = Metro(step2, 1)
val convA = Convolution(inCv, filtA, kernelLen = fftSizeOut, kernelUpdate = updA)
...
{hhr, 200309}
Towards a Rendering Loop
I want to give you an update on what I have been working on recently, hopefully arriving soon at the stage that we discussed in our last phone meeting—being able to render half an hour of "example material".
As the logic for starting and coordinating the processes is simple to explain but less simple to implement correctly, and I am still pursuing the idea of doing this implementation fully within a Mellite workspace (instead of as a separate standalone project), progress is slower than I had hoped. But I am getting there. A lot of the prototyping I did for the new version of another piece, Writing (simultan), actually applies here, so there is a "pattern" for how these processes work.
Logically, I am dividing the processes into three layers: one is real-time input gathering, one is rendering the various transformations and analyses, and the third is real-time sound production. The idea is that each can run on its own, just exchanging data. The input gathering thus records into sound files which are "filled up" as new material is needed (and available, given the planned "common listening" mode!). The rendering layer then takes the current "database" (the gathered input sound), if it has sufficient length, and starts non-realtime rendering processes, the output of which forms another pool of sounds. This pool is then used by the third layer.
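As a schematic plain-Scala sketch of this structure (not the Mellite workspace code; all names, stubs and the minimum length are made up), the three layers can be thought of as independent loops connected by two queues:

import java.util.concurrent.LinkedBlockingQueue

// Schematic sketch only: three independent layers exchanging data through pools.
final case class Capture(path: String, numFrames: Long)
final case class Rendered(path: String)

val minFrames = 44100L * 30                                                   // assumed minimum database length
def recordChunk(): Capture             = Capture("/tmp/in.aif", 44100L * 60)  // stub for microphone capture
def runRenderJob(c: Capture): Rendered = Rendered("/tmp/out.aif")             // stub for a non-realtime rendering job
def playFile(path: String): Unit       = println(s"play $path")               // stub for real-time playback

val database = new LinkedBlockingQueue[Capture]()   // layer 1 -> layer 2
val pool     = new LinkedBlockingQueue[Rendered]()  // layer 2 -> layer 3

// layer 1: real-time input gathering, "filling up" the database as material comes in
val capture = new Thread(() => while (true) database.put(recordChunk()))
// layer 2: non-realtime rendering, started once the database is long enough
val render  = new Thread(() => while (true) {
  val c = database.take()
  if (c.numFrames >= minFrames) pool.put(runRenderJob(c))
})
// layer 3: real-time sound production, playing from the rendered pool
val play    = new Thread(() => while (true) playFile(pool.take().path))

capture.start(); render.start(); play.start()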
The first rendering process I am implementing is the one described on the left here: applying energy / spectral flatness thresholds, transforming the thus filtered material into pure resonances (minimum-phase spikes), and then ordering them using the complete graph of timbral similarities of the segments along with the Lin-Kernighan algorithm. Most of the work I am doing now is making sure that this runs by itself: drawing input sounds from the microphone, rendering into a pool, playing from the pool. I should be finished with this in the next few days.
TSP ordering, using the original segment spacing (not reordered), "drawn towards" the analysis grid by taking the geometric mean of the original spacing and the analysis grid.
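For reference, that "drawing towards" the grid is simply the geometric mean of the two spacings; a hypothetical one-liner:

// "Drawn towards" the analysis grid: geometric mean of the original segment
// spacing and the grid spacing (names are hypothetical).
def drawnTowards(origSpacing: Double, gridSpacing: Double): Double =
  math.sqrt(origSpacing * gridSpacing)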
{hhr, 200312}
For the first process, the percussive resonant steps, I am now trying to see what happens if four channels are combined and "synchronised" in pairwise succession either at the beginning or at the end of the phrases.
{hhr, 200309}
Note to self: what happens if we cut in the onset of a segment and use a static (rhythmic) repetition?
{hhr, 200315}
I spent some time debugging a "full" loop for capturing, rendering, and playing, so that it works without problems, also on the Pi. The sound is thus still very simple, as I have only implemented one process, and there are only a few parameters being modulated, so the gestures are quite similar. Still, I think it is a good beginning. (I still have to fix a bug where some analyses produce zero segments, which locks up the renderer.)
Corresponding Mellite workspace
note: needs Mellite 2.44.0 or higher. See README in the workspace for required sound files from the IEM cloud.
{hhr, 200319}
Perhaps there is not one database (from the microphone) but several, as intermediate processing results feed into different rendering processes. (For example, the thresholding goes into the minimum-phase and Lin-Kernighan process, but also into the Paul-stretch process.)
{hhr, 200317}
I am adding the second layer now, based on the time stretch algorithm. Since the files in the output pool are getting quite long and contain long pauses (as I want to preserve the original sequence of the thresholding), I want to add segmentation data—obtained from loudness contour—so that during playback the process can interrupt after each segment.
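A rough plain-Scala sketch of how such segmentation data could be derived (a simple RMS-in-dB contour stands in for the actual loudness measure; the threshold and window size are assumptions):

// Sketch: derive segment boundaries from a loudness contour, so that playback
// can stop after each segment. RMS-in-dB is a stand-in for the actual loudness
// measure; the -50 dB threshold is an assumption.
def rmsDb(win: Array[Double]): Double = {
  val meanSq = win.map(x => x * x).sum / win.length
  10 * math.log10(meanSq max 1.0e-12)
}

// returns (startFrame, stopFrame) spans of consecutive windows above the threshold
def segments(signal: Array[Double], winSize: Int, threshDb: Double = -50.0): Vector[(Int, Int)] = {
  val loud  = signal.grouped(winSize).map(w => rmsDb(w) > threshDb).toVector
  val res   = Vector.newBuilder[(Int, Int)]
  var start = -1
  for (i <- loud.indices) {
    if (loud(i) && start < 0) start = i
    val atEnd = i == loud.size - 1
    if ((!loud(i) || atEnd) && start >= 0) {
      val stop = if (loud(i)) i + 1 else i
      res += ((start * winSize, stop * winSize))
      start = -1
    }
  }
  res.result()
}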
{hhr, 200329}
Integrated the second layer (time-stretch resonances) now. It is currently not paused during recording. The dynamics are probably still too high; since I measure loudness, I could use that for a slight compression. The binaural image is slightly left-biased for some reason. The Pi is still happy, running the workspace for several hours without errors.
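Just as a sketch of what that slight compression could look like (the target level and ratio are made-up values, and `loudnessDb` is assumed to be the measured loudness contour):

// Gentle loudness-based gain reduction; -23 dB target and 1.5:1 ratio are
// made-up values, `loudnessDb` is assumed to be the measured loudness.
def compressGainDb(loudnessDb: Double, targetDb: Double = -23.0, ratio: Double = 1.5): Double =
  if (loudnessDb <= targetDb) 0.0
  else (targetDb - loudnessDb) * (1.0 - 1.0 / ratio)   // attenuate the excess above the target

def linearGain(gainDb: Double): Double = math.pow(10.0, gainDb / 20.0)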
Corresponding Mellite workspace
note: needs Mellite 2.45.0 or higher. See README in the workspace for required sound files from the IEM cloud.
{hhr, 200403}
In order to obtain four pools of snippets of similar lengths, we would establish minimum and maximum durations; the cascaded process would then conditionally write the lower frequency levels, and conditionally prepend "incomplete" snippets with short cross-fades. I have to test whether this works with FScape's If-Then construction, which would simplify running the process.
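A small plain-Scala sketch of that snippet bookkeeping, before worrying about the FScape If-Then version; the duration and cross-fade values are assumptions, and over-long material is simply split:

// Sketch: hold "incomplete" snippets and prepend them to the next one with a
// short cross-fade; all constants are assumptions, not workspace values.
val sr      = 44100.0
val minLen  = (sr *  30).toInt      // e.g. 30 s minimum
val maxLen  = (sr * 120).toInt      // e.g. 2 min maximum
val fadeLen = (sr * 0.1).toInt      // 100 ms cross-fade

def crossFade(a: Array[Double], b: Array[Double]): Array[Double] = {
  val n   = fadeLen min a.length min b.length
  val out = new Array[Double](a.length + b.length - n)
  System.arraycopy(a, 0, out, 0, a.length - n)
  var i = 0
  while (i < n) {
    val w = i.toDouble / n
    out(a.length - n + i) = a(a.length - n + i) * (1 - w) + b(i) * w
    i += 1
  }
  System.arraycopy(b, n, out, a.length, b.length - n)
  out
}

// returns (snippetsToWrite, newPending)
def accept(pending: Option[Array[Double]], chunk: Array[Double]): (Vector[Array[Double]], Option[Array[Double]]) = {
  val joined = pending.fold(chunk)(p => crossFade(p, chunk))
  if      (joined.length <  minLen) (Vector.empty, Some(joined))              // still too short: keep as pending
  else if (joined.length <= maxLen) (Vector(joined), None)                    // within bounds: write one snippet
  else                              (joined.grouped(maxLen).toVector, None)   // over-long: split (simplification)
}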
{hhr, 200403}
Cascaded resampling (16, 32, 64, 128). I wonder whether I should use a whitening filter or not. Somehow I like the idea of attenuating the resonances, as it functions as a counterpoint to the other process. On the other hand, moving the original spatial resonance up is also interesting. Perhaps both could act as variants?
(A factor of 16 ensures that while speech is still audible as a "speeding tape" kind of sound, the speech content is basically gone; even if you hypothetically recorded the space and transposed it down, only the articulation of the speech would be left. With 0.9 roll-off and a 44.1 kHz sampling rate, the new upper frequency is 1.2 kHz, so only the first formant is preserved.)
Plain recording, down-sampled by a factor of 16 (2 zero crossings, 87% roll-off, Kaiser 4) on the left channel, and by a factor of 32 on the right channel.
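To make the cascade explicit, here is a schematic plain-Scala sketch, not the FScape graph used for the rendering above; the linear-interpolation `resample` is only a stand-in for the real band-limited resampler (which used 2 zero crossings, 87% roll-off, Kaiser 4), and the frequency arithmetic reproduces the note above.

// Schematic cascade sketch; `resample` is a naive stand-in, illustration only.
def resample(in: Array[Double], factor: Double): Array[Double] = {
  // linear interpolation, no band-limiting
  val n = (in.length * factor).toInt
  Array.tabulate(n) { i =>
    val x  = i / factor
    val i0 = x.toInt min (in.length - 1)
    val i1 = (i0 + 1)  min (in.length - 1)
    val f  = x - i0
    in(i0) * (1 - f) + in(i1) * f
  }
}

// factors 16, 32, 64, 128 obtained by repeatedly halving the rate
def cascade(in: Array[Double]): Vector[Array[Double]] = {
  val by16 = (1 to 4).foldLeft(in)((s, _) => resample(s, 0.5))
  Vector.iterate(by16, 4)(resample(_, 0.5))   // by16, by32, by64, by128
}

// upper frequency after factor 16, using the 0.9 roll-off from the note:
// (44100 / 2) / 16 * 0.9 ≈ 1.2 kHz
val upperFreq = 44100.0 / 2 / 16 * 0.9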
{hhr, 200412}
There are interesting remarks on transparency (and the paradox of the impossibility of transparent white) in L. Wittgenstein (1950), Bemerkungen über die Farben (Remarks on Colour).