In which Reuben compensates for his terrible programming skills by throwing 400 processors at the problem:

The good news is I got my reverb raytracer working using OpenCL. It’s not a fully-fledged application or anything; it’s very much a proof of concept. But I’m getting results that are comparable to those of my Python implementation. And it’s MUCH faster. To compare: 

  • In Python, going from a .obj file to an impulse response takes about 5 minutes. The trace uses 1024 rays, which gives fairly precise results. My previous (sequential) C++ implementation took about 20 or 30 seconds.
  • In the new C++/OpenCL version, the entire program completes in about a second, but most of that is warm-up (the profiler says it spends 56% of the time in the ‘gcl_create_dispatch_queue’ function, which is a pretty fast function anyway). The actual trace and flatten take around 0.1 seconds, and then writing to file takes another fraction of a second. And that’s with 4096 rays!
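
For what it’s worth, the split between warm-up and trace is easy to see with a couple of std::chrono timestamps either side of the queue creation. A rough sketch (‘traceAndFlatten’ is a made-up stand-in for the actual kernel dispatch and flatten, not a real function):

    #include <OpenCL/opencl.h>     // Apple's OpenCL/GCD interface (gcl_*)
    #include <dispatch/dispatch.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        using clock = std::chrono::steady_clock;
        auto ms = [](clock::duration d) {
            return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
        };

        auto t0 = clock::now();
        // The profiler says this is where most of the run time goes.
        dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
        auto t1 = clock::now();

        // traceAndFlatten(queue);  // hypothetical: enqueue the raytrace, flatten the result
        auto t2 = clock::now();

        std::printf("warm-up: %lld ms, trace + flatten: %lld ms\n", ms(t1 - t0), ms(t2 - t1));

        dispatch_release(queue);
        return 0;
    }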

The bad news is it’s still not fast enough for real-time reverb.

I haven’t looked into really optimising for OpenCL, and I know I’m doing lots of things really badly (non-coalesced reads, diverging paths, putting everything in global memory). I think I read somewhere that you need to do about 200 arithmetic operations per memory fetch to get max flops, and I’m probably nowhere near that, but I don’t know how to profile and check. Also, I have a mysterious ‘printf’ in my kernel that breaks everything if it’s removed (maybe the OpenCL compiler optimises something important away when it’s not there? Who knows).
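
(Back-of-envelope, with made-up but plausible numbers: a single ray-triangle test reads three vertices of three floats each, i.e. nine 4-byte fetches from global memory, and does maybe a few dozen arithmetic operations, so something like 3 to 5 ops per fetch. Brute-forcing every ray against every triangle straight out of global memory is presumably bandwidth-bound long before it’s compute-bound.)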

Might give it a rest for the time being, do a bit more reading about optimising for OpenCL, then come back to it in a couple of weeks. Still not exactly sure what I want out of this project. Ideally, I’d like to make it easy for non-games-programmers to write music with dynamic reverbs, but there’s still a lot more to do before I’m anywhere near that point.

Back to University in a week.

This was posted 12 hours ago. It has 10 notes.

Anonymous said: How much followers you have?

reuben-thomas:

Let’s just say I won’t need to worry about integer overflow for quite some time.

That’s signed 8-bit integer overflow btw.

*cries*

This was posted 4 days ago. It has 13 notes.

Anonymous said: is there anywhere we can find your source code of processing gifs? Thnx.

Mostly no, but occasionally I supply links at the bottom of individual gifs (like this one) to their source. All the sources that are online can be found on my github gists page.

This was posted 4 days ago. It has 1 note.

Trying to do this properly, so I’m putting my completed assembly exercises (from the book ‘Programming from the Ground Up’) under version control, writing CMake build scripts for them, and hosting them here on the off-chance that anyone is interested in finding out exactly what the creeping advance of clinical insanity (read: novice assembly code) looks like.

This was posted 1 week ago. It has 4 notes.
Trying to learn assembly (again).

This was posted 1 week ago. It has 50 notes.

Went for a cycle for the first time in a gazillion years. Took a wrong turning somewhere and accidentally ended up in a world exactly like the one I’d left, but with twice as much pain.

This was posted 2 weeks ago. It has 13 notes.

Further to my last post: when reading assembler output, always check you’re reading the optimised release output rather than the debug output!

My optimised loop actually looks optimised now YAY

This was posted 3 weeks ago. It has 3 notes.
Well there’s a spelling mistake, and the addsub’d components don’t match the given equation. Haven’t traced to see if the graph is the issue or their choice of ops. Might be worth trying to solve this yourself using this methodology but check it :p

Yeah, you caught the same stuff I did I think.

As far as I can tell, when multiplying (a + bi)(c + di), the result the slide gives works out to (ac - ad) + (bd + bc)i, which is nonsense; the actual result should be (ac - bd) + (ad + bc)i.

At the moment I’m solving this by doing a ‘shufps’ on the result of the lower ‘mulps’ in the diagram, which rearranges the register into the order

| a1i * b1i | a1r * b1i | a2i * b2i | a2r * b2i |

so that the ‘addsubps’ ends up adding and subtracting the right components.

This should mean that I can do two complex multiplies in six instructions (I think), but I’m doing this all with intrinsics rather than raw assembly, and the assembler output still looks quite bloated to my untrained eye. Hopefully I can find some way of boosting it a bit more.
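
For reference, the standard SSE3 idiom for this looks something like the sketch below (six instructions for two interleaved single-precision complex multiplies; the shuffle order is slightly different from the one I described above, and the names are mine):

    #include <pmmintrin.h>  // SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps

    // a and b each hold two complex numbers, interleaved as | re0 | im0 | re1 | im1 |.
    static inline __m128 complex_mul2(__m128 a, __m128 b) {
        __m128 a_re   = _mm_moveldup_ps(a);                            // | ar0 | ar0 | ar1 | ar1 |
        __m128 a_im   = _mm_movehdup_ps(a);                            // | ai0 | ai0 | ai1 | ai1 |
        __m128 b_swap = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3, 0, 1)); // | bi0 | br0 | bi1 | br1 |
        __m128 t0     = _mm_mul_ps(a_re, b);                           // | ar*br | ar*bi | ... |
        __m128 t1     = _mm_mul_ps(a_im, b_swap);                      // | ai*bi | ai*br | ... |
        return _mm_addsub_ps(t0, t1); // | ar*br - ai*bi | ar*bi + ai*br | ... | = (ac - bd) + (ad + bc)i
    }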

This was posted 3 weeks ago. It has 4 notes.

Pulled this slide from an Intel presentation that I found here. I’m 99% sure this algorithm doesn’t actually work. Tried implementing it without any luck, then went back to check the presentation, at which point I realised that the method looked broken.

Two questions:

  1. Is this actually broken, or am I too tired?
  2. If it is broken, is there a similarly-fast method I can use instead? I’m not remotely familiar with the Intel vector instruction set, so I have no idea whether there are accepted idioms I could use for this (or even where to start).
This was posted 3 weeks ago. It has 5 notes.

I have a non-uniform partitioned convolution working now. Not threaded yet, but I’m not sure whether it even needs to be.

With the project built for release, it can do real-time convolution with a 6-second stereo impulse response, using 10% CPU on my MacBook.

I compared with Space Designer: running within Logic, the same impulse response used 30% CPU… but that’s with all of Logic running at the same time.

Even so, it looks like my naive, non-optimised implementation is performing OK, and if I thread it and replace the tight loops with SIMD instructions it’ll fly!

Feels nice to build something that’s actually fast for a change :D

This was posted 4 weeks ago. It has 12 notes.

In the end, I found one paper, but it’s fairly disappointing. It boils down to convolving the input signal twice, with the ‘previous’ and ‘current’ impulse responses, and then cross-fading in the time domain between the two outputs.

This was my ‘last resort’ idea all along, and it’s quite reassuring to know that there’s a paper I can cite if I do decide to go this route, but it doesn’t seem like the best solution somehow. The paper even says that ‘the output of the linear time-varying (LTV) system is a mixture of two linear time-invariant (LTI) filter signals which may not correspond to a true intermediate out signal.’
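
In sketch form (this is just to illustrate the idea, not anything from the paper; ‘crossfade_block’ and the variable names are made up), the per-block mixing step would be something like:

    #include <cstddef>
    #include <vector>

    // Linearly crossfade, over one block, from the signal convolved with the
    // previous impulse response to the signal convolved with the current one.
    std::vector<float> crossfade_block(const std::vector<float>& with_prev_ir,
                                       const std::vector<float>& with_curr_ir) {
        std::vector<float> out(with_prev_ir.size());
        for (size_t i = 0; i < out.size(); ++i) {
            float t = out.size() > 1 ? float(i) / float(out.size() - 1) : 1.0f;  // 0 -> 1 across the block
            out[i] = (1.0f - t) * with_prev_ir[i] + t * with_curr_ir[i];
        }
        return out;
    }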

Well, whatever. I can give it a go.

This was posted 1 month ago. It has 5 notes.

I’m thinking about raytracing audio in realtime, having seen (heard?) some demos of the technique. I found a talk about a Quake 3 mod which uses raytracing for audio, so I know it’s possible, but I’d like to make a tool that’s musician-friendly for doing this kind of stuff. Also offline time-invariant impulse responses just aren’t very exciting at this point.

I’ve built a simple partitioned FFT convolution (by which I mean a tiny class wrapper around some FFTW instances), which in theory should allow me to build a program that does real-time convolution of pre-rendered impulse responses (like Space Designer or Altiverb). It’s not perfect, and it’s not very performant, but it’s good enough for now. I’m fairly sure I can get it down to interactive latency levels if/when I need to.
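
A minimal single-partition version of the idea looks something like this (a sketch, not my actual class; the names are made up, and the caller is assumed to overlap-add the tail of each output block into the next):

    #include <fftw3.h>
    #include <algorithm>
    #include <complex>
    #include <vector>

    // One FFT-convolution step: transform a zero-padded input block, multiply it
    // pointwise by a pre-transformed impulse-response partition, transform back.
    // fft_size must be >= block length + partition length - 1, and ir_spectrum
    // must hold fft_size / 2 + 1 bins.
    std::vector<float> fft_convolve_block(const std::vector<float>& block,
                                          const std::vector<std::complex<float>>& ir_spectrum,
                                          size_t fft_size) {
        std::vector<float> time(fft_size, 0.0f);
        std::vector<std::complex<float>> freq(fft_size / 2 + 1);

        fftwf_plan fwd = fftwf_plan_dft_r2c_1d(
            (int)fft_size, time.data(),
            reinterpret_cast<fftwf_complex*>(freq.data()), FFTW_ESTIMATE);
        fftwf_plan inv = fftwf_plan_dft_c2r_1d(
            (int)fft_size, reinterpret_cast<fftwf_complex*>(freq.data()),
            time.data(), FFTW_ESTIMATE);

        std::copy(block.begin(), block.end(), time.begin());     // zero-padded input block
        fftwf_execute(fwd);

        for (size_t i = 0; i < freq.size(); ++i)                 // spectral multiply
            freq[i] *= ir_spectrum[i] / float(fft_size);         // fold in FFTW's 1/N

        fftwf_execute(inv);                                      // back to the time domain

        fftwf_destroy_plan(fwd);
        fftwf_destroy_plan(inv);
        return time;  // first block.size() samples are output, the rest is the overlap tail
    }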

However, what I really need is an algorithm for doing fast linear time-varying convolution. I don’t know how to do that, and I’m having a hard time researching it.

The key here is ‘fast’ rather than ‘linear time-varying convolution’. I know I could do discrete convolution with a different impulse response per input sample, but that would scale horrifically, and I’m sure there’s some kind of Fourier trick I can use instead… but I don’t know what that trick would be.
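
For comparison, the per-sample approach I’m ruling out would look something like this (‘ir_for_sample’ is a made-up stand-in for wherever the per-sample impulse responses come from):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Naive linear time-varying convolution: a (potentially) different impulse
    // response for every output sample. Cost is O(signal length * IR length),
    // with no FFT work shared between samples -- hence "scales horrifically".
    std::vector<float> naive_ltv_convolve(
            const std::vector<float>& input,
            const std::function<const std::vector<float>&(size_t)>& ir_for_sample) {
        std::vector<float> out(input.size(), 0.0f);
        for (size_t n = 0; n < input.size(); ++n) {
            const std::vector<float>& h = ir_for_sample(n);
            for (size_t k = 0; k < h.size() && k <= n; ++k)
                out[n] += h[k] * input[n - k];
        }
        return out;
    }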

Question time: Does anyone know about programming techniques for fast time-varying filters, or papers I should look up?

This was posted 1 month ago. It has 8 notes.