Here’s a technique that works well for my case:
There are always foreground elements in the scene, like pedestrians, birds, or cars. This rules out the approach of choosing a single reference frame and computing foreground elements relative to an empty street. So, we’ll need a more flexible definition of foreground and background:
Start with three frames. There are no ‘clean’ frames in this sample - each one contains cars, pedestrians, bicyclists, and so on.
To isolate the background, we’re going to run an operation over every pixel position across these images. For instance, if the pixel in the very top left corner is black in 2 frames and white in 1, averaging it across the frames will make it a 66% gray.
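With the frames stacked into a single array, that per-pixel operation is one call in, say, NumPy. A minimal sketch, using made-up 2×2 grayscale frames (0 is black, 255 is white) in place of real photos:

```python
import numpy as np

# Hypothetical 2x2 grayscale frames standing in for the three photos.
frames = np.stack([
    np.array([[0, 255], [255, 0]], dtype=np.float64),   # frame 1
    np.array([[0, 255], [255, 0]], dtype=np.float64),   # frame 2
    np.array([[255, 0], [0, 255]], dtype=np.float64),   # frame 3
])

# Averaging along the frame axis runs the operation independently at
# every pixel position: the top-left pixel is black in 2 frames and
# white in 1, so its mean lands at 255/3 = 85, a dark (66%) gray.
mean_image = frames.mean(axis=0)
print(mean_image[0, 0])  # 85.0
```

Swapping `mean` for `min`, `max`, or `median` along the same axis gives the other statistics discussed below.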
Here’s what a mean looks like:
Ghostly cars are not ideal. The same ghosts haunt min & max:
Statistical minds have already figured it out: foreground objects are outliers, so we’ll need a more robust statistic - the median.
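The robustness is easy to see on a single pixel’s history. In this sketch (the numbers are invented), each row is one pixel’s value across 5 frames, with a bright car passing through a dark pavement pixel:

```python
import numpy as np

# Each row is one pixel's value across 5 frames; pavement sits near 60,
# and a passing car makes the pixel bright for 1-2 frames.
histories = np.array([
    [60, 61, 230, 59, 60],    # car passes through in one frame
    [60, 225, 240, 61, 60],   # car lingers for two frames
])

# The mean is dragged toward the outliers; the median ignores them
# as long as the background is visible in a majority of frames.
print(histories.mean(axis=1))         # [ 94.  129.2]
print(np.median(histories, axis=1))   # [60. 61.]
```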
The result isn’t perfect - some areas where bicyclists and motorists overlapped are still noticeable in the green protected area of the intersection.
But it’s quite usable: from here, all we need to do is subtract the median from each frame, and we’re starting to see isolated motion.
Now, instead of simply subtracting the median from each pixel, check whether the difference between the two exceeds a certain threshold, and return the frame’s pixel if it does. This way colors stay true to what’s visible in the image, rather than flipped and skewed by the difference between the median and the frame.
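A sketch of that thresholded copy, assuming grayscale frames and a hypothetical threshold of 30 (a value you would tune per dataset):

```python
import numpy as np

def extract_foreground(frame, median, threshold=30):
    """Keep the frame's own pixels where they differ enough from the
    median background; everything else goes black."""
    # Widen to a signed type so the subtraction can't wrap around;
    # for color images you would also reduce over the channel axis.
    diff = np.abs(frame.astype(np.int16) - median.astype(np.int16))
    mask = diff > threshold
    return np.where(mask, frame, 0)

# Toy example: one bright object over a mid-gray (100) background.
median = np.full((2, 2), 100, dtype=np.uint8)
frame = np.array([[100, 210], [105, 98]], dtype=np.uint8)
print(extract_foreground(frame, median))
# [[  0 210]
#  [  0   0]]
```

Only the pixel that moved far from the background survives; small noise around the median (105, 98) is suppressed rather than showing up as faint difference colors.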
The frames we picked for this example are close together in time, so they share similar daylight. A larger sample includes much more diversity in lighting:
The solution to this issue is to use windows: instead of finding the median of all frames in the dataset, run medians over local samples. So, a frame of video at 2pm would be compared against a median of all frames between 2:00 and 2:15pm.