- maths question – standard deviation
careful now, i’m a bit thick.
i’m using a machine/software at work to collect data, and analyse it.
there are filters i can apply to reduce the noise/spikes in the data.
(if one data point says 0.05, but the other 500 say 0.01ish, then i don’t care about the 0.05)
the filter has a setting – a standard deviation value.
(default setting = 3)
now, my less-than-confident understanding of ‘standard deviation’ is that the value is in the same units as the data (in this case, millimeters).
and standard-deviation is the square-root of the average-squared-deviation.
and it’s given the name ‘sigma’
so, a default setting of 3 will be letting all my data through – right?
and if i want to remove the spikes, i should set the filter to 0.02ish?
But i’ve got a conflicting understanding; the bell-curve distribution diagram that keeps popping up on google says that +/- 1 sigma covers 68% of the data, and 2 sigma will cover 95%.
so, what’s the difference between my standard deviation (sigma) value of 0.01ish and the standard deviation curve sigma?
and what do i set my filter to?
if i get these numbers wrong, we won’t be able to calculate our path through hyper-space, and we’ll overshoot the earth by 12 light years…

Posted 6 years ago

bristolbiker (Member):
Um, I think you need to consult the manual to find out how the filter is applied – I might be misunderstanding your question, but from your first example….
(if one data point says 0.05, but the other 500 say 0.01ish, then i don’t care about the 0.05)
If your SD is set to 3, you are passing ~99.7% of the data; your one erroneous point of 0.05 among 500 means 99.8% of the data is ‘good’ at ~0.01, so that spike should be filtered. You’ll have to find out how the averaging is done – whether the data is filtered on absolute data values, on each individual value’s deviation, or by some other mechanism.
I’m sure this helps not one jot, but please stay safe out there in Hyper-space….

Posted 6 years ago

sweaman2 (Subscriber):
So… I’m not a statistician, so I’m hoping someone else will chip in… but the sigma figures assume your data has a normal distribution; if it does, then 68% of the data lies within one standard deviation of the mean.
3 times the SD covers about 99.7% of the data, so it should just take out real outliers.
But… your data doesn’t sound like it is normally distributed so you might have a bit of a different problem.
Can I ask, is the data spatial? (i.e. x,y,value) To me, working with spatial data, a filter of 3 would imply filtering on a 9×9 matrix.

Posted 6 years ago

sas (Member):
The 68%/95%/99.7% figures only apply if your data is normally distributed, i.e. if it follows a bell-shaped curve. You won’t know unless you can actually plot it.
Can’t you do what everyone else does, and just fiddle with the setting until it filters out the unwanted data points?

Posted 6 years ago
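The 68/95/99.7 coverage figures quoted above are easy to check empirically; here is a minimal numpy sketch (the sample is synthetic – real instrument data may well not be normally distributed, which is the whole caveat):

```python
import numpy as np

# Draw a large sample from a normal distribution and measure how much
# of it falls within 1, 2, and 3 standard deviations of the mean.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.01, scale=0.005, size=100_000)

mean, sd = data.mean(), data.std()
for k in (1, 2, 3):
    inside = np.mean(np.abs(data - mean) <= k * sd)
    print(f"within {k} sigma: {inside:.1%}")
```

With a sample this large, the three fractions come out at roughly 68%, 95%, and 99.7%.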
ta muchly for the replies so far, there’s some good stuff there for me to think about.
and yes, it’s X Y data, at least i think so…
(it’s complicated, and the software is a bit of a black-box)
Can’t you do what everyone else does, and just fiddle with the setting until it filters out the unwanted data points?
i’d love to, but i foresee a time when i get questioned about my method, and saying ‘er, that’s the value that made my graph look nice’ may not be good enough.
it will help if i can at least pretend to understand what i’m doing, and have a stab at explaining the implications.
(as with so much in life, i only need to understand a little more than my inquisitor to look like a genius, get the girl, and save the planet)

Posted 6 years ago

sweaman2 (Subscriber):
If the data is x,y data then you really need to understand the filter more. A filter of 3 on x,y,value data may not actually reject the data but just smooth it out (so changing all values in the local area)
But if it is x,y data then a map might be informative, or plot all the numbers to look at the distribution (map data isn’t normally distributed in general).

Posted 6 years ago

Klunk (Member):
when i used to do this sort of thing for mocap data, we would use an FFT to derive the cutoff frequency for the filter (which is nearly always 6 Hz for human motion). the FFT will decompose your data into components of different frequencies; you can then see where the bulk of the frequencies in the data lie and set the cutoff frequency appropriately.

Posted 6 years ago

JacksonPollock (Member):
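The FFT approach above can be sketched in a few lines of numpy; the signal here is invented (a 2 Hz sine plus noise, sampled at an assumed 100 Hz), purely to show how the dominant frequency falls out of the spectrum:

```python
import numpy as np

# Hypothetical signal: 2 Hz motion plus high-frequency noise, sampled at 100 Hz.
fs = 100.0
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 2 * t) + 0.1 * rng.standard_normal(t.size)

# Decompose the signal into frequency components with a real FFT.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The dominant frequency shows where the bulk of the energy lies;
# a low-pass cutoff would be set a little above it.
dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
```

Here `dominant` recovers the 2 Hz component; on real data you would inspect the whole spectrum rather than just the peak.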
Std Dev is the square root of variance.
The filter will take out data that is beyond 3 * Std Dev from the mean of that data sample.

Posted 6 years ago
for example: say the calculated std dev = 0.002 and the average (mean) of the 500 data points is 0.017
then 3*Std dev=3*0.002=0.006
0.017 (mean) + 0.006=0.023
0.017 (mean) – 0.006=0.011
So the filter presumably (based upon my example) will ignore data <0.011 or >0.023. These are your ‘tolerance’ limits.

poly (Member):
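The description above amounts to a simple sigma-clipping filter; here is a minimal Python sketch (the function name and the test numbers are illustrative – this is a guess at the mechanism, not the instrument’s actual implementation):

```python
import numpy as np

def sigma_filter(data, n_sigma=3.0):
    """Keep only points within n_sigma standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    sd = data.std()
    lo, hi = mean - n_sigma * sd, mean + n_sigma * sd
    return data[(data >= lo) & (data <= hi)]
```

With a cluster of points near 0.017 and one spike at 0.05, a setting of 3 drops the spike and keeps the rest, mirroring the 0.011–0.023 tolerance limits in the worked example.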
sweaman2 – I don’t think it is X/Y spatial data and applying a matrix filter to it. The clue is that the software refers to this value as the standard deviation which doesn’t have an obvious relevance; I suspect the OP isn’t understanding what you mean by “X/Y” data – he assumes it is data plotted on an x and y axis.
ahwiles – as others have said to get an accurate understanding of what the software is doing you need to consult the manual or some other source of expertise. It is quite possible that the programmer knows no more than you and is making up the stats as he is going along! However if we assume that he knows what he is doing I have seen similar features before, let me explain what those do and you can try and follow how it might apply to your instrument:
– let’s assume you are logging a temperature every second.
– let’s assume you want to plot the average temperature every 10 minutes (600 measurements).
– let’s assume you want to filter out rogue values that might be caused by electrical or other noise/spikes outside the range that could reasonably be expected.
– let’s assume your data is normally distributed (wiki will provide a good explanation, but it’s relatively easy to plot those 600 points on a histogram and see if it looks bell-shaped or not).
– let’s call the mean (average in layman’s terms) of your 600 data points: x
– let’s call the standard deviation of your 600 data points: s (because sigma is too hard!).
– (as others have said) for normally distributed data ~ 68% of the data points will fall in the range x-s to x+s
– 95% of the data will fall in the range x-(1.96)s to x+(1.96)s (often rounded to +/- 2s for simplicity). So with 600 data points we would expect 30 points to fall outside the range.
– 99.73% of the data points will fall in the range x-3s to x+3s. With 600 data points you would expect one or two values to fall outside this range.
– 99.994% of the data points will fall in the range x-4s to x+4s. With 600 data points it would be unusual to find any valid data point outside this range.
If that makes sense, then what I believe the value you are setting in the software filter may be doing is calculating the standard deviation of the whole data set (all 600 points in my example) and then discarding any data more than 3s from the mean (if it is set to 3). If you have 600 data points that might be slightly dangerous, because you would expect to get some valid data in that range. If you only had 60 data points it becomes much less likely that you get valid data points in that range (and therefore more likely that any point outside it is a genuine erroneous measurement).
Now if your data is perfectly symmetrically distributed and you are only interested in the mean (not the deviation / error / spread of the data) then it may be OK to discard some valid data too – but it’s normally not good to throw away data that is, or may be, valid.
Beware: if your data can never become less than zero (e.g. optical absorbance measurements, temperature on the Kelvin scale, etc.), or your measurement system cannot provide negative values, but you are working close to zero, then your data may suddenly no longer follow a normal distribution.
So going back to your data: if it has a mean of 0.01 and a standard deviation of 0.005, a normal distribution, and can go negative, you would roughly expect* about 475 of your data points to fall in the range 0 to 0.02. If you set the filter at 4s then any value > 0.03 (or < -0.01) would be discarded. Realistically though, if it is one value out by that much, it’s not going to cause a massive shift in the data (roughly, the one value in 500 being out by that much would shift the mean from 0.01 to 0.0101 – a badly applied filter could cause similar errors). I’ve only seriously used such filters when processing lots of data to screen out hugely erroneous values – e.g. if you got a single result of 0.5 amongst the others, or if you frequently get spikes you are sure are caused by an artefact.
Sorry – that was supposed to be a short answer!
* i.e. if you did this experiment lots of times, the average number of values in that range would be around 475 – it is quite possible to get weird results, just as you could toss a fair coin and get heads ten times in a row.

Posted 6 years ago

matt_bl (Member):
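The expected-count figures in the list above follow directly from the normal CDF; a small standard-library sketch (again assuming normally distributed data):

```python
import math

def expected_outside(n, k):
    """Expected number of points, out of n, falling beyond k standard
    deviations of the mean, assuming a normal distribution."""
    p_inside = math.erf(k / math.sqrt(2))  # P(|x - mean| <= k * sd)
    return n * (1 - p_inside)

# For 600 points: roughly 190 beyond 1s, 27 beyond 2s,
# between 1 and 2 beyond 3s, and essentially none beyond 4s.
```

This is why a 3s cutoff on a 600-point sample will occasionally clip one or two perfectly valid points.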
You don’t really say what you are intending to do with the data afterwards.
All you are describing is a very crude outlier rejection.
You could simply use robust statistics on the data and one result in 500 would make no difference to the results of your analysis, no matter how far from the central value.
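The robust-statistics point above can be illustrated with the median and the median absolute deviation (MAD); the numbers below mirror the thread’s 500-points-plus-one-spike example and are purely illustrative:

```python
import numpy as np

# 500 points near 0.01 plus one wild outlier, as in the original example.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.01, 0.001, 500), [0.05]])

mean = data.mean()                       # pulled slightly by the outlier
median = np.median(data)                 # essentially unaffected
mad = np.median(np.abs(data - median))   # robust estimate of the spread
```

The median and MAD barely move no matter how extreme the single outlier is, which is exactly the property that makes robust estimators attractive here.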
Matt

Posted 6 years ago

choron (Member):
ahwiles: You could just crop the highest and lowest values. It would be better to know a little of what kind of signals (dimensions etc) you are using and what you will do with them though, I may be able to give you some more sophisticated advice.
IanMunro: LPF design is pretty straightforward if you’re using something like Matlab. Are you after a pseudo-analogue filter (approximation of Butterworth, Bessel, Chebyshev, etc.), or an FIR? Both are fairly straightforward with the appropriate tools.

Posted 6 years ago
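For the Butterworth option mentioned above, the scipy equivalent of the Matlab workflow might look like this (the sampling rate, signal, and 6 Hz cutoff are all invented for illustration; the cutoff echoes the mocap figure quoted earlier in the thread):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 200.0  # assumed sampling rate, Hz
t = np.arange(0, 5, 1 / fs)
slow = np.sin(2 * np.pi * 1 * t)           # motion we want to keep (1 Hz)
noise = 0.5 * np.sin(2 * np.pi * 30 * t)   # fast noise to remove (30 Hz)
signal = slow + noise

# 4th-order Butterworth low-pass with a 6 Hz cutoff.
b, a = butter(4, 6 / (fs / 2), btype="low")
filtered = filtfilt(b, a, signal)  # filtfilt gives zero-phase filtering
```

`filtfilt` runs the filter forwards and backwards, so the filtered trace stays time-aligned with the original, which matters if you go on to compare it against other channels.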
The topic ‘maths question – standard deviation’ is closed to new replies.