# Does this make sense? A binomial distribution with a twist.

I'm working on a project studying sea ice in the Arctic Ocean. A brief overview of the essentials: the ice pack over the Arctic starts shrinking every summer around June 1st and begins to recover around Sep 15th. I'm interested in the movement of the ice edge as the pack shrinks.

Here are the pieces:
1. Ice "RETREAT" means the ice edge is shrinking.
2. Ice "ADVANCE" means the ice edge is expanding.
3. Wind direction and magnitude can affect whether the ice RETREATS or ADVANCES
4. In the summer, there is a thermodynamic component driving overall RETREAT

Here's what I'm trying to do:
I want to determine (statistically) the wind directions and magnitudes for which wind does affect the movement of the ice edge. For example, if the wind is blowing exactly normal to the ice edge (towards the ice), the ice is pushed back and we get retreat. If it's blowing the other direction, we get advance. Physics comes into play in other directions leading to somewhat unintuitive results.

Here are the basics of my strategy:
I've defined the ice edge as a set of x, y coordinates. For each x,y point, I have the distance that it moved since the previous day. For this analysis, I classify this distance as either a RETREAT or an ADVANCE. There is one ice edge per day over the summer, and of course multiple points for each day (the correlation between these points is part of the "twist"). But ignoring that for now, here's what I did:

Find the fraction of RETREATS across all points (I've unbiased the data so that all wind directions are equally represented). The answer, incidentally, comes out to be about 0.6. Call this p.

Treating this like a binomial distribution with a "success" being a retreat, and a "failure" being an advance,
if I draw n samples randomly from this data set, the expected number of retreats should be:
E(R) = np

And the Standard Deviation is given by:
std = sqrt(np*(1-p))

If AR = Actual number of retreats that I did get in the sample, then I can compare to E(R):

Number of stds away from E(R) = (AR - E(R)) / std

So, if my "random sample" is not actually random, but instead selected based on wind direction and magnitude, and I find that the number of stds away from E(R) is greater than 2 in absolute value, I can be pretty sure that wind is significantly affecting the movement of the ice edge.
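As a minimal sketch of the comparison described above, the calculation can be written in a few lines of Python. The function name and the sample numbers are hypothetical, chosen only for illustration, and the formula assumes that every point is an independent trial with the same p, which is exactly the assumption questioned later in the thread.

```python
import math

def retreat_z_score(n_retreats, n_samples, p_retreat):
    """Number of binomial standard deviations between the observed
    retreat count (AR) and the expected count E(R) = n * p."""
    expected = n_samples * p_retreat
    std = math.sqrt(n_samples * p_retreat * (1.0 - p_retreat))
    return (n_retreats - expected) / std

# Hypothetical sample: 200 points drawn for one wind bin, 140 of them retreats.
z = retreat_z_score(n_retreats=140, n_samples=200, p_retreat=0.6)
print(round(z, 2))
```

A value of z well above 2 or well below -2 would flag the wind bin as unusual under this model.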

Complications:
1. The ice edge doesn't retreat at the same rate over the whole summer. I broke it down into months, and p varies from about 0.58 to 0.62.

2. I can't treat each point in the ice edge as a separate observation. It makes more sense to think of the retreat as a percentage of the total ice edge.
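
One possible reading of the second complication is to collapse each day's edge into a single number, the fraction of that day's points that retreated. The sketch below assumes the points are roughly evenly spaced along the edge, so that a fraction of points approximates a fraction of edge length. The function name, the "R"/"A" labels and the sample data are hypothetical.

```python
from collections import defaultdict

def daily_retreat_fraction(observations):
    """observations: iterable of (day, label) pairs, label "R" or "A".
    Returns {day: fraction of that day's edge points that retreated}."""
    counts = defaultdict(lambda: [0, 0])  # day -> [retreats, total points]
    for day, label in observations:
        counts[day][1] += 1
        if label == "R":
            counts[day][0] += 1
    return {day: r / n for day, (r, n) in counts.items()}

# Hypothetical labels for two days, for illustration only.
obs = [(173, "R"), (173, "R"), (173, "A"), (173, "R"), (174, "A"), (174, "R")]
fractions = daily_retreat_fraction(obs)
print(fractions)
```

With this reduction there is one observation per day rather than one per point, which sidesteps the correlation between neighboring points at the cost of a much smaller sample.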

I've taken these things into consideration, but I'm not sure if what I did is correct. But before I get into that, are there any other general problems with this technique? Am I so far off in left field that it's not worth considering the rest? If so, do you have any suggestions, by chance?

I hope I've made the problem clear enough. Let me know if more detail is required!!

Wendy

Stephen Tashi
If you are trying to do a hypothesis test, try to state the "null hypothesis". Say what the "population" is and what the parameter of interest is.

If you can get enough data, it would strike me as more interesting to develop a model for ice front movement rather than to do a yes-or-no hypothesis test of some sort.

I've defined the ice edge as a set of x, y coordinates. For each x,y point, I have the distance that it moved since the previous day.

If the ice edge grows or shrinks by new ice forming or melting, how do you relate a point (x,y) on one day to a new position (x,y) on the next day? I can see that if you had a marker on the edge of the ice and the ice edge grew by ice forming "inland" and pushing the edge out further, you could define (x,y) from day to day unambiguously by the position of the marker. But if the ice on the edge melted or more ice formed beyond the current edge, how do you relate the (x,y) data on one day to data for the "same" place on the next day?

(the correlation between these points is part of the "twist").

Do you mean correlation between the data for the "same" place between various days? Is data for the same day correlated for places that are adjacent to each other?

(I've unbiased the data so that all wind directions are equally represented)
How did you unbias the data?

Do you have daily data for air and sea temperatures?

Hi Stephen,

Here's some more details about how I assembled the data I'm analyzing:
1. The daily ice field data (acquired via satellite) is ice concentration data. Values range from 0 (no ice) to 1 (100% ice). It's on a grid, so the concentrations describe the conditions in each bin (25 km²).

2. The x,y coordinates are the 0.15 ice concentration contour of the field. This is standard.

3. To find how far the ice edge travelled from one day to the next, I first found the direction of the maximum gradient of the ice concentration at each x,y ice edge point. This gives me the normal direction to the ice edge. I then follow this perpendicular line to find where it crosses the ice edge from the previous day. I find the distance between these two points. Point one is from today, point two from yesterday.
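
The measurement in step 3 can be sketched roughly as follows, assuming the unit normal at each point has already been computed from the concentration gradient and that the other day's edge is stored as an ordered list of (x, y) vertices in a flat Cartesian plane. The function names are hypothetical, and the sign convention (positive means the other edge lies in the direction of the normal) is an assumption of this sketch, not something stated in the post.

```python
def cross(a, b):
    """z-component of the 2D cross product a x b."""
    return a[0] * b[1] - a[1] * b[0]

def distance_along_normal(point, normal, other_edge):
    """Signed distance from `point` along the unit vector `normal` to the
    polyline `other_edge` (list of (x, y) vertices from the other day).
    Returns None if the normal line misses every segment."""
    best = None
    for q0, q1 in zip(other_edge, other_edge[1:]):
        d = (q1[0] - q0[0], q1[1] - q0[1])        # segment direction
        w = (q0[0] - point[0], q0[1] - point[1])  # point -> segment start
        denom = cross(normal, d)
        if abs(denom) < 1e-12:                    # segment parallel to normal
            continue
        t = cross(w, d) / denom                   # distance along the normal
        s = cross(w, normal) / denom              # position on segment, 0..1
        if 0.0 <= s <= 1.0 and (best is None or abs(t) < abs(best)):
            best = t                              # keep the nearest crossing
    return best

# Hypothetical example: the other day's edge is a straight segment 2 units away.
print(distance_along_normal((0.0, 0.0), (0.0, 1.0), [(-1.0, 2.0), (1.0, 2.0)]))
```

Intersecting with the segments between vertices means the crossing point does not have to coincide with a stored vertex of the other day's edge.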

About unbiasing the data so that all wind directions are equally represented:
For each x,y ice edge point, I have a wind direction (0 to 360 degrees, continuous). My analysis uses these winds binned into 10 degree increments, so I have 36 wind directions spanning 0 to 360. I'm using 7 years' worth of data, so I have quite a lot. To unbias the data, I first counted all the data I had in each wind bin, and made note of the lowest count. Then I randomly subsampled the data in each wind bin so that each bin contained just the minimum count of data. So now all wind bins are equally represented.

In actuality, I did this for each month separately. So all June data has the same amount of data for each wind direction, and the same for Jul, Aug, and Sep.
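
The subsampling described above could look something like the sketch below. The record layout (wind direction first, then whatever else is stored for the point), the function name and the fixed seed are assumptions made for illustration. Only bins that contain at least one record are considered, so an empty bin would have to be handled separately.

```python
import random
from collections import defaultdict

def balance_wind_bins(records, bin_width_deg=10.0, seed=0):
    """records: list of tuples whose first element is a wind direction in
    degrees. Returns a random subsample in which every non-empty direction
    bin holds the same number of records (the smallest bin's count)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for rec in records:
        direction = rec[0] % 360.0
        bins[int(direction // bin_width_deg)].append(rec)
    min_count = min(len(members) for members in bins.values())
    balanced = []
    for key in sorted(bins):
        # Draw min_count records without replacement from each bin.
        balanced.extend(rng.sample(bins[key], min_count))
    return balanced
```

To follow the month-by-month procedure, the function would be called once on each month's records rather than on the pooled data.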

A null-hypothesis:
The null-hypothesis is that all subsamples drawn from the Big Overall Data Set will have (approximately) the same distribution of Retreats and Advances as the Big Data Set, and that since I'm drawing samples based on wind direction and magnitude from the Big Data (which is unbiased to wind direction), this implies that wind direction has no effect on ice edge movement.

My hypothesis is that wind direction does impact movement of the ice edge. If the distribution of Retreats in the subsample (chosen based on wind direction and mag) differs from the distribution of Retreats in the Big Data Set, then wind direction and magnitude do affect the movement of the ice edge.

Why not just model the process rather than analyze the data statistically?
I have lots of data. Also, there are some complicated processes that affect this problem, including things like how thick the ice is, and what the quality of the overall ice pack is. In August, for example, the ice is much more broken up and moves more easily. Note also that "Ice Retreat" does not necessarily mean that any ice is lost. The movement of an "island" of ice due to winds is actually a pretty well investigated problem, incidentally.

Auto-correlation in the data:
I'm talking about the x,y points defining the ice edge. Points near each other do not act independently of each other. Say I have 20 data points describing 50km along the ice edge, across which a wind is blowing all in one direction, and the whole 50km stretch of ice retreats. Is that one data point, or 20?

About other data fields I have:
Yes, I have air temp and sea surface temperature data from satellites and other sources as well. The sea surface temperature data is a big factor in my overall analysis, while air temperature data has no effect because of the difference in heat capacity between air and ocean.

I've already done a great deal of analysis for this project, and I have a really good idea of what's going on. I'm just getting really rigorous now and refining the numbers and plots to be precise.

Thanks again!!
Wendy

Stephen Tashi

Should I visualize "the ice pack" as a mathematically "connected" 2D geometric figure like a distorted circle? Or does it consist of several, perhaps many, islands of ice?

If I thought of the ice pack as a circle, the terms "advance" and "retreat" would be clear because "advance" would mean in the direction away from the center and "retreat" would mean the opposite direction. But if I have a figure with a complicated perimeter, I don't know how to define an "advance". What establishes the reference direction for "advance"? Is it referenced to the direction of the wind? Or is it referenced to the shape of the ice pack?

As I understand the method for defining the movement of a point on the edge, there might be a different number of points in the data for the edge on day N+1 than there were on day N, since the total length of the edge may have contracted or expanded. Is that correct?

Are you using some smoothing technique on the gridded data to compute the gradient of the ice concentration at (x,y) and the intersection of the line of the gradient with the (continuous) edge of the ice pack on the previous day?

(Satellite data usually goes through some elaborate transformations before it gets to a nice format. It's worth doing sanity checks on it before investing much time analyzing it. For example, see if Mercator projections were done correctly. See if the same projection was used for all measurements.)

Hi Stephen,

You make some great points, and you are definitely understanding my problem.

The ice pack can generally be pictured as a circle, so that you are right to picture "retreat" as the edge moving toward the center, which means the circle is getting smaller. You can picture the center of the circle as being the North Pole.

So, picture two circles, one inside the other. The big circle is day one, the small circle is day 2. The distance travelled is the radius of the big circle minus the radius of the small one. There are fewer points on the small circle, but I can still draw a perpendicular line from a point on the small circle to the big circle. I do not need to have defined an x,y point precisely on the big circle where the perpendicular line crosses. I can get nearly the exact point needed by interpolating between points on the big circle. So each day I get a whole new set of points, and they are not associated with the points from the previous day. Good question!

I'm actually just looking at one region in the Arctic. Check out the figure below. The North Pole is off the plot to the lower left. There are latitude circles drawn on this plot, and one longitude line. The longitude line goes to the pole. Pink is ocean. The colors indicate ice concentration. The ice edge is the part bordering the pink area. The thick white/gray line is low ice concentration, with the actual ice edge being 15% ice concentration. The ice concentration generally increases as you head toward the pole.

You are so right about satellite data! This ice concentration field is indeed highly processed. I actually have several ice concentration fields to compare to, as well as model data. I'm fairly confident about the position of the ice edge. The stuff inside the pack is highly variable, though, and errors are introduced by melt ponds being interpreted as ocean by the satellite. In addition, sometimes clouds don't allow the satellite to get good data in an area, so the final field ends up being interpolated in time and space. In my favor, I have 7 years' worth of data, which I believe is enough to get the right answer statistically. There are errors in my data, but they should average to zero, while the actual signal should stand out.

http://ipab.apl.washington.edu/example/IceConcLaptev2011-021.png [Broken]

Hi Stephen,

It is referenced to the perpendicular direction of the ice edge. And I think I made a small mistake in my description earlier. It has to do with WHEN to attribute the distance the ice edge travelled. Between day 1 and day 2 the ice edge moved. Do I assign that movement to day 1 or day 2? The way I explained it above, I assigned it to day 2, and found it by drawing a line to the day 1 ice edge.

I actually went the other way. In the picture above, we have the ice edge on one day, as well as the average wind vectors for that day (day 173). What effect did the wind have on the movement of the ice? We have to look where the ice edge is the next day (day 174). Then the distance travelled is associated with day 173, when the wind causing the movement was active.

Wendy

Stephen Tashi
I don't understand whether the ice "edge" is a mathematical contour line or whether it has an area. The gray part of the picture looks to be an area, so for a given grid square, all 4 corners of it might be in the edge.

Maps of terrain elevation can have closed contours. For example, there could be two closed 400 ft elevation contours on a map that showed two distinct hills. Is it correct to say that in the ice edge data, such a pair of closed contours for the 15% ice concentration is regarded as an error in measurement? If so, would it be "cleaned up" and removed from the data?

Let's see if I understand other details.

A line from the N pole to the point (x,y) defines the direction of advance at the point (x,y), so I can imagine this direction as a unit vector U with tail at (x,y) pointing along the line in the direction away from the N. Pole. If the vector V represents the movement of the point (x,y) then the projection of V on the unit vector U measures the extent of the "advance" of (x,y). If we want to classify the movement as "Advance" or "Retreat", we call a positive projection "Advance" and a negative projection a "Retreat".
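
The classification described in this paragraph can be sketched as below, assuming flat Cartesian coordinates with the pole at a known position. The function name and the example vectors are hypothetical, and a projection of exactly zero is arbitrarily labeled "Retreat" here.

```python
import math

def classify_movement(point, movement, pole=(0.0, 0.0)):
    """Project the movement vector V of an edge point onto the unit vector U
    that points from the pole through the point (away from the pole)."""
    ux, uy = point[0] - pole[0], point[1] - pole[1]
    norm = math.hypot(ux, uy)
    ux, uy = ux / norm, uy / norm
    projection = movement[0] * ux + movement[1] * uy
    label = "Advance" if projection > 0 else "Retreat"
    return projection, label

# Hypothetical point 5 units from the pole that moves 5 units toward it.
print(classify_movement((3.0, 4.0), (-3.0, -4.0)))
```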

I don't understand the goals of the analysis. You said you wanted to determine which wind speeds and directions affect the advance of the ice edge. Suppose you have done that. What predictions or actions could that lead to? There is the type of scientific paper that presents a sequence of statistical investigations that portray the author(s) taking a completely open minded approach and being ready to discard a theory if any stage of the investigation had been inconclusive. I don't see how your proposed analysis fits into such a narrative.

If one was going to argue in favor of further investigation of wind on ice edge movement, I'd find it more convincing to see an analysis of the wind directions relative to the movement directions instead of being binned according to their navigational bearing. For example, if we take the direction of movement of (x,y) as the reference direction defined by the unit vector u, then at each (x,y) we can plot the magnitude of movement in that direction on the x-axis and the projection of the wind vector on u on the y-axis. That would indicate whether local movement is correlated with the component of wind in that direction.
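
A rough sketch of that suggestion: for each point, compute the magnitude of the movement and the projection of the wind vector on the movement direction u, then summarize the scatter with a Pearson correlation coefficient. The function names and the sample vectors are hypothetical, and points with zero movement would have to be skipped because u is undefined for them.

```python
import math

def wind_along_movement(movement, wind):
    """Return (|movement|, projection of wind on the unit vector u that
    points in the direction of movement)."""
    mag = math.hypot(movement[0], movement[1])
    ux, uy = movement[0] / mag, movement[1] / mag
    return mag, wind[0] * ux + wind[1] * uy

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical movement and wind vectors at three edge points.
movements = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
winds = [(2.0, 0.0), (0.0, 4.0), (6.0, 0.0)]
pairs = [wind_along_movement(m, w) for m, w in zip(movements, winds)]
r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
print(r)
```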

There are analyses of coarser phenomena that would also make it plausible that a theory can proceed. For example something like a correlation between total decrease in ice area vs mean wind direction taken along the ice edge.

Distracted by such thoughts, I have avoided thinking about how to do the analysis you proposed and right now I must go visit someone, so I can't think about it!

Hi Stephen!

I really appreciate your attention to this problem! I know we haven't actually talked about whether my methods are correct yet, but the first step is of course determining whether the question makes sense in the first place.

Yes, you are absolutely correct about the winds needing to be relative to the ice edge to make any sense! The wind directions I'm using are all relative to the ice edge. I'm sorry I was not at all clear about this! I should have been saying "wind direction with respect to the ice edge" the whole time. So for example in the plot above, the winds blowing over the pink area (the ocean) are all going in about the same direction, but relative to the ice edge, we get winds in all directions. In some places it's blowing straight into the ice. In some places it's blowing away from the ice. (I'm also sorry I forgot to point out in that map figure that the gray area is land. That may help with the interpretation.)

Retreat and Advance in this analysis are with respect to the ice edge as well. I find the normal to the ice edge, and retreat and advance are measured by how far the edge moves in this direction. The direction is different for each x,y point.

This project I'm working on really does not revolve around this analysis. Determining wind speeds and directions that affect the movement of the ice is just one little piece of information I use in a much more extensive analysis. I know the approximate answer already, and changing it slightly doesn't affect the overall results. In the course of this part of analysis, though, I have discovered that my original definition of what constitutes "Ice Retreat Winds" vs. "Ice Advance Winds" is too simplistic. I was just using the wind direction relative to the ice, assuming that the only effect from the speed of the wind would be on the amount of the resulting ice movement. Not true... makes sense now that I think about it. But even this revelation will not affect my other results.

The powerpoint slide below gives the results of my statistical analysis. It might not be precisely right, but the gist is right, I'm sure.

The location of each bin in this plot shows the magnitude and direction of the wind with respect to the ice edge. Straight up means wind was blowing exactly into the ice. The arrow on the plot shows this.

The color tells the degree to which it affects the movement of the ice edge. White = essentially no effect on retreat or advance. Yellow/Orange = this wind is causing ice retreat. Blues = this wind is causing ice advance.

This plot will be confusing to many because it seems like all the retreat directions should be in the top half of the plot, and all the advance directions should be in the bottom half. The reason we don't see that is because of geostrophic effects due to the rotation of the Earth. Things turn to the right in the Arctic.

http://ipab.apl.washington.edu/example/WindSTDs.png [Broken]

Stephen Tashi
That is a very informative graphic. It could be presented without mentioning "standard deviation". If you mention standard deviation, you'll have to define what random variable is involved. My interpretation is that each bin is assumed to be the realization of a binomial random variable consisting of M independent trials (M is the same for all bins) with probability p = 0.6 of success on each trial and "standard deviation" means the standard deviation of such a random variable. The graphic makes it implausible that this model is correct because it shows a pattern depending on the direction and magnitude of the wind.

What's the goal of your statistical analysis? I think there are many ways to set up a hypothesis test that would reject the above model. But rejecting the above model only rejects the idea that all the bins are samples of the same random variable. It doesn't confirm a specific pattern for how the data is generated.

Hi Stephen,

Ok, I feel like we're getting to the heart of the matter now. This is new territory for my brain... I do a lot of regression and curve fitting and such in my work, but this is different. So I'm trying to get my head around this:

But rejecting the above model only rejects the idea that all the bins are samples of the same random variable. It doesn't confirm a specific pattern for how the data is generated.

So let me just think out loud for a minute, and if you could point out where my thinking is wrong, I'd appreciate it.

I have what I've referred to as a "Big Overall Distribution". I'm gonna call it the Parent Distribution now. The random variable for this distribution is the distance that the ice edge travelled. I simplified it by classifying the distance travelled as either a Retreat or an Advance. So my random variable is D for distance, and

p(D=Retreat) = 0.6

But this Parent Distribution does not contain any information about why p(D=Retreat) = 0.6

So I can determine whether any particular subset of the data follows the pattern of the Parent Distribution, and I can hypothesize the reasons for this, but I cannot really conclude based on this model that these are the reasons?

Is that what you mean?

Stephen Tashi
But this Parent Distribution does not contain any information about why p(D=Retreat) = 0.6

That's correct.

My point is that hypothesis tests are usually set up so that "rejecting" the null hypothesis leads the reader to accept some interesting theory - like "left handed people are more likely to toss a coin so that it lands tails".

If we "reject" the hypothesis that the retreats occur independently and at random in the bins, the theory that we accept is something like "the probability of retreats is affected by wind speed and direction in some manner". But that theory could describe a situation where the bins that had most of the retreats were scattered over the various directions. So if you want statistical evidence in support of a specific theory like "wind in the direction of retreat for an edge increases the probability of a retreat", you'd have to do further statistical analysis.

The simplest way to argue in favor of a theory that predicts a pattern is to state the theory as a predictive model and then quantify how it fits experimental data.

Applying statistics is subjective. In certain fields of study, certain statistical methods become traditional. For example, if you are writing a report for a journal, the editors and referees of the journal will have particular tastes about statistics, and it's best to look at papers that have been accepted and papers written by the editors and referees themselves to see what those tastes are. I don't know anything about traditions for analyzing the effects of winds. My personal taste is that if I saw data like your graph portrays, then I wouldn't care to read more than a sentence about how we reject the hypothesis that wind has no effect on advances or retreats. That hypothesis seems too much of a "strawman" to be taken seriously. I'd want to spend my time reading something that argued in favor of "accepting" a fairly specific causal relation.

If you do want to analyze the null hypothesis that the retreats are tossed randomly into the bins, I think you can use some variation of Pearson's chi-squared test. The scenario is that we have a given number of, say, balls, and a given number of them are marked "R" and the rest are marked "A". We have a given number of bins. Each bin has a certain number of balls that must be put into it (so we don't need to require that they all contain the same number of balls). If we toss the balls into the bins "at random" subject to those constraints, then we can find the distribution of the chi-squared statistic. If the value of the statistic we get from the observed data is in the very improbable region of this distribution, we "reject" the null hypothesis.
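
The balls-and-bins scenario above can be simulated directly instead of relying on the theoretical chi-squared distribution. The following is a minimal sketch of such a permutation test. The function names, the number of shuffles and the example counts are hypothetical, and the test still treats every point as an independent ball, so it does not address the correlation between neighboring edge points.

```python
import random

def chi_squared(retreats_per_bin, sizes, p_overall):
    """Pearson chi-squared statistic over the R and A counts of every bin."""
    stat = 0.0
    for r, m in zip(retreats_per_bin, sizes):
        exp_r = m * p_overall
        exp_a = m * (1.0 - p_overall)
        stat += (r - exp_r) ** 2 / exp_r + ((m - r) - exp_a) ** 2 / exp_a
    return stat

def permutation_p_value(retreats_per_bin, sizes, n_shuffles=2000, seed=0):
    """Toss the R and A balls into bins of fixed size at random and count
    how often the shuffled statistic is at least the observed one."""
    rng = random.Random(seed)
    total_r, total = sum(retreats_per_bin), sum(sizes)
    p_overall = total_r / total            # must lie strictly between 0 and 1
    observed = chi_squared(retreats_per_bin, sizes, p_overall)
    balls = [1] * total_r + [0] * (total - total_r)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(balls)
        start, shuffled = 0, []
        for m in sizes:                    # fill each bin to its fixed size
            shuffled.append(sum(balls[start:start + m]))
            start += m
        if chi_squared(shuffled, sizes, p_overall) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_shuffles + 1)

# Hypothetical counts: four wind bins of 50 points each, 120 retreats overall.
stat, p_value = permutation_p_value([45, 40, 15, 20], [50, 50, 50, 50])
print(stat, p_value)
```

A small p-value here only says that the retreats are not spread over the bins at random. It does not by itself support any particular pattern of wind dependence.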

You might get an impressive graphic by displaying values of chi-square for the bins instead of the values of standard deviations. I'm not an expert in chi-square tests, but I'm sure there are forum members who can guide us.

Stephen! Thank you very much for your reply. I get pretty busy during the week, but I will be replying in detail soon.

Chi-square test! Of course! That shook the dust off an ancient unused neuron in my brain, and I've been studying up on it.