This morning I came across this article (via Metafilter) by Paul Waldman, editor-in-chief of the Gadflyer, about how nationwide polling works. Waldman brings up some good points that a lot of people miss…mostly very basic statistics stuff like how to evaluate a poll using the margin of error. He also explains the theory behind polling, which is what I’m the most interested in:

As far as the poll is concerned, your opinions have been taken into account, by someone just like you. The essence of survey sampling is that you don’t actually have to interview everyone to get a good idea of what everyone thinks. As long as everyone has an equal chance of being included, we’ve created a “random” sample, which is the essence of good survey design.

Waldman does not go into how the margin of error is determined. I believe I learned how when I was studying statistics in high school, but I couldn’t remember, so I went searching for more information. I found this article by Matthew Mendelsohn and Jason Brent at Queen’s University in Ontario (side note: Anne of Green Gables went there, I think, back when it was Queen’s College!). It’s brief but very informative, and it includes a general way to calculate the margin of error. Their description of the margin of error struck me, though, because it made me understand the concept differently than Waldman’s article did:

Most polls report a “margin of error”. If a poll reports a margin of error of “3.1%, 19 time out of 20,” this means that if you were to conduct exactly the same poll at exactly the same time again (you would end up surveying different people, however) 95% of the time (19 times out of 20) the results would be within 3.1%, up or down. So, if you repeat a poll one month later and find results that differ from your previous results by more than 3.1%, you can be “95% sure that public opinion has shifted.” It is still possible that public opinion has not shifted: 1 time in 20 you will receive a result that differs from your previous result by more than 3.1% even though opinion has been stable. This is commonly referred to as a “rogue poll.” This does not mean that the poll was poorly done; it is simply the case that on the basis of chance, 1 poll in 20 will differ by more than 3.1% — but it usually won’t differ by much more than 3.1%.

“Margin of error” assumes that the sample is a random, representative sample of the population. It also assumes that the questions were appropriately worded and that interviewing was of a high quality. “Margin of error” therefore is only a statistical calculation based on probability and the size of the sample; it says nothing about the quality of the poll itself.
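
To make the “3.1%, 19 times out of 20” figure concrete, here is a rough sketch of the usual textbook calculation. It assumes a simple random sample and a worst-case 50/50 split of opinion; the function name and the 1.96 multiplier (the standard normal value for 95% confidence) are illustrative choices, not something taken from either article.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample.

    proportion=0.5 is the worst case; z=1.96 corresponds to "19 times out of 20".
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Compare a few sample sizes; the margin shrinks slowly as the sample grows.
for n in (100, 500, 1000, 2000):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} points")
```

With 1,000 respondents this works out to about 3.1 points. Halving the margin takes roughly four times as many respondents, since the sample size sits under a square root.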

What I was interested to know at this point was how they determine that, as Waldman says, “everyone has an equal chance of being included”. I figured that pollsters must use some sort of categorization, like income levels, location, maybe even skin color and sex…so, I wondered, what are the categories, and how specific do they get, and how do they know that there is an equal chance for people from all categories to be picked? If the poll covers 1000 people, then do the pollsters assign 1000 categories to the people of the United States?

It’s a little more complicated than that. David Ropeik at MSNBC explains the process like so: first, pollsters take all phone numbers in the US. (All of them? Even unlisted numbers and cell phone numbers? He doesn’t specify. Fortunately, Gallup does; see below.) Then they “stratify” the numbers by geographical area:

Do you just pick 1,000 phone numbers completely at random? No, because there are different voting patterns by region and by state. So pollsters determine, from previous elections, how many people vote in each region of the country. Twenty-three percent of voters are in the East, 26 percent in the South, 31 percent in the Great Lakes/Central region, 20 percent in the West.

So you want to make sure that 23 percent of your 1,000 phone calls, 230, go to states in the East. Another 260 calls will go to the South, 310 calls will go to the central region and 200 calls will go to the West. Pollsters also break down the voter turnout by state, and make sure each state gets the appropriate number of calls.
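
The regional arithmetic above is simple proportional allocation, and can be sketched like so. The regional shares are the ones Ropeik quotes; the function and variable names are made up for illustration.

```python
# Regional shares of voters, as quoted by Ropeik.
REGION_SHARE = {
    "East": 0.23,
    "South": 0.26,
    "Great Lakes/Central": 0.31,
    "West": 0.20,
}

def allocate_calls(total_calls, shares):
    """Split total_calls across regions in proportion to each region's share."""
    return {region: round(total_calls * share) for region, share in shares.items()}

calls = allocate_calls(1000, REGION_SHARE)
print(calls)
print("total:", sum(calls.values()))
```

These shares divide 1,000 evenly. With messier shares, rounding each region separately could leave the total a call or two off, so a real allocation would need a rule for handing out the remainder.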

Here comes the interesting part. In order to categorize the responses, pollsters also collect demographic information during the poll. Ropeik says:

After a poll is done, the initial results are grouped by these demographic categories. Let’s say that of the people responding to a poll, only 40 percent are women. The pollster adjusts the results from women up, and the results from men down, until they accurately match the American population. If only 2 percent of the respondents were Hispanic, the pollster juggles the Hispanic response up, and the other groups down, until everything matches “real life.” They adjust all their findings to accurately match America’s demographics in categories of age, race, religion, gender, income and education.

It may sound like a less-than-random tinkering with the numbers. But remember, everybody out there had their chance to be called when those random phone numbers were picked. These adjustments are done to more accurately reflect all the subparts of the overall universe of voters. You might call this fudging the numbers. Pollsters call it “weighting.”
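
As a minimal sketch of what “weighting” means mechanically, here is the women/men example worked through for a single category. Only the 40 percent figure comes from Ropeik; the population shares and the support numbers are invented purely for illustration, and a real poll adjusts for several categories at once.

```python
# Hypothetical numbers throughout: only the 40% figure comes from the quote.
sample_share = {"women": 0.40, "men": 0.60}       # who actually responded
population_share = {"women": 0.52, "men": 0.48}   # assumed target shares
support = {"women": 0.55, "men": 0.45}            # made-up share favoring a candidate

# Each respondent in a group counts for (population share / sample share).
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

unweighted = sum(sample_share[g] * support[g] for g in sample_share)
weighted = sum(sample_share[g] * weights[g] * support[g] for g in sample_share)

print(weights)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

In this made-up case each woman’s answer counts 1.3 times and each man’s 0.8 times, which nudges the overall result from 49.0% to 50.2%. The weighting changes how much each answer counts; it does not add any new respondents.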

Weighting makes a certain sort of sense when you think of it as making the poll results match nationwide demographic data. But think of it this way: whenever you inflate the numbers for a certain demographic, you are projecting the opinions of a very small percentage of that demographic onto the whole group. I’m not sure that this can be considered mere “fudging” anymore…it seems a little too inaccurate. Do you truly have a random sample of each demographic? No, what you have is a random sample of the United States.

It should be obvious by now that you can’t use a nationwide poll to gauge how Hispanics or women are voting. You would have to know exactly how many respondents were Hispanic or female, and you’d have to calculate the margin of error based on those numbers, before you could make any claims. The margin of error would likely be so high that you couldn’t make any claims at all. In order to evaluate a demographic subset, you’d have to take a completely separate poll!
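
To put rough numbers on this, here is the same textbook margin-of-error formula applied to subgroups of a 1,000-person poll. The subgroup sizes follow Ropeik’s hypothetical percentages (40 percent women, 2 percent Hispanic); the formula assumes each subgroup behaves like a simple random sample of its own population, which is itself a generous assumption.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error, worst case (50/50 split)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Subgroup sizes in a hypothetical poll of 1,000 respondents.
subgroups = {"all respondents": 1000, "women (40%)": 400, "Hispanic (2%)": 20}

for label, n in subgroups.items():
    print(f"{label}: n={n}, +/- {margin_of_error(n) * 100:.1f} points")
```

The 400 women come out at about plus or minus 4.9 points, which is usable, but the 20 Hispanic respondents come out at roughly plus or minus 21.9 points, which is too wide to support any claim.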

However, I’m starting to think that doing separate polls for each demographic would be the best way to go. Poll men, women, Hispanics, African Americans, etc., separately, and then combine the responses into one huge aggregate. This would still be inaccurate (how many of the women you polled were African American, for example?), but it would get closer to the “random” sample that statistics require.

Ropeik included a rather flippant explanation of random sampling, involving a batch of 100 marbles:

If you are really random about the way you pick your batch of marbles, 95 times out of 100, your batch will accurately represent the whole collection. Statisticians have fancy numbers to prove this is true. Decades of polling experience backs them up.

Well, gee, as long as they’re “fancy” numbers. As far as “decades of polling experience” go, well, if they do the same thing the same way for years and years and get the same kind of results, I’m not sure why they’re surprised.

Obviously, that was an oversimplified answer, a response-in-kind to Ropeik’s oversimplified explanation. I want to know exactly how and why these decades of experience have caused them to believe in their statistical techniques. Ropeik seems to want us to accept that they know what they’re doing on blind faith…and, indeed, this is the point at which most explanations falter or gloss over the process.
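
One way to test the “95 times out of 100” claim without taking it on faith is to simulate it. The sketch below assumes a very large collection of marbles that is exactly half red, draws batches of 100, and counts how often a batch lands within the textbook margin of error of the true share. The function name, seed, and number of trials are arbitrary choices for illustration.

```python
import random

def coverage(true_share=0.5, batch_size=100, trials=10_000, seed=1):
    """Fraction of random batches whose red share falls within the 95% margin."""
    rng = random.Random(seed)
    margin = 1.96 * (true_share * (1 - true_share) / batch_size) ** 0.5
    hits = 0
    for _ in range(trials):
        # Each draw is red with probability true_share (a very large collection).
        reds = sum(rng.random() < true_share for _ in range(batch_size))
        if abs(reds / batch_size - true_share) <= margin:
            hits += 1
    return hits / trials

print(f"batches within the margin: {coverage():.1%}")
```

The result should land in the neighborhood of 95%, a little under in this setup because a batch of 100 can only hit whole-number counts. Note what this does and does not show: it supports the arithmetic for a truly random draw, and says nothing about whether a phone poll is a truly random draw.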

Ropeik described the process as starting with phone polls. However, as Waldman mentions, many people polled on the phone don’t respond. (Ropeik explains it as follows: “It takes 7,000 to 8,000 phone numbers to get 1,000 useful responses. Some numbers aren’t working. At some, no one answers. And only a third of the people who answer agree to participate.”)

Gallup polls are probably the most trusted and respected polls in the US. But even they have their issues. They claim (as of 1997) that 95% of Americans have telephones, so now all their polls are conducted by phone…and, further, they say:

In the case of Gallup polls which track the election and the major political, social and economic questions of the day, the target audience is generally referred to as “national adults.” Strictly speaking the target audience is all adults, aged 18 and over, living in telephone households within the continental United States. In effect, it is the civilian, non-institutionalized population. College students living on campus, armed forces personnel living on military bases, prisoners, hospital patients and others living in group institutions are not represented in Gallup’s “sampling frame.” Clearly these exclusions represent some diminishment in the coverage of the population, but because of the practical difficulties involved in attempting to reach the institutionalized population, it is a compromise Gallup usually needs to make.

They do not say what percentage of the population they are leaving out by polling this way. They do explain, however, how they pick their household phone numbers:

In the case of the Gallup Poll, we start with a list of all household telephone numbers in the continental United States. This complicated process really starts with a computerized list of all telephone exchanges in America, along with estimates of the number of residential households those exchanges have attached to them. The computer, using a procedure called random digit dialing (RDD), actually creates phone numbers from those exchanges, then generates telephone samples from those. In essence, this procedure creates a list of all possible household phone numbers in America and then selects a subset of numbers from that list for Gallup to call.

While they have eliminated the problem of unlisted numbers, I still have to trust somebody’s computer program. How do they determine which phone numbers are attached to residences? How do they allow for cell phones? Sean and I don’t even use our land line phone–it’s installed, but we don’t have a phone plugged into it. Sean’s parents don’t have a land line phone at all. Are we just weird exceptions, or is this a growing trend? If the latter, how do pollsters deal with it? (Of course, this article is from 1997. They may have new procedures not outlined here.)

Gallup does have a very interesting random selection process that occurs once a household is reached (emphasis and typo Gallup’s):

Once the household has been reached, Gallup attempts to assure that an individual within that household is selected randomly – for those households which include more than one adult. There are several different procedures that Gallup has used through the years for thiswithin household selection process. Gallup sometimes uses a shorthand method of asking for the adult with the latest birthday. In other surveys, Gallup asks the individual who answers the phone to list all adults in the home based on their age and gender, and Gallup selects randomly one of those adults to be interviewed. If the randomly selected adult is not home, Gallup would tell the person on the phone that they would need to call back and try to reach that individual at a later point in time.

I really have no problem with this, or with Gallup’s question-asking process. Their methodology seems to be the best it could possibly be in these areas.

However, I found this interesting:

Once the data have been weighted, the results are tabulated by computer programs which not only show how the total sample responded to each question, but also break out the sample by relevant variables. In Gallup’s presidential polling in 1996, for example, the presidential vote question is looked at by political party, age, gender, race, region of the country, religious affiliation and other variables.

So even Gallup falls into the trap of analyzing the demographics, rather than simply providing the answer to the question that started the poll. In the case of a nationwide poll concerning the presidential candidates, the answer would be “blahblah percent favor Kerry while blahblah percent favor Bush”. That is all you can say definitively. To gauge the opinions of a particular demographic, you would have to focus on polling only that demographic in order to get a random sample of that demographic’s opinions. How can you say you have a representative sample of a certain demographic when your poll results come from the entire country–and when you’ve inflated or deflated that demographic’s results to match the nation? If someone has a good answer to this question, I’d like to hear it.

Until I know more about the process, my position on polling is still “skeptical”. The demographic weighting doesn’t sit right with me, and polling by phone skews the data towards 1) people who have phones, 2) people who actually answer the poll, and, in Gallup polls “which track the election and the major political, social and economic questions of the day”, 3) people who live in “households”.

An interesting project for the future might be to obtain a full Gallup poll, as they are “public domain”, and do my own analysis.