The survey industry.

Especially when elections are approaching, the newspapers bombard people with "polls", to the point that, reading them, you would think there is a whole survey industry behind them, whose purpose is to "statistically" demonstrate whatever the client wants to be told. Many wonder how statistics (which is mathematics) can be perverted into such blatant errors, and there are several techniques.

"There are several techniques" does not mean that statistics is a fallacious science. On the contrary, it is an exact science. The problem is that it is possible to play on theorems and formulas unknown to the public, and especially on concepts the public may well know, while few reflect on the formulas behind them.

First, however, I will give a brief overview of the basic rules of a sample estimate, so that we are aligned; then we will look at a couple of the more common methods used for doctored surveys: overrepresentation and "malicious" aggregation. I will not touch concepts like "spectrum" or "probability density", because I want to make clear how little there is to believe in the polls.

Almost everyone believes that a survey is a so-called "sample estimate". It is a process in which a sample of the population is taken and the prevalence of some datum is measured, for example by asking a question like "Is it right to beat your fiancée?" (any reference to ridiculous statistics is purely coincidental).

Now, the point is very simple: how big should a sample be? The answer is that… it depends on what we are measuring. The "simple" formula to calculate the sample for an estimate over very large populations is this:

n = Z² · p · (1 − p) / e²

Where:

  • e is the margin of error you want.
  • p is the standard deviation you find, or, if you prefer, the prevalence you expect.
  • Z is a confidence index.

So: the margin of error, I think, is a clear concept. If you want to measure, say, one percent, you need a margin of error much smaller, and by a lot, than that percentage point. Otherwise the margin of error becomes larger than what you are measuring.

The standard deviation is the spread between the values you want to measure. If we are talking yes/no, then "in a nutshell" you can identify it with the expected prevalence. If, for example, 99% say it is not right to hit your girlfriend and 1% say yes, that is different from a situation like 50.1% versus 49.9%, for which you need a MUCH bigger sample.
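A minimal sketch of the point above, with illustrative numbers: the p·(1 − p) term is what makes a 50/50 race so much more expensive to measure than a 99/1 landslide, at the same absolute margin of error. The function name is my own.

```python
def variance_term(p):
    """Binomial variance term p*(1-p) that drives the required sample size."""
    return p * (1 - p)

landslide = variance_term(0.99)   # 99% say "no": 0.0099
tossup = variance_term(0.501)     # 50.1% vs 49.9%: ~0.25

# The toss-up needs a sample roughly 25x larger for the same margin of error.
print(landslide, tossup, tossup / landslide)
```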

Z is an index that comes from the confidence you want to have. "Confidence" is not assurance that the estimate is correct, but assurance that the sample is large enough to measure what you want. In general, tables are used, such as:

  • 80% confidence ==> Z = 1.28
  • 85% confidence ==> Z = 1.44
  • 90% confidence ==> Z = 1.65
  • 95% confidence ==> Z = 1.96
  • 99% confidence ==> Z = 2.58

As you can see, to know what sample you need, you need to know what deviation to expect. But before you have made the estimate, you don't know it. So you can set the deviation to 0.5, calculate the sample, do your interviews, and then, depending on the standard deviation you actually find… redo the calculation to get the right sample.

There are also more complex formulas, for example when populations are small, but with 60 million individuals we need not go into detail. What I wanted to show is that the calculation of the sample is itself rather "tricky", because the sample is calculated first, then the work starts, and then (maybe) it is recalculated.

It is up to you, when you read a survey, to understand whether the sample was sufficient, and with what confidence. Because if they don't tell you, say, with what confidence they built the sample, in practice they are telling you that the survey was done with "a number of people that is MAYBE sufficient".

When, for example, they publish the polls of two parties clashing in Emilia Romagna that are "neck and neck", so to speak, they must have a very LARGE sample, or a very low confidence. Or a tremendous margin of error; but since the two factions are head to head and it takes little to flip the result, that must be small. So they can ONLY play on confidence.

It is relatively common in the polling world to sell products with confidences of 60%, and even 55%. This is to dramatically reduce the sample, and therefore the costs (and times) of the survey itself.

At a guess, a poll like that for the Emilia elections should have involved something in the order of twenty thousand people, at MINIMUM. Otherwise, either the margin of error is very large, or the confidence is very low indeed. With smaller samples you cannot get numbers that are both precise (if you quote the first decimal digit, you are telling me how precise you claim to be) and of high confidence.
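You can also run the formula backwards: given the sample size a newspaper brags about, what margin of error does it actually buy? A minimal sketch, with illustrative numbers (a typical "1000 interviews" poll at 95% confidence in a 50/50 race); the function name is my own.

```python
import math

def margin_of_error(z, p, n):
    """Margin of error e = Z * sqrt(p * (1-p) / n) for a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

e = margin_of_error(1.96, 0.5, 1000)
print(round(e * 100, 1))  # about 3.1 percentage points

# So quoting "44.3%" with one decimal digit from such a sample is theatre:
# the error bar is some thirty times wider than that last digit.
```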

Rarely, in the surveys you read in the newspapers, do you see the confidence and the margin of error cited; at most they tell you how large the sample was. Spoiler: even the more complex methods are affected by many factors, which make further abuse possible.

But now you will say: "hey, my survey had the right sample!" Good. Even granting that you had adequate confidence and an acceptable error, we are still not done.

We should talk about multinomial distributions and other devilry. Since you want a simple explanation, let's take an example instead. We want to know which artisans working in homes (electricians, masons and plumbers) think that housewives are unfaithful. (Since this blog is hated by feminists for its sexism, let's have some fun.)

The problem is this:

  • Plumbers think, at 70%, that housewives are unfaithful.
  • Masons think the same, but only at 55%.
  • Electricians, a category forgotten by straight porn, think they are very faithful: 15%.

Now, the question would be: "we decide how many electricians, how many plumbers and how many masons to call". Someone will naively say that we should split equally in three: why should plumbers count less than electricians?

The problem is that, in absolute terms, electricians in the population number 3000, plumbers 1000 and masons 15000. We find ourselves with two problems: if we interview 500 plumbers, 500 masons and 500 electricians, we are over-representing the plumbers and the electricians, and under-representing the masons.

But if we want an accuracy of 1% and we go by proportions, calling 300 electricians, 100 plumbers and 1500 masons, the opinion of the plumbers risks becoming almost irrelevant (spoiler: it is).

The composition of the sample, that is, becomes crucial if we want to avoid cherry-picking.

Our "survey" agency, therefore, does nothing but "know" in advance what the opinion of each interviewed category is, and "dose" them in the sample to obtain the results it wants. If it wants sluttier housewives it just adds plumbers to the sample; if it wants more chaste ones it just adds electricians.

The sample is always the same size, confidence is OK, margin of error OK; but simply by changing the composition of my sample I got the results I wanted, provided I know that plumbers think the worst and electricians the best.
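The composition trick above can be sketched directly, using the made-up per-trade rates from the example: same total sample size, very different headline. The function name is my own.

```python
# Made-up "yes" rates per trade, from the example above.
rates = {"plumbers": 0.70, "masons": 0.55, "electricians": 0.15}

def headline(sample):
    """Weighted 'yes' percentage for a given sample composition."""
    total = sum(sample.values())
    yes = sum(sample[k] * rates[k] for k in sample)
    return round(100 * yes / total, 1)

# Proportional to the population (3000 / 1000 / 15000, scaled by 1/10)...
proportional = headline({"electricians": 300, "plumbers": 100, "masons": 1500})
# ...versus a plumber-heavy sample of the exact same size (1900 interviews).
plumber_heavy = headline({"electricians": 100, "plumbers": 1000, "masons": 800})

print(proportional, plumber_heavy)  # same n, very different story
```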

But let's continue with the example: we also know that the youngest artisans rate housewives as more unfaithful. So now the same problem arises, but in TWO dimensions: not only the trade, but also the age. And if we go further, we find that it also depends on weight, because obese plumbers think housewives are faithful, while athletic ones do not. A matter of attraction.

So, if we are good statisticians, what we have to do is choose a sample that is fairly balanced with respect to ALL the partitions that show differences in distribution. There are tomes and tomes of statistical techniques for dealing with these cases, and in some fields, such as medical statistics, without ρ you get nowhere. But this is not the case with polls. No one has ever published the ρ of a sample.

You immediately understand how EASY it is to select a sample simply by calling at one time of day rather than another. But we can also select "city" versus "countryside", so to speak, or "customers of a cheap telco" versus "customers of an expensive telco", if we know the telco market well.

But let's go further: even so, the numbers we get may be ugly. And the customer wants them pretty, because he has to throw them in the newspaper. What can we do?

Nowadays, these "surveys" and these "reworkings" are done on computers, and someone writes the algorithms. We need a way to write an algorithm that is "wrong" but apparently correct. How do we do it?

Well, at this point we turn to the Holy Aggregation, also called Saint GroupBy.

With GROUP BY we can, during the calculation, quietly inflate the numbers in play. Let's take an example.

We have a group of ten people whom we ask whether they will vote for A or B. The split is 50/50. A nice headache. Or not.

Let's say 6 out of 10 are graduates, and 7 out of 10 are married. All we have to do is group the vote with a nice "aggregation" by marital status and by education, and we get a total of thirteen votes (six plus seven), and at that point one of the two candidates wins. It is "only" a matter of using categories that are not strictly orthogonal, or complementary (depending on how the calculations are done).
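A minimal sketch of "Saint GroupBy" with made-up data matching the text: 10 voters split 5/5, of whom 6 are graduates and 7 are married. The overlap between the two categories is my own invention, chosen to show how the trick can flip a dead heat.

```python
from collections import Counter

# (vote, graduate, married) for ten hypothetical voters: 5 A, 5 B,
# 6 graduates, 7 married.
voters = [
    ("A", True,  True), ("A", True,  True), ("A", True,  True),
    ("A", True,  True), ("A", False, True), ("B", True,  False),
    ("B", True,  False), ("B", False, True), ("B", False, True),
    ("B", False, False),
]

honest = Counter(v for v, _, _ in voters)  # each voter counted once

# The trick: add the group-by-education tally to the group-by-marital-status
# tally. Married graduates are counted twice, single non-graduates not at
# all: 6 + 7 = 13 "votes" out of 10 people.
rigged = Counter(v for v, grad, _ in voters if grad)
rigged += Counter(v for v, _, married in voters if married)

print(honest)  # A: 5, B: 5 -- a dead heat
print(rigged)  # A: 9, B: 4 -- a landslide out of thin air
```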

This last trick, the sum(groupby(something)), was used in the past too, but it has become really fashionable in big-data systems, where the "data scientist" wants to multiply loaves and fishes a little. For those who go to check the accounts, verifying the orthogonality of every group subjected to some sum is not simple. If programming languages like Spark are used, it is a matter of debugging the code; if it comes to graphical tools (like Tableau or Datameer), the problem is gigantic, because the thing gets lost in all the technicalities of the graphical interface. Such a trick can go unnoticed for years.

There are many other tricks, like "left" joins where there should be "inner" joins, and other technicalities that a savvy "data scientist" can use to introduce errors that become very difficult to notice for anyone who did not write the algorithm. But the point is that, often, nobody goes to check the algorithm.
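The left-versus-inner join slip can be sketched in plain Python rather than SQL or Spark; the tables and numbers here are entirely made up. The effect of the "wrong" join is to silently change the denominator of every percentage computed downstream.

```python
# Made-up tables: "answers" has every respondent, "screening" only the
# ids that passed validation.
answers = [
    {"id": 1, "vote": "A"}, {"id": 2, "vote": "A"},
    {"id": 3, "vote": "B"}, {"id": 4, "vote": "B"}, {"id": 5, "vote": "B"},
]
screening = {1, 2, 3}

# Inner join: only validated respondents survive.
inner = [r for r in answers if r["id"] in screening]

# "Accidental" left join: every respondent survives, validated or not.
left = list(answers)

pct_a_inner = 100 * sum(r["vote"] == "A" for r in inner) / len(inner)
pct_a_left = 100 * sum(r["vote"] == "A" for r in left) / len(left)

print(round(pct_a_inner, 1), round(pct_a_left, 1))  # 66.7 vs 40.0
```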

By this I mean a very simple thing: you could believe in surveys, but only if they behaved exactly as one does in the world of applied sciences:

  • The raw data are published: the data actually used, that is, everything known about the sample.
  • The methodology used for the calculations is published in full.

Failing that, it is still possible for someone with an "eye" to see, just by looking at the numbers, that a certain statistic is not possible, or that the result (at the stated precision) has no statistical meaning. For that, it is enough that the "survey" stays online with its few numbers, without any explanation of the methodology, there just to impress.

For example, the numbers given yesterday by an "Istat survey" on the subject of violence against women make no sense, because (see the discussion above about the artisans) on a topic as generic as "women" there are too many variables at play: sex, age, social class, schooling, geographical origin, marital status, profession, political affiliation, sexual preferences, the same factors for the family of origin, etc. Unless the sample was HUGE, providing percentages with a digit to the right of the decimal point is "at least suspect" when the questions are so generic and the partitions so numerous.
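A back-of-the-envelope sketch of why so many partitions demand a HUGE sample. The number of levels per variable and the 30-respondents-per-cell rule of thumb are my own hypothetical choices, not from any real survey: the point is only that the cell count multiplies.

```python
from math import prod

# Hypothetical level counts for the variables listed above.
levels = {
    "sex": 2, "age_band": 6, "social_class": 3, "schooling": 4,
    "region": 5, "marital_status": 4, "profession": 10,
    "politics": 5, "orientation": 3,
}

cells = prod(levels.values())
per_cell = 30  # a bare-minimum headcount per cell, as a rule of thumb

print(cells, cells * per_cell)
# Hundreds of thousands of cells -> millions of interviews before one
# decimal digit of honest precision across every partition is plausible.
```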

Such statistics would be credible ONLY if the raw data, and the calculations used, were published.

Without these conditions, we can safely say one thing:

The XYZ survey is pure HOT AIR.

For any XYZ in the set of surveys.

