April 30, 2024

The mountain of shit theory

Uriel Fanelli's blog in English


Today’s review(s).

I decided to write a (s)review of a book I happened to read out of pure curiosity, since in IT today those who do not produce data do not exist, and if you work with "big" companies you are, whatever you do or don't do, dealing with significant amounts of fantastic data (and where to find it).

The book, which I recommend buying only if a nuclear war has caused a toilet paper shortage, is the following.

[cover image of the book under review, captioned "fluff"]

This book contains every problem that data science aims to solve, presented as if those problems were beautiful, desirable and useful things. It is, in fact, an exaltation of "cooked data".

To explain it to you I will use an anecdote.

In 2018 a certain Italian CEO called me and other seniors (translation: old guys) to build a big data analytics system, for one reason: he only ever received cooked data, i.e. the kind of data that, according to this book, is the "right" kind because it doesn't "discriminate".

He received data every so often, in the form of presentations. As CEO of a holding company that at the time owned ~40 local telcos, he had a problem: according to the presentations, every telco was doing WELL. They were all wonderful, they all did more and better, and for less money. Customers grew and were more and more satisfied.

The problem was that the financial reports said the opposite. The company was racked by unnecessary expenses and failed projects and was getting worse and worse.

So the CEO (himself!) asked us for one thing: take EVERY piece of data from EVERY device owned by the company, put it into a big Hadoop cluster, and make sure he received the data "as a service": when he needed to know something, he would ask one of his personal assistants to build the relevant mathematics on a nice Tableau server, and get the graphs, tables and numbers he wanted.

Previously, each team leader (or "manager") produced a report in Excel, which was then translated into PowerPoint and "presented" to his superior, who aggregated everything into another PowerPoint and presented it to HIS superior. And after six or seven of these merges, the company looked wonderful.

This happened because, in order not to "bore" the superior with decimals, decimals were removed, with bizarre rounding. Data was combined in inexplicable ways. That is, it was cooked: "cooked" is IT jargon for exactly this kind of data. And this is the data that the author of the booklet above likes.
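
To make the mechanism concrete, here is a minimal sketch (a hypothetical seven-level reporting chain with invented figures, not the company's real numbers) of what happens when every layer "helpfully" drops the decimals in the favorable direction:

```python
import random

random.seed(42)

# Invented failure rates (%) for 128 hypothetical teams.
rates = [random.uniform(2.0, 9.0) for _ in range(128)]

def roll_up(values, cook):
    """Merge reports pairwise, layer by layer, like the PowerPoint chain.
    With cook=True every manager drops the decimals before passing the
    number up -- always rounding down, since bad news 'bores' the boss."""
    while len(values) > 1:
        merged = []
        for i in range(0, len(values), 2):
            chunk = values[i:i + 2]
            avg = sum(chunk) / len(chunk)
            merged.append(int(avg) if cook else avg)  # int() drops decimals
        values = merged
    return values[0]

print(f"honest roll-up: {roll_up(rates, cook=False):.2f}% failures")
print(f"cooked roll-up: {roll_up(rates, cook=True)}% failures")
```

Seven levels of "rounding for readability" are enough to substantially shrink a failure rate without anyone ever typing a false number.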


As you can understand, even for a team of 40 people the beginning was a nightmare. When we asked the contact people in the national telcos what data they had and where they kept it, the effect was distressing. We ended up logging into EVERY server via ssh and checking which files were open for writing, to understand what logs were being written, and then checking on the individual systems what databases existed and what schemas they had. Three months just to understand what data there was, and how to manage it according to the "Privacy by design" documentation. It had to be done, and it was a nightmare.
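
A minimal sketch of that discovery step, assuming passwordless ssh access and hosts with lsof installed (the hostnames here are invented):

```python
import subprocess

# Hypothetical host inventory; the real list came from each telco.
HOSTS = ["telco-a-app01", "telco-a-db01", "telco-b-app01"]

def files_open_for_writing(host):
    """Return paths of regular files a host currently holds open for
    writing -- a cheap way to find logs nobody bothered to document."""
    out = subprocess.run(
        ["ssh", host, "lsof", "-nP"],  # -n/-P: skip DNS and port lookups
        capture_output=True, text=True, check=True,
    ).stdout
    paths = set()
    for line in out.splitlines()[1:]:          # skip the header row
        cols = line.split()
        if len(cols) < 9:
            continue
        fd, ftype, name = cols[3], cols[4], cols[8]
        if "w" in fd and ftype == "REG":       # open for writing, regular file
            paths.add(name)
    return paths

for host in HOSTS:
    for path in sorted(files_open_for_writing(host)):
        print(f"{host}\t{path}")
```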

So: it's easy to say "data". You won't find any trace of this in the book, because evidently the lady who wrote it has never WORKED in the sector (she's a journalist and "contract lecturer"). Yet the selection of data, that is, deciding what counts as a "piece of data", is crucial. And the answer is "EVERYTHING". If a piece of data exists, then it is a piece of data. Period.

Why? Because if we forget to count a piece of data, we forget to protect it, and if we forget to protect it, the data breach will end up in the newspapers and we will pay a fine of up to 4% of turnover. The law says so, and that law descends from the GDPR. So the answer to the journalist is clear: which data counts as data? All of it. BUT she doesn't work in the field, she TALKS about the field.

However, once we had the data, classified it, decided which to aggregate and which to keep raw, and consequently which to anonymize and which to pseudonymize (the journalist doesn't even mention any of this in the book; apparently for her data is a magical concept, magically manipulated by magical "algorithms"), we began to do what the data processor, that mythical figure foreseen by the GDPR, actually does.

For those not in the profession, an aggregate figure is a figure that does NOT report any specific identity, such as "Romagna butchers cheat on their wives 30% of the time". The alternative is raw data, i.e. PII, which says "the butcher Ivo Balboni cheats on his wife with the baker across the street in via Scappavia", and which allows both Ivo and the baker to be identified.
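
For those who want it concrete, here is a minimal sketch of the distinction, with invented people and a made-up key (an illustration, not the pipeline we actually ran):

```python
import hashlib
import hmac

# Invented raw records (PII): each row identifies a person.
RAW = [
    {"name": "Ivo Balboni",  "town": "Romagna", "cheats": True},
    {"name": "Ugo Verdi",    "town": "Romagna", "cheats": False},
    {"name": "Gino Bianchi", "town": "Romagna", "cheats": False},
]

KEY = b"rotate-me-and-never-put-me-in-hadoop"  # hypothetical secret

def pseudonymize(record):
    """Replace the identity with a keyed hash. Whoever holds KEY can
    re-link it, so the GDPR still treats this as personal data."""
    token = hmac.new(KEY, record["name"].encode(), hashlib.sha256).hexdigest()[:12]
    return {"id": token, "town": record["town"], "cheats": record["cheats"]}

def aggregate(records):
    """Drop identities entirely and keep only a statistic: this is the
    'Romagna butchers cheat 30% of the time' kind of figure."""
    rate = sum(r["cheats"] for r in records) / len(records)
    return {"town": records[0]["town"], "cheating_rate": round(rate, 2)}

print([pseudonymize(r) for r in RAW])  # pseudonymized: re-linkable with KEY
print(aggregate(RAW))                  # aggregated: no identity survives
```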

However (I don't know if it still counts as Big Data today), we were pulling in around 16-17 PB per day. Let's say "quite a lot".

But the funny thing was the dichotomy between those who think like the journalist and the technicians, who have in mind the measurable reality, which coincidentally is the only reality, because if a reality is not measurable it belongs in the chapter titled "The Tooth Fairy".


So we had a political clash between two philosophies: that of the technicians, whom the CEO paid because he wanted the truth, and a political approach, that of "cooked" data.

I say "political" because the data was cooked for a political purpose

  • look how cool we are, don't cut our budget
  • look how cool we are, promote us
  • look how cool we are, pay us the bonus
  • look how cool we are, endorse us in the internal media

each of these actions is political, because assigning budgets and promoting are purely political actions: the allocation of resources (the budget) is probably one of the most political functions of a human group.

As a result, ALL the groups we contacted to get the complete list of data asked us for an IN-PERSON meeting (i.e. they came to Germany in person, hoping to avoid minutes of meeting or being recorded), with the exact same questions.

  • Yes, buuuut… what do you do with the data?
  • We make it available to the CEO, who uses it to set our KPIs.
  • Yes, but… setting KPIs like this is dangerous, because the CEO doesn't know many things.
  • The things the CEO needs to know are in the data, in fact.
  • Well, no. For example, the CEO will see how long it takes us to resolve a priority-one ticket, but the LOVE we put into it does not appear in the data.
  • I don't think that's a datum. And I don't think it's relevant.
  • But that's not true, customer satisfaction is achieved by doing everything with love for the customer. We are "Customer Obsessed". BUT this doesn't show up in the data.
  • I don't think the law allows us to keep mental-health records in our Hadoop.
  • Yes, but then the CEO only sees how much time we put in, not the LOVE we put in.
  • Life is hard. Yep. But death is worse, they say.

I called them Catbert meetings because in the end we had to be the Catberts of the situation.

The request, in other words, was to send the data to them FIRST, so they could see what we were (in their minds) secretly telling the CEO. In reality the data was available to them too, on request, but according to them it was hidden. That is, the CEO was stealing something from them by directly accessing data that was "theirs".

In short, we had TWO mentalities:

  1. that of us technicians, for whom a piece of data is the record of an event that actually happened, or at least of the fact that the event actually happened.
  2. that of the politicians, for whom a piece of data exists if and only if its disclosure supports a given political thesis, i.e. "my team deserves money".

When you do this in a political group (and EVERY human group is necessarily political), the result is not just "panic". The result is an attempt to water down or modify the data itself, or to find a "political" reading that "justifies" it.

For example, one of these local telcos claimed it had 9 million users, when there were actually seven million: the rest were SIM cards that had never entered the network. Since they never complained, they counted as satisfied customers. Once those two million SIMs were removed, things changed a lot, and the customers weren't so happy.
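
A minimal sketch of that cleanup, assuming a hypothetical subscriber list and a set of SIMs that produced at least one network attach event (identifiers invented):

```python
# Hypothetical inputs: in reality the subscriber list came from billing
# and the attach events came from the radio-network logs in Hadoop.
subscribers   = {"sim-01", "sim-02", "sim-03", "sim-04", "sim-05", "sim-06"}
ever_attached = {"sim-01", "sim-03", "sim-04", "sim-06"}  # seen on the network

ghosts = subscribers - ever_attached  # never entered the network
active = subscribers & ever_attached

print(f"claimed users: {len(subscribers)}")
print(f"actual users:  {len(active)}")
print(f"ghost SIMs:    {len(ghosts)}  (they never complain, hence 'satisfied')")
```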

Ditto for some systems, one of which I remember because its only real users were the testers. To make sure the system worked well, a crafty manager had hired a company that did external monitoring (fine, so far) by simulating real users. It simulated a lot of them, to also check the reaction under stress, the behavior by geographical area of origin, and so on. And then the test users were never deleted. So the system had lots of users, in ever-growing numbers. Once we removed the test users, only company employees remained, and they had been invited to use it. Wow.

Several situations also emerged at the borders. Take the local telco of nation A and the local telco of nation B, where A claimed better customer satisfaction than B. Too bad that along the border between A and B there were lots of B's SIMs parked at night, i.e. people who lived in nation A but had bought nation B SIM cards.

The cooked data, in other words, was mendacious. And it was mendacious because it was made to influence political decisions (internal to the company, but still political). After all, the company had more employees than San Marino has citizens, so to deny the politics of internal decisions we would have to deny that there is politics in San Marino.

Spoiler: there is.


Columbro's book is simply an exaltation of cooked data, plus a pseudo-technical explanation built on hackneyed examples.

Inside the company, data was cooked to satisfy the political thesis "my group deserves more budget and more promotions"; in Columbro's book, data should be "cooked" so that it "does not discriminate", that is, so that it satisfies a political thesis about what the numbers of a given group are allowed to say.

The problem with cooked data is that data does not live on its own: it gets combined. In the case I'm telling you about, I got to see the anti-physical "negative latency", because someone had built his own table of cooked data (using a caching proxy with a timeout) to demonstrate that his system was very fast.

The problem is that this data was later reused together with other data, and negative latencies appeared. It took a couple of weeks to discover that they had put a caching proxy in there, with a nice timeout.
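
A minimal sketch of how the anti-physical number appears when two logs that were never meant to meet get joined (timestamps invented):

```python
# The raw log recorded when the backend received each request; the
# 'cooked' table recorded when the caching proxy sent the response.
backend_received = {"req-1": 100.000, "req-2": 100.050, "req-3": 100.110}

# req-2 was served from the proxy's cache (or by its timeout fallback),
# so its response left *before* the backend ever saw the request.
proxy_responded = {"req-1": 100.020, "req-2": 100.030, "req-3": 100.140}

for req, received in backend_received.items():
    latency_ms = (proxy_responded[req] - received) * 1000
    flag = "  <-- anti-physical" if latency_ms < 0 else ""
    print(f"{req}: {latency_ms:+.0f} ms{flag}")
```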

The aim was to be able to say "but my numbers, which come from the same Big Data, say something different". It's a technique for polluting analytics: create your own data, then claim the official processing is wrong.

The trouble is that this has real side effects. For example, it creates "shamanic" systems that appear to know the response two seconds before the request arrives, simply because someone took some cooked data (= computed without the proxy logs involved) and decided to make his own numbers.

Likewise, I can take data cooked in the social sense, following Columbro's ideology, and give an example.

So: in the USA, prisons are full of people of color. They are 17% of the population, but in prison the percentage is double. According to Columbro, this data discriminates against blacks. So we should cook it, taking into account N things that are not actually measurable in the specific case, and say that no, the blacks really in prison are still just 17%.

Thus we have satisfied the political principle. Fine. The trouble is that sooner or later someone who has to draw up the budget for prisons and the budget for social services might decide to use the cooked data. If you ask someone to define the budget for blacks in prison, they will immediately ask: "but how many are there?"

The trouble is that he will get an optimistic figure, one that saves face, and then leaves those same people without a budget for social services.
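
A minimal worked example of that mechanism, with every figure invented for illustration (including the per-person cost):

```python
# Invented figures, for illustration only.
inmates_total          = 50_000
group_share_prison     = 0.34    # actual share of inmates
group_share_population = 0.17    # the 'corrected', non-discriminating share
cost_per_person        = 4_000   # hypothetical yearly social-services budget

actual = int(inmates_total * group_share_prison)      # people who exist
cooked = int(inmates_total * group_share_population)  # people on paper

shortfall = (actual - cooked) * cost_per_person
print(f"people needing services:          {actual}")
print(f"people the cooked data admits to: {cooked}")
print(f"yearly budget shortfall:          {shortfall:,}")
```

The cooked figure flatters the group on paper and defunds it in practice.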

Moral: cooking data is dangerous, because you never know where it will end up, and you cannot stay consistent with everyone. The only thing "consistent with everyone" is the reality of events, AND THEREFORE the best policy is not to cook the data at all.

This is not to say that there aren't scientists, and even paid technicians, who cook data: whoever put a proxy with a timeout in place to falsify the system's performance WAS a technician. But he didn't understand that data can be reused in a context where the attributed meaning, the bias, is exactly the opposite; or in contexts where the "correction" produces mathematical disasters, such as negative latencies or divisions by zero.


The strategy of people like Columbro, however, is to argue that there is good data and bad data, that is, that all data is wrong, and that the data is fine only if a person from the RIGHT political side corrects it, for the RIGHT reasons.

In practice it is a field that practices transcendental aesthetics: they enunciate a thesis that pleases because it is aesthetically beautiful. Then the data arrives and disproves the thesis.

Scientific culture holds that if the data does not confirm the theory, then the theory is wrong. But they like the theory. So, faced with the discrepancy between data and theory, rather than correcting the theory they propose to correct the data.

And they present it, garnished with premature supercazzole (meaningless pseudo-technical babble), as a scientific operation.

And this is why I advise you against buying the book in question:

  • there is no science.
  • there is no mathematics.
  • there are many terms used inappropriately, starting with "algorithm", scattered haphazardly to make people believe there is science and mathematics.
  • it teaches nothing about how data should be processed.
  • Columbro in fact shows that she does NOT know what "data" is.

Buy it only if there is no toilet paper within a 200 km radius.

It also has a "pro": it makes you understand how journalists, with the pretense of producing "statistics", cook up the bullshit you read in the newspapers.

Uriel Fanelli


The blog is visible from the Fediverse by following:

@uriel@keinpfusch.net
