@willard, I get what you're saying and I don't doubt that the process used would have thrown up a different issue. But if they were using a newer version of Excel would we have seen this specific problem?
It may have avoided the problem they had, but what's to say something else wouldn't have fallen over in the process?
IMO, it doesn't matter what program you use. Having a robust process is key: bring the data in, check it; process it, check it; make it into a pretty picture for people to digest, check it; release it.
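That "check it between every stage" idea can be sketched in a few lines of Python. The stage name, the two-row data set, and the `checked` helper are all invented for illustration:

```python
import csv
import io

def checked(stage, rows_in, rows_out):
    """The 'check it' step: fail loudly if a stage silently drops records."""
    if len(rows_out) != len(rows_in):
        raise ValueError(
            f"{stage}: went in with {len(rows_in)} rows, "
            f"came out with {len(rows_out)}"
        )
    return rows_out

# Made-up two-row feed standing in for the real data.
raw = "name,result\nalice,positive\nbob,negative\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Every processing stage gets wrapped in the count check.
cleaned = checked("clean", rows, [dict(r, name=r["name"].title()) for r in rows])
```

If a stage loses half the rows, the pipeline stops there instead of publishing wrong numbers.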
Excel is the de facto tool; everyone has it on their PC and, with a bit of training, can navigate around it.
I used to do a lot of Matlab scripting and was once asked if it could be done in Excel. I politely told said manager that Excel was akin to trying to knock a building down with a pencil. Once it was all processed, checked, calculated, and checked again, it was then fired at Excel, as that was found to be the easiest way to manipulate maps/calibrations without faffing with licences for other stuff.
Sounds like they were going from CSV -> XLS -> Postgres. Nobody in their right mind would design this up front: not only is the XLS bit a pointless step, XLS files are a mare for licensing, platform, and library support. There are also a gazillion things that will quite happily do CSV -> Postgres for you in a reliable/validated/audited way.
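A minimal sketch of that direct CSV -> database path, with an in-memory SQLite database standing in for Postgres and a made-up two-row extract (a real pipeline would use Postgres's own `COPY` or a proper loader, but the shape is the same):

```python
import csv
import io
import sqlite3

# Invented CSV extract; sqlite3 stands in for Postgres here.
raw = "name,result\nalice,positive\nbob,negative\n"
rows = list(csv.DictReader(io.StringIO(raw)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, result TEXT)")
conn.executemany("INSERT INTO results VALUES (:name, :result)", rows)

# The audit step: the database must hold exactly what the CSV held.
db_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
assert db_count == len(rows)
```

No spreadsheet in the chain, and the row count is verified at load time rather than discovered later.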
Assuming it is not total incompetence (a big assumption), this is the kind of setup that only comes about because it evolves over time. That I can completely understand, tbh. Still, I expect they are getting some investment here sharpish.
Haven't read the whole thread, but has anyone suggested adapting Pokemon Go?
In the meantime some consultant with excel, python and possibly R will have knocked up a quick and dirty solution in a couple of days which answers the questions being asked by people in boardrooms who don’t care about IT. Guess which one they went with?
Pretty much sums up my career path: I create scripts which solve problems deemed too complex / expensive to fix, and then my scripts become the official 'solution' and take on a life of their own. 120k lines of VBA and 50k lines of bash later, I have a full-time job maintaining them 😉
I basically specialise in inappropriate software solutions...
You guys are funny.
Hope you have got your tenders in for the new customs system that needs to be built. They would have snapped your hand off, “three months, easy”.
Depending on the robustness required, I suspect a specialist company would find this easy. It's probably the paperwork and bureaucracy that would take the longest. In one of the collaborative projects I worked on, the data analytics took longer than it should have, not because of the system itself (that had been ready for a while) but because the content/variables hadn't been designated clearly. From the info on the website, every row of the data set is something like:
"name" "region1" "region2" "region 3" "test result" "date tested" "date test result" "hospital detail 1" "hospital detail 2" "hospital detail 3"
I'm no expert in data science, but the method they used is just one step above printing all the sheets, putting them into a binder, and counting manually. This is why people are so frustrated: there is a whole spectrum of options between what they did and a fully functional, deluxe, industry-standard cloud database (like the ones described above). The data is not John Doe's favourite colours; the data concerns a global pandemic.
Next thing we know, the Excel spreadsheet isn't backed up to the network and is opened from a USB drive.
P.S. Please look at this PHE website and tell me that this is all built on Excel:
You could use Excel to classify the data in the parameters given. It's easy to write a script in whatever program/language to import data from Excel. As I said earlier, Excel is good at being a jack of all trades; however, for this case it should not have been used.
I'm getting far too wound up about this mess....
MongoDB because the data is “big data” (complex and not very relational).
Hmm, yeah... with map/reduce jobs for the tracking-and-tracing part. I don't think that would take very long to implement, tbh. I wonder if I could knock something up as a demo?
Once the minimum viable product was up and running, then you could do other fancier stuff with the data since it's in a generic form.
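A sketch of that map/reduce idea in plain Python, counting positives per region over a few invented records. MongoDB's map functions emit key/value pairs the same way, so the shape carries over:

```python
from collections import defaultdict
from itertools import chain

# Invented records standing in for test results.
records = [
    {"region": "north", "result": "positive"},
    {"region": "north", "result": "negative"},
    {"region": "south", "result": "positive"},
]

def map_phase(rec):
    # Emit (key, value) pairs, as a Mongo map function would.
    yield (rec["region"], 1 if rec["result"] == "positive" else 0)

def reduce_phase(pairs):
    # Sum the emitted values per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

positives_by_region = reduce_phase(
    chain.from_iterable(map_phase(r) for r in records)
)
```

Once the data is in that generic key/value form, the fancier aggregations are just different map and reduce functions.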
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
Give me a few hours and I'll have a prototype in Excel up and running.....
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
A temp on minimum wage cutting and pasting into Excel?
There are a lot of ways, depending on the formats each end will export and accept.
JSON, AJAX, CSV, XML... You can go for a totally custom format and put it over a network using a REST API if you want, but it relies on people having a clear idea of what they want to do and how they want to do it; all have potential issues and advantages.
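For illustration, here is one invented record serialised three of those ways using only the Python standard library (the field names are made up):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

record = {"name": "Jane Doe", "result": "negative"}  # invented fields

# JSON
as_json = json.dumps(record)

# CSV (header row plus one data row)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

# XML
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")
```

The serialisation is the easy bit; the hard part, as the thread says, is both ends agreeing on which of these to use and what the fields mean.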
The example @footflaps gives is pretty much spot on for a lot of software that evolves over time from "handy script for chopping small bits of data" to a major part of back office. People _know_ that it needs a proper solution, but the cost of making the change, of ripping it all out and replacing it with something that is designed for the expanded requirements, is far bigger than keeping one person in support hell. Hell, I've seen quotes of 18 months elapsed for a team of people to replace one shitty broker. A TEAM. Not just 18 months of dev effort, A TEAM.
JSON, AJAX, CSV, XML… You can go for a totally custom format and put it over a network using a REST API
Lovely. Very modern.
Now what if the lab's system runs on software that is 10 years old and was never designed to make REST calls to the internet? In fact, the design brief at that time was that it should be isolated from the internet due to the nature of the data.
See where I am going with this?
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
I use lab equipment for my PhD. My results go from the inevitable Excel output file straight into a suite of Matlab scripts that I wrote. I've also built a Matlab database, which then exports into R for analysis*. I'm someone with no computer training, this just seemed like a way to save time** and avoid errors in my PhD. I naively thought that people doing more important things might have robust data systems in place, especially given the large proportion of the population who seem to work 'in IT'.
* Via csv file, admittedly.
** Classic coding 'timesaving' by spending weeks writing scripts to save a few minutes per experiment. I'll 'break even' in about 10 years I reckon.
But if they were using a newer version of Excel would we have seen this specific problem?
As I understand it, it was more the Excel file format they were using (XLS rather than XLSX); even with a current version of Excel you can save as legacy XLS if you want. Generally you only do that for specific backwards-compatibility requirements. That said, the government agency I work in is using Excel 2010 but has the default file format set to XLS (don't ask). I struggle to believe PHE are using Excel 2003 (the last version before XLSX), but maybe they are...
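For reference, legacy .xls caps a worksheet at 65,536 rows, while .xlsx allows 1,048,576. A sketch of the guard an export step could run before saving (the function name is mine, not from any real library):

```python
# Worksheet row limits per Microsoft's documented format specs.
XLS_MAX_ROWS = 65_536       # legacy .xls (Excel 97-2003)
XLSX_MAX_ROWS = 1_048_576   # .xlsx (Excel 2007 onwards)

def fits_in_sheet(row_count, fmt="xls"):
    """Refuse up front rather than let a save silently truncate the data."""
    limit = XLS_MAX_ROWS if fmt == "xls" else XLSX_MAX_ROWS
    return row_count <= limit
```

The reported failure mode was exactly this: data past the .xls limit silently dropped on save, with no error anyone acted on.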
Lovely. Very modern.
Now what if the lab’s system runs on software that is 10 years old and was never designed to make REST calls to the internet? In fact, the design brief at that time was that it should be isolated from the internet due to the nature of the data.
See where I am going with this?
That is what ETL is for. Regardless of whether the data source is a legacy on-site system, CSV files, or nice JSON via an API, it will get it into your data warehouse / lake.
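A toy sketch of the three ETL stages in Python, with invented data and a plain list standing in for the warehouse:

```python
import csv
import io

def extract(raw_text):
    """E: pull rows out of whatever the source can produce -- here a CSV
    dump, since the (hypothetical) legacy lab system has no API."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """T: normalise into the warehouse schema."""
    return [
        {"name": r["name"].strip().title(), "result": r["result"].lower()}
        for r in rows
    ]

def load(rows, warehouse):
    """L: append to the target store (a plain list stands in here)."""
    warehouse.extend(rows)
    return len(warehouse)

warehouse = []
load(transform(extract("name,result\n alice ,POSITIVE\n")), warehouse)
```

The point of the separation is that only `extract` has to care how awkward the source system is; the rest of the pipeline sees clean rows.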
Everyone has to get out of the mindset of what they use and what is feasible in the lab or their office. Look at what Facebook, TikTok or Snap do. Whether you like them or not, they deal with much more complex challenges, and their management of data is pretty public and easy to replicate. Pretty sure if the Gov called an architect or data engineer at any of them saying they needed a little help, they would have sketched out what to do on a napkin over a cup of coffee, for free, to help save the country and all.
What's doing the E for Extract bit? The devil is in the details.
Unless it's E for Excel, which is what they are doing.
ETL = excel, truncate and load 🙂
Legacy lab software designed to handle large data sets is fine even with no API.
The point here is that Excel should not be anywhere in the chain of a system meant to track the population of a country.
There are so many well known cases of large scale blunders caused by Excel. Here are a few for entertainment: https://www.bbc.com/worklife/article/20130903-five-awkward-spreadsheet-snafus
I’m bailing on this thread. Hope someone at the gov may find it useful. I need to focus on actual cycling when I visit this place 😂😂
It's not even Excel at fault. Whatever solution they use, no matter how (in)appropriate, should have had a test process around it which spotted that half the data points vanished between steps x and x+1...
There never was any problem with any of the hardware or software.
The problem was totally and completely in the wetware.
Always has been, always will be.
ETL = excel, truncate and load
😀