Any properly validated GDP compliant computer system would have been a ****ing start.
Or even a ****ing SOP with the line limit defined in it.
Pretty sure Azure have a GxP compliant package or guidelines on how to implement a GxP validated package.
I submit data from our laboratory to PHE on a regular basis, not covid related.
Our lab system runs an SQL which dumps a load of data into Excel. I "clean" the data, convert to CSV then upload to the PHE portal. This is a system they've had in place for years so I assume they've just modified it to accept covid related stuff from the lighthouse labs.
It's worth remembering that although there's a lot of good lab technicians out there, there's a lot of experts, there's a lot of knowledge and there's a lot of people grafting, we're all just plebs (well, some of us!) I'm my lab's IT rep, so things like data uploads are my responsibility. I know nothing about computers! No formal training, just do it because the previous guy retired and it fell to me. It's no surprise to me that something like this should go tits up.
I know nothing about computers! No formal training, just do it because the previous guy retired and it fell to me. It’s no surprise to me that something like this should go tits up.
If you guys were audited by the MHRA you'd get murdered to death. I would love to be the auditor doing the rounds around your lab.
WHOOOOOOOOOOOOOOOOOOOOOOOOOOOO! We're all going to die.
I’m no IT expert but haven’t there been a lot of problems rolling out such systems in the public sector?
I know for a fact that fire rescue has no central/national database and neither does town planning. All their data resides at the local level on a hotchpotch of paper and electronic records. It also makes FOI requests problematic because they can claim compiling the data imposes an unreasonable cost.
On town planning, I would suspect it's a way for local authorities to hold more power. On fire rescue I have no idea. A challenge that such systems would face is dealing with historical standards of how data was compiled and stored; that could be a massive ball-ache. I don't think this applies in this case, since I assume they are being sent consistent .csv files. Given the time and money, this blunder seems completely inexcusable.
If Serco created the process and back-end then they’d still likely have been tied into a formal proposal, budget, design and change implementation approval process, that can take months even for relatively small projects (I work for an external IT provider to a government agency, not Serco though :p ).
I'd like to know what ****ing experience Serco has of maintaining GxP compliant databases.
Our lab system runs an SQL which dumps a load of data into Excel. I “clean” the data, convert to CSV then upload to the PHE portal. This is a system they’ve had in place for years so I assume they’ve just modified it to accept covid related stuff from the lighthouse labs.
Pretty common exchange mechanism - we use it a lot in telecoms. Our system has DB type A, their system has DB type B. Simplest export / import is a CSV file. I write a script to dump out the data as CSV and ssh it onto their Oracle server and they write a script to suck it in from CSV.
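If anyone's curious, that kind of dump script is only a handful of lines. A minimal sketch in Python - sqlite3 from the standard library standing in for whatever DB type A really is, and the table/column names are made up:

```python
import csv
import sqlite3

# sqlite3 stands in here for whatever "DB type A" actually is -
# a real system would use the appropriate driver (cx_Oracle, pyodbc, etc.).
conn = sqlite3.connect("lab_results.db")
cur = conn.cursor()

# Hypothetical table and columns - purely illustrative.
cur.execute("SELECT sample_id, test_date, result FROM results")

with open("results_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "test_date", "result"])  # header row
    writer.writerows(cur.fetchall())

conn.close()
# The file then gets pushed to the receiving server (scp/sftp or similar),
# where the other side's import script reads it back in.
```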
@molgrips Serco isn't a gov dept and not all gov depts operate in the way you describe. GDS for all its failures is at least facilitating better ways of working.
Contrasting this debacle is what HMRC is doing. A mate of mine is the PO on the team that stood up the Eat Out to Help Out systems (register, find a participating restaurant, and claim) in three 2-week sprints.
Parts of the civil service, with the aid of switched-on suppliers and contractors, do have the skills and experience to do this sort of thing properly; Serco don't.
Simplest export / import is a CSV file.
That’s completely normal. It’s the “load it into excel and get a human to clean it up”, which is the broken link in the chain.
Our lab system runs an SQL which dumps a load of data into Excel. I “clean” the data, convert to CSV then upload to the PHE portal.
I hope you use control totals and sanity check the exported data. Excel can have a mind of its own, even with .csv
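A control total doesn't need to be anything clever either. A rough sketch, assuming hypothetical file names and a numeric result column, comparing row count and a simple sum before and after the Excel round trip:

```python
import csv

def csv_stats(path, value_column):
    """Return (row_count, control_total) for one CSV file.

    The control total is just the sum of a single numeric column -
    enough to spot silently dropped or mangled rows.
    """
    rows, total = 0, 0.0
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            rows += 1
            total += float(record[value_column])
    return rows, round(total, 2)

# Hypothetical file names and column name - purely illustrative.
before = csv_stats("export_from_lims.csv", "result_value")
after = csv_stats("cleaned_for_phe.csv", "result_value")

if before != after:
    raise SystemExit(f"Mismatch: before={before}, after={after} - do not upload")
print(f"OK: {after[0]} rows, control total {after[1]}")
```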
Was reading an article earlier today about it. My first thought was 1 million rows in Excel. I'm sure it used to be much smaller than that. But years ago we ran into a similar issue when someone used Excel to open up a big CSV file. Think it was a SNOMED dataset or similar. Simple way round it was to create an SSIS package to import it into SQL server.
Sounds like they've struggled to homogenise a series of disparate data sources and haven't thought about volumetrics.
TBH I've never been able to fathom why the row limit on Excel is so high. I know it's the de-facto tool to use for any sort of data and usually used incorrectly. But a limit of say 10,000 rows would be a good artificial limit to force people to put data into a more appropriate system.
Our lab system runs an SQL which dumps a load of data into Excel. I “clean” the data, convert to CSV then upload to the PHE portal. This is a system they’ve had in place for years so I assume they’ve just modified it to accept covid related stuff from the lighthouse labs.
Get SQL to output it as a CSV file rather than an Excel file that you then have to clean. You should be able to use a simple view or suchlike to do the cleaning for you, instead of doing it by hand in Excel.
⌨⌨⌨⌨⌨⌨⌨
🐒🐒🐒🐒🐒🐒🐒
Thought I had a solution, but my monkeys won't align with the keyboards when scaled up 🙁
A mate of mine is the PO on the team that stood up the Eat Out to Help Out systems (register, find a participating restaurant, and claim) in three 2-week sprints.
You're right, and the last two gigs I went on were run like this so things are changing. Thank god.
Get SQL to output it as a CSV file rather than an Excel file that you then have to clean. You should be able to use a simple view or suchlike to do the cleaning for you, instead of doing it by hand in Excel.
And here is the problem. Who is going to do that? Is there access to the system to do that? Is the vendor that wrote/installed it able to make these changes (their guy quit years ago and they don't make enough money to hire a new one)? Etc. Etc.
Multiply by all the labs out there and .csv it is.
I had an issue recently importing data into SQL Server (SSIS). The import would work for files under X MB but fail for anything bigger. It turned out to be a permission that had to be granted on the default user's temporary internet files directory. My understanding is that the file size it would accept came down to how much memory the server had, so it would be different on another server with more or less memory!
This was my first thought when I heard it was to do with the size of an Excel file lol.
I find the biggest issue with Excel, especially if macros are involved, is if somebody is on a different version. Defunct formulae on newer versions or a limited number of rows in older ones can throw a right spanner in the works.

I find the biggest issue with Excel, especially if macros are involved, is if somebody is on a different version. Defunct formulae on newer versions or a limited number of rows in older ones can throw a right spanner in the works.
I have a VM with an extra OS build / Office combo to test for such things. Still doesn't catch everything - antivirus apps hate VBA, and different sysadmins lock down machines in different ways, blocking access to various directories or moving default locations around. Always some machine it won't work on. Russian users with Cyrillic alphabets were a PITA to support.
Why does the data need cleaning in the first place? Surely data validation at input should remove that need. The fact they are using Excel to move patient-identifiable data around is ridiculous and provides another good reason not to give them any personal data.
This is ‘chickens coming home to roost’ for the DHSC/NHSE/I/NHSDigital etc etc. Excel is the tool of choice for them and poor NHS Trusts are driven mad by the many, many data returns they have to make on Excel spread sheets. It’s got particularly bad during COVID as all sorts of non-job departments want some information to crunch so they have something to do when they have got bored of Netflix. Hopefully, this debacle might change all of that but.
It’s 2020; the answer to any data question you ask is not Excel but PowerBI 😉
According to the BBC:
The issue was caused by the way the agency brought together logs produced by the commercial firms paid to carry out swab tests for the virus.
They filed their results in the form of text-based lists, without issue.
PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.
The problem is that the PHE developers picked an old file format to do this - known as XLS.
As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.
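The galling bit, if that's right, is how little code it would have taken to fail loudly rather than silently drop rows. A hedged sketch of the sort of guard that was missing (the two row limits are real Excel format limits; the per-case figure is invented):

```python
# Worksheet row limits for the two Excel file formats.
XLS_MAX_ROWS = 65_536       # legacy .xls (the format PHE reportedly used)
XLSX_MAX_ROWS = 1_048_576   # modern .xlsx

def check_fits(row_count, fmt=".xls"):
    """Refuse to write more rows than the target format can hold."""
    limit = XLS_MAX_ROWS if fmt == ".xls" else XLSX_MAX_ROWS
    if row_count > limit:
        raise ValueError(
            f"{row_count} rows won't fit in a {fmt} sheet (limit {limit}); "
            "anything past the limit would be silently lost."
        )

# Invented figure: several rows per test result, as the BBC piece describes.
ROWS_PER_CASE = 45
check_fits(1_500 * ROWS_PER_CASE)   # raises instead of silently truncating
```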
Yes but just imagine the moment where the penny dropped and the poor guy responsible realised. Oh to have been a fly on the wall. I mean, I once ‘accidentally’ took out a large part of the mobile network of a European capital city for an hour because I thought I was typing in another window. How we laughed (later, much later)
If that BBC report is accurate, it sounds like it was an old crappy system that's been there for ages. I can't imagine anyone using Excel in this way in the last 10 years. But who knows eh!
The problem is that the PHE developers picked an old file format to do this – known as XLS.
As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
That's interesting and means the Guardian were wrong, although in both scenarios the issue was the sheet maxing out on rows. In that sense, the Guardian was half right.
Of course, using the legacy .xls format was only half the problem, because Excel is very much a quick and dirty 'database' or intermediary between databases. Perhaps understandable given the rushed nature of track and trace, but otoh Boris boasted about it being world-class?!
But a limit of say 10,000 rows would be a good artificial limit to force people to put data into a more appropriate system.
And that is in MS's interest why? What are the chances of them going to SQL Server?
There is a good reason it went up from 65k rows to a million and a bit.
Serco provide testing at a few sites; they don't collate the data. There are numerous companies involved in the entire process.
If Excel is the basis of a world-class track and trace system then presumably Hornby are building HS2.
Essentially someone has tried using a hammer on a screw. It's the wrong tool for the job. End of discussion.
Personally, I blame Tory voters.
Why be in the EU for consistent standards across several jurisdictions?
Let's **** our NHS instead...
To all those commenting about the ineptitude involved in using Excel for this, the start of this seems to be a CSV file which is not Excel, but a simple file format that was created before personal computers and Excel (1972 according to wiki).
Twisty seems close to the money shot with;
Involving manual processes like emailing around CSVs
Databases can import/export CSV files (as can Excel), so there is a good chance that Twisty has also got close to the bullseye with;
It wouldn’t surprise me if they were using an xls file which has a 65,536 row limit, rather than xlsx which has 1,048,576 limit mentioned in the Guardian.
EDIT - But I really like Pyros post 😀
To all those commenting about the ineptitude involved in using Excel for this,
CSV is a perfectly valid and robust data format that I use daily. And it doesn’t have a 65536 row limit.
As I said way up there the problem isn’t the tools, it’s the lack of understanding.
Most data people will have had that sinking feeling when you realise you've ****ed something up. I bet someone had a really bad Friday afternoon at PHE when they noticed that all the labs were reporting exactly the same number of test results...
As I said way up there the problem isn’t the tools, it’s the lack of understanding.
If you're using Excel as a database, or an intermediate/pseudo database, it isn't the best choice because it doesn't offer much in the way of validation/safeguarding of data. This is one reason they came unstuck. Systems entirely reliant on expert user inputs aren't a good choice for most applications. People are fallible.
Two pages in and no mention of PostgreSQL. Very disappointing.
Without wanting to defend the govt, and especially not wanting to offend fellow IT professionals, I suspect I know what happened. Some project manager somewhere will have asked 'What do we need to build to store all testing data for the whole country and provide real-time reporting capability which can be analysed by a multitude of agencies?'. The answer will have been a scalable cloud based data warehouse solution with real-time reporting, support for machine learning and statistical modelling, complex ETL and data QA processes and all the other bells and whistles. Then they'd have asked how long will it take (assuming money no object, which seems like a safe assumption)? The answer will have been at least 12 months to collect requirements, architect, build and test an alpha version. In the meantime some consultant with excel, python and possibly R will have knocked up a quick and dirty solution in a couple of days which answers the questions being asked by people in boardrooms who don't care about IT. Guess which one they went with?
MongoDB because the data is "big data" (complex and not very relational).
The issue with Excel was not a rows issue but a columns issue (each subject case entered as a column rather than a row).
Regardless of whether they were using Excel to store primary data or not, that such a key piece of infrastructure is so flimsily constructed is shocking beyond belief. (See also RBS/NatWest offshoring support for 30-year-old servers - then discovering nobody knew how the things worked. That's why their ATM network goes titsup periodically.)
The correct answer is:
1. Database (Google Cloud, Amazon AWS or Microsoft Azure - can be simple MySQL / S3), data streams or simple files
2. ETL (e.g. Talend / Microsoft / SAP / Fivetran)
3. Data Warehouse / Lake (e.g. Snowflake / Google BigQuery / Amazon RDS)
4. Visualisation / Analytics layer - the bit the muppets should be using to view, extract and cut data instead of running pivot tables in Excel (e.g. Looker / Tableau / Qlik)
3 months' work at normal pace; with budget it could be pulled off in one month.
Scalable system that can be used to model all things Covid until the end of time for around the same price the government paid for their Excel spreadsheet.
Every half decent tech co works like this. Any recent grad from a half decent comp sci course doing data at any half decent co would know.
One analyst call with any of the leading tech agencies would have pointed them this way.
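To make it concrete, here's a toy sketch of the "land the CSVs in a real database and query that" idea - sqlite3 from the Python standard library standing in for the warehouse layer, with made-up file and column names; any of the cloud options above would follow the same shape:

```python
import csv
import sqlite3

conn = sqlite3.connect("test_results.db")   # stand-in for the warehouse layer
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (sample_id TEXT, lab TEXT, result TEXT)"
)

# Land every lab's CSV in one table - no row limits, no manual cleaning in Excel.
for path in ["lab_a.csv", "lab_b.csv"]:      # hypothetical file names
    with open(path, newline="") as f:
        rows = [(r["sample_id"], r["lab"], r["result"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
conn.commit()

# The visualisation / analytics layer then queries this, instead of pivot tables.
for lab, count in conn.execute(
    "SELECT lab, COUNT(*) FROM results WHERE result = 'positive' GROUP BY lab"
):
    print(lab, count)
```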
Pay peanuts, get monkeys.
Oracle Life Science Database engine with a SAS analytics engine pulling data via SQL. R is for amateurs. You get what you pay for.
Software that is so old it should be on a beach in Marbella!
https://twitter.com/drrosena/status/1313066062284500993?s=21
Oracle Life Science Database engine with a SAS analytics engine pulling data via SQL. R is for amateurs. You get what you pay for.
I don't think the government IT budget would have allowed for that. **** Oracle and their ridiculous licensing terms! We've just deployed a 19 host VCF cluster, we had to deploy a separate 3 host vSAN cluster for Oracle DBs alongside it as even though the Oracle DBs would have only run in one of the VCF workload domains we would have still needed to licence all 22 hosts, which would have added £15m to the project cost (3 times the combined cost of the rest of the software and the hardware).
I think you misunderstand the “IT budget” that is at stake here. Compared to the opportunity cost of propping up U.K. PLC, the IT cost (and vaccine) pales into insignificance.
Have to take the longer view. And do it properly.
Looking at past datasets, there was a marked improvement in quality when they brought in the external experts. Little things like variable names were improved.
You guys are funny.
Hope you have got your tenders in for the new customs system that needs to be built. They would have snapped your hand off, "three months, easy".
P.S. Please look at this PHE website and tell me that this is all built on Excel:
https://coronavirus.data.gov.uk/
P.P.S. The IP address for that site resolves to Microsoft in Amsterdam. Shock horror Azure.
INRAT but I suspect it is pie-in-the-sky software delivery stuff. As someone who has worked in IT for decades managing the delivery of software projects from expert teams of developers, I think I would stick with Excel.
On town planning, I would suspect it's a way for local authorities to hold more power.
It's because it's been delegated to Local planning authorities since 1947 and nobody has put together the technical framework or law amendments to co-ordinate the data management.
Our lab system runs an SQL which dumps a load of data into Excel. I "clean" the data, convert to CSV then upload to the PHE portal… I know nothing about computers!
The system in the lab isn't providing a fully processed data output. Therefore, it is being manually processed by somebody with a low level of confidence in their own competence and there doesn't appear to be any reviews/checking. Inefficient use of resource and prone to introducing error. I'd hope that this is just reporting data and such a process wouldn't be tolerated if people's health could be impacted if a mistake is made.
Confession: I sometimes open and read IT threads on here that I have no hope of understanding because I find it weirdly reassuring. I didn't intend for this to be one of those threads but it certainly is.
Also, am I right in thinking that this specific problem would not have happened if they were using a newer version of Excel?
Having worked on an NHS project at the start of Covid, the one thing that's not being said is the pressure from on high to deliver results in very short timescales.
All the talk of setting up this in the cloud, configuring that, and bingo - perfect running system - is great, but at the end of the day some people have been set the task of delivering ABC by the next day after they'd already delivered XYZ 24 hours earlier.
Sometimes chucking it into Excel and getting a person to crunch the data is just the way it gets done to meet the deadline - Not that i think it's right 🙂
Try not to think like that @stevious, try to think more about a large government contractor actually using technology that is fit for the purpose it is going to be used for. Even if Excel had the ability to use an infinite number of rows, it is still the wrong choice for this purpose.
I mean, I once ‘accidentally’ took out a large part of the mobile network of a European capital city for an hour because I thought I was typing in another window. How we laughed (later, much later)
I did that once in China in the early hours of the morning. Typed 'shutdown -h now'* in a shell window thinking it was my local console - it wasn't. Cue a very frantic night of finding the guy with the key to the building where I'd just shut down the cell site, going to his house, waking him up at 2am, shoving him in a taxi and driving to 'work', wandering round a huge skyscraper in the dark trying to locate a comms room on the Nth floor. Finally locating the kit and restarting it. All through my assistant, as I didn't speak a word of Chinese and no one, bar her, spoke a word of English.
* This was in the days when if you put RedHat on a laptop it didn't have any integration with power buttons etc and the only way to shut it down was by command line.