Big Data
mogrim (Full Member)
I’m assuming that you’re not heavily involved with BI stuff at the moment? I’d view a foundation in BI as the way to move into something like Big Data. If you have a Java background, some of the end-user BI tools use Java, and a lot of smaller firms like custom development. Certainly Cognos Report Studio is quite lucrative if you can do clever stuff that the standard tool can’t do (god knows why though – it makes upgrading a nightmare). The arse has fallen out of the Cognos market though, with contract rates being pretty low (£300-ish a day for a report writer, if you can find a role). Maybe look at Tableau and QlikView as other possible toolsets that everyone currently wants. Once into reporting, move into the ETL side of things and Bob’s your mother’s brother.
Cheers, that’s pretty much what I was thinking – a combination of courses to get the theoretical background, then moving slowly into the stuff via Java.
Steve77 (Free Member)
I think the best way into it for technical people without much analytics experience is to download R and try and enter some of the Kaggle competitions. You will get an idea of whether you like the work, will make some useful contacts and can put it on your CV if you do well in a few of them.
To me ‘Big Data’ means machine learning and not just standard BI on big datasets.
jam-bo (Full Member)
I think the best way into it for technical people without much analytics experience is to download R and try and enter some of the Kaggle competitions. You will get an idea of whether you like the work, will make some useful contacts and can put it on your CV if you do well in a few of them.
That’s kinda what I had in mind.
wonderchump (Free Member)
For those getting into Data and Analytics I’d recommend brushing up on your maths and stats. A lot of what we do involves regression analysis, stochastic modelling, cluster analysis and sentiment analysis (for customer data), and the mathematical modelling required to deliver the insights is often of greater importance to a client than the toolset. Prototyping in Excel is still accepted, but real understanding of SAS and R is in demand, particularly if you can couple it with a visualisation tool like Tableau, QlikView or SAS VA.
I’m hiring if anyone wants a job – mail in profile. Serious contenders only. Looking for someone who can deliver, evangelise and help me further develop our propositions… £60k to £80k depending on experience, Leeds based.
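As a rough illustration of the regression analysis mentioned above, here is a minimal sketch using Apache Commons Math’s SimpleRegression (from the commons-math3 library); the data points and their meaning are invented for illustration.

```java
// Minimal linear-regression sketch using Apache Commons Math
// (org.apache.commons:commons-math3). The data points are made up.
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class RegressionSketch {
    public static void main(String[] args) {
        SimpleRegression reg = new SimpleRegression();
        // (marketing spend, sales) pairs – purely illustrative numbers
        double[][] data = { {10, 110}, {20, 190}, {30, 310}, {40, 390} };
        for (double[] point : data) {
            reg.addData(point[0], point[1]);
        }
        System.out.printf("slope=%.2f intercept=%.2f r^2=%.3f%n",
                reg.getSlope(), reg.getIntercept(), reg.getRSquare());
    }
}
```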
molgrips (Free Member)
Hah, this tells a story. Looking for big data sets online I found US census data.
1980 – 5 GB
1990 – 50 GB
2000 – 200 GB 🙂
mogrim (Full Member)
For those getting into Data and Analytics I’d recommend brushing up on your maths and stats. A lot of what we do involves regression analysis, stochastic modelling, cluster analysis and sentiment analysis (for customer data), and the mathematical modelling required to deliver the insights is often of greater importance to a client than the toolset. Prototyping in Excel is still accepted, but real understanding of SAS and R is in demand, particularly if you can couple it with a visualisation tool like Tableau, QlikView or SAS VA.
I’ve been doing Coursera courses like mad – definitely need to brush up on stats… R itself, however, is very easy when you’ve got a programming background.
TurnerGuy (Free Member)
Hah, this tells a story. Looking for big data sets online I found US census data.
I suppose you could also factor processing-time requirements into the equation – smaller data sets become ‘big’ when you have time restrictions on analysing them.
Processing seems mostly to be done using MapReduce-type technology (simple key-value stores and eventually consistent storage), with some ETL then needed to get the results into a relational source for BI analysis – or at least that is my impression (a rough sketch of the map/reduce model is below).
O’Reilly have several ‘Big Data’ and data analysis books – if you have an account or are on certain of their email lists, they offer 50% off their ebooks at fairly regular intervals.
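To make the map/reduce model above concrete, here is a minimal Hadoop MapReduce sketch in Java – the canonical word count. It uses the standard Hadoop client API; a real job would also need a small driver class to configure input/output paths and submit it to the cluster.

```java
// Canonical Hadoop word count: the mapper emits (word, 1) pairs on the
// nodes holding the data; the reducer then sums the counts per word.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```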
prettygreenparrot (Full Member)
This:
get clients to think about the outcomes they want from their data
And like some have said or suggested, big data remains just that without the right business thinking and mathematical insight.
Something I hadn’t realised was that I’ve been around data long enough that the things we used to do out of necessity have now turned into an ‘exciting’ industry. Looks like there might be more varied employment opportunities in the future than I’d reckoned on.
Anyone playing with ‘big data’ using Neo4j & Cypher rather than NoSQL?
prettygreenparrot (Full Member)
@mogrim yeees, R is ‘easy’ if you’ve a programming background, but its syntax sometimes seems more cryptic than, say, SAS, C, or Python. Perhaps ‘cryptic’ is harsh, maybe I mean ‘concise’?
footflaps (Full Member)
1980 – 5 GB
1990 – 50 GB
2000 – 200 GB
Is 200 GB really big these days, esp. census stuff where you have all the time in the world to mine it?
molgrips (Free Member)
No it’s not, I was just commenting on how it’s grown over the years. Of course there aren’t that many more people in the US, they’ve just stored more data because they can.
I’ve been working on a fairly academic proof of concept to demonstrate an idea I’ve had to show off our products, and in order to do this I need a sample data set. But in attempting to find the right one I’ve actually demonstrated to myself all the things that have been mentioned in this thread by the analyst folk, albeit in a sort of backwards way – rather than choosing the questions carefully, I’ve already got a style of question and I need to find the data set that will contain non-trivial answers 🙂
prettygreenparrot (Full Member)
@molgrips unsure what your PoC idea is, but the UK open.gov initiative, the Swedish analogues, and the FDA’s data sources are ‘big’ in that there’s a fair volume of data. What I find exciting about them, though, is the ontological challenges even in allegedly the same domain. I’d consider that the biggest ‘big data’ challenge.
molgrips (Free Member)
The idea is business-rules related, with conditions that match items in large data sets. So if it were, say, social media data, I’d want to say ‘collect all the people who’ve mentioned Acme Corp in the last week, and then if any of these people have also complained about CRC, send them an email’, or similar. Or if it were financial transactions I could say ‘if there are any accounts that have had three transactions under £1 from online retailers on this list then raise a flag’, that kind of thing.
It needs to be multiple levels of searching to whittle down a big data set into a non-big subset that I can then work with. I had a look at US Census data but it’s going to take far too long to figure it out. I also want it in some kind of flat file – CSV or something like that.
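A hedged sketch of that multi-level whittling-down, in plain Java over a flat CSV. The file name and column layout (user id, brand mentioned, ISO date) are invented, and this is deliberately the small, post-whittling stage rather than anything distributed.

```java
// Two-stage filter over a flat CSV: first narrow a big extract down to
// recent 'Acme Corp' mentions, then pull out the distinct user ids to
// act on. Columns assumed: 0 = user id, 1 = brand, 2 = ISO date.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

public class RuleFilter {
    public static void main(String[] args) throws IOException {
        LocalDate weekAgo = LocalDate.now().minusDays(7);
        try (var lines = Files.lines(Paths.get("mentions.csv"))) {
            // Stage 1: mentions of Acme Corp in the last week
            List<String[]> recentAcme = lines
                    .map(line -> line.split(","))
                    .filter(cols -> cols[1].equals("Acme Corp"))
                    .filter(cols -> LocalDate.parse(cols[2]).isAfter(weekAgo))
                    .collect(Collectors.toList());
            // Stage 2: distinct user ids to feed into the next rule
            // (e.g. 'has also complained about CRC' would be a further pass)
            List<String> userIds = recentAcme.stream()
                    .map(cols -> cols[0])
                    .distinct()
                    .collect(Collectors.toList());
            userIds.forEach(System.out::println);
        }
    }
}
```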
CharlieMungus (Free Member)
Folks seem to be conflating Big Data and big data sets. Having lots of data which you set out to collect is not Big Data. What separates Big Data from most of the descriptions here is that Big Data is data that is collected incidentally, as a by-product of other business processes.
molgrips (Free Member)
What separates Big Data from most of the descriptions here is that Big Data is data that is collected incidentally, as a by-product of other business processes.
Don’t agree. Otherwise it’d be called Incidental Data not Big Data.
If it’s too big for a relational DB, it’s big and it’s data so it’s Big Data.
whatnobeer (Free Member)
Here’s a review paper by a couple of my colleagues covering some of the differing interpretations of “Big Data”. Their conclusion:
Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:
Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets is a critical factor.
Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.
An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
dazh (Full Member)
If it’s too big for a relational DB, it’s big and it’s data so it’s Big Data.
Where I work a lot of the engineers think that relational DBs have the same limitations as Excel spreadsheets. I actually had an engineer tell me I should be using big data tools for a dataset of a few million records because it was ‘too big for a database’.
CharlieMungus (Free Member)
Don’t agree. Otherwise it’d be called Incidental Data not Big Data.
Yeah, very good point, I hadn’t thought of that!
But then following your logic, it would just be called too much data for a relational DB.
TurnerGuy (Free Member)
But then following your logic, it would just be called too much data for a relational DB.
but what if you set up the data in a relational DB such that it’s structured as just key/value pairs, with the key as the clustering index?
What is the difference between that and a NoSQL database? Not much – just the way the data is queried, not the amount of data that the database can handle.
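A sketch of that point: a relational table used as a plain key/value store, with the key as the clustered primary key. The in-memory H2 database (and its MERGE upsert syntax) is used here only so the example is self-contained; the idea carries over to any relational DB.

```java
// A relational table doubling as a key/value store. In H2 (and most
// engines) the primary key is the clustering index, so lookups by key
// behave much like a KV store's get/put.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class KvOnRelational {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:kv")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE kv (k VARCHAR(255) PRIMARY KEY, v CLOB)");
            }
            // put: upsert one value under a key
            try (PreparedStatement put =
                     conn.prepareStatement("MERGE INTO kv KEY(k) VALUES (?, ?)")) {
                put.setString(1, "user:123");
                put.setString(2, "{\"name\":\"molgrips\"}");
                put.executeUpdate();
            }
            // get: look the value back up by key
            try (PreparedStatement get =
                     conn.prepareStatement("SELECT v FROM kv WHERE k = ?")) {
                get.setString(1, "user:123");
                try (ResultSet rs = get.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```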
molgrips (Free Member)
What is the difference between that and a NoSQL database?
Well, a NoSQL database is backed by a distributed file system, and working on it is a map/reduce job. So you can increase the number of nodes more or less indefinitely and store any amount of data.
Can you put a petabyte of data in a relational DB?
With Hadoop, the data lives on the nodes, and the code is run where the data is. That’s different to a relational DB afaik.
footflaps (Full Member)
Have to agree that size is not all. If you work with GIS tools, high-res maps can be 100s of GB, but the complexity is trivial, so it’s not hard at all to process them…
TurnerGuy (Free Member)
Well, a NoSQL database is backed by a distributed file system, and working on it is a map/reduce job.
that’s an implementation detail of the database – a relational database could store its data in a distributed manner if it was written to do so.
a relational database with a key and a single piece of data is fundamentally the same as most of these ‘NoSQL’ databases.
map/reduce is just a technique – you could use this same technique on key/value data stored in a relational database if you wanted to.
At the end of the map/reduce stages you will also have a lot less data to analyse, and many of the BI tools still run against a relational database, so you will need to ETL into them.
molgrips (Free Member)
Well that’s fine then. Tools for jobs 🙂
Incidentally IBM have an SQL database that’s backed by Hadoop.
TurnerGuy (Free Member)
A problem with all this Big Data stuff, and the consequent explosion in cloud usage that has accompanied it, is the eventual-consistency aspect, which can make writing transactional systems challenging – per the CAP theorem, cloud platforms tend to sacrifice strong consistency for availability and partition tolerance.
I have also seen cases where people have lost data for 24 hours (http://www.stackdriver.com/eventual-consistency-really-eventual/), which doesn’t matter so much when that data is ‘Big’ and somewhat redundant, but does matter in many situations if you ‘lose’ a queued message for a long period.
Microsoft’s Azure seems to me to be a better infrastructure for cloud usage than AWS, as they seem to have put some serious thought into this problem.
nixie (Full Member)
Incidentally IBM have an SQL database that’s backed by Hadoop.
Not sure they do – isn’t it an SQL engine that can query Hadoop?
whatnobeer (Free Member)
Microsoft’s Azure seems to me to be a better infrastructure for cloud usage than AWS, as they seem to have put some serious thought into this problem.
The Azure platform does look pretty impressive. I’ve not played with it, but the infrastructure looks sound and their front-end tools for making use of your data look super easy to use.
molgrips (Free Member)
Big SQL, for Version 3.0 and later, is a massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) cluster.
This SQL engine pushes processing down to the same nodes that hold the data. Big SQL uses a low-latency parallel execution infrastructure that accesses Hadoop data natively for reading and writing.
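From the client side such an engine looks like an ordinary SQL database. Here is a hedged sketch of querying Big SQL over JDBC, assuming the IBM Data Server (DB2) JDBC driver is on the classpath – host, port, credentials and the table are placeholders, not a documented setup.

```java
// Hypothetical JDBC query against a Big SQL head node. The engine
// fans the work out to the HDFS nodes; the client just sees SQL.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BigSqlQuery {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:db2://bigsql-head.example.com:32051/BIGSQL"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT state, COUNT(*) FROM census GROUP BY state")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```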
nixie (Full Member)
Exactly – an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.
TurnerGuy (Free Member)
Exactly – an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.
and what is the difference between that and a proper SQL database that just stores data as a key and one big column holding the data?
At the lowest level of a database you have the leaf nodes of something like a B-tree with a clustering index, which is going to be pretty similar between a relational database and a KV NoSQL database – so the difference comes down to how the management of the leaf-node storage is implemented.
TurnerGuy (Free Member)
Interesting paper on how Azure offers strong consistency for storage writes, which Amazon can’t do:
http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
molgrips (Free Member)
and what is the difference between that and a proper SQL database that just stores data as a key and one big column holding the data?
How many nodes can your DB cluster have? How much data can you manage? When you want to transform data rather than simply retrieve it, how can you do that on all the nodes at once?
Genuine questions btw, not arguing with you.
molgrips (Free Member)
Exactly – an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.
What’s the difference then?
TurnerGuy (Free Member)
Genuine questions btw, not arguing with you.
just trying to say that Big Data is not a ‘revolution’, just an evolution.
The problems of data size and of distributed computing on commodity hardware have meant that the NoSQL databases have all the distributed-filesystem and tool support, but there is little technically to say that normal relational databases couldn’t also have this, if the problem domain demanded it.
But with a relational database you normally have strong consistency, which the whole NoSQL world and its toolsets fail badly at.
I don’t do Big Data but was just looking at cloud technology recently when we were considering how/when/if we might move our system to it.
Azure looks a lot more viable partly due to the above paper.
And you can run Azure in your own data centre, so if your system is transactional, strongly consistent and performance-sensitive, then it looks best to me to architect your system to run on Azure in house, and then do that ‘cloud-bursting’ thing when you need extra capacity.
If you go solely cloud then you have more issues with keeping failover sites/clouds up to date due to latencies across the clouds.
TurnerGuy (Free Member)
When you want to transform data rather than simply retrieve it, how can you do that on all the nodes at once?
you can cluster relational databases now anyway and they will apply the same changes to all databases – even MySQL will do it.
molgrips (Free Member)
OK… but is a clustered DB the same as a Hadoop cluster? I’m not so sure. Does a cluster have the same data on every node?
But with a relational database you normally have strong consistency, which the whole NoSQL world and its toolsets fail badly at.
HBase, the NoSQL DB you are probably thinking of, is in no way a replacement for a relational DB (see the sketch below). I assume you are a DBA – you are talking like someone who feels the market is overlooking the area in which you are skilled… 🙂
just trying to say that Big Data is not a ‘revolution’, just an evolution.
You’ll not hear anything from me to say otherwise. But Hadoop didn’t evolve from relational DBs; it comes from what used to be called parallel computing in the scientific world.
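To make the HBase point concrete, here is a minimal sketch of the HBase Java client API: everything is a get or put against a row key, with no joins and no SQL, which is a big part of why it isn’t a drop-in relational replacement. The table and column names are invented.

```java
// Key/value-style access in HBase: one put and one get by row key.
// Anything that isn't a row-key lookup becomes a scan or a MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mentions"))) {
            // put: one row keyed by user id, one column family "d"
            Put put = new Put(Bytes.toBytes("user:123"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("brand"),
                          Bytes.toBytes("Acme Corp"));
            table.put(put);
            // get: retrieval is by row key only
            Result result = table.get(new Get(Bytes.toBytes("user:123")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("brand"))));
        }
    }
}
```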
TurnerGuy (Free Member)
I assume you are a DBA
no, programmer – 10 years C, 12 years C++, 5 years C#
I like relational DBs, but I’m not at all tied to them.
Just wary of the waves of hype that come along at regular intervals in the computing industry…
molgrips (Free Member)
Well never mind the hype, I’m only interested in the technology 🙂 Tools for jobs.
Does a clustered DB replicate the same data on all nodes? Cos if it does, that’s a major difference.
TurnerGuy (Free Member)
A clustered DB would, but with some relational DBs you can partition on the clustering key so the data is on different storage devices, which in a network file system would mean that the different devices could be on different machines, maybe.
So the same idea probably.
molgrips (Free Member)
Not really. With Hadoop the data is distributed around all nodes, with some duplication for redundancy. The Java code is run on the nodes themselves and then the results are combined.
The topic ‘Big Data’ is closed to new replies.