[Closed] Big Data

125 Posts
40 Users
0 Reactions
234 Views
Posts: 12079
Full Member
 

I'm assuming that you're not heavily involved with BI stuff at the moment? I'd view a foundation in BI as the way to move into something like Big Data. If you have a Java background, some of the end-user BI tools use Java, and a lot of smaller firms like custom development. Certainly Cognos Report Studio is quite lucrative if you can do clever stuff that the standard tool can't do (God knows why, though - it makes upgrading a nightmare). The arse has fallen out of the Cognos market, mind, with contract rates being pretty low (£300-ish a day for a report writer, if you can find a role). Maybe look at Tableau and QlikView as other possible toolsets that everyone currently wants. Once into reporting, move into the ETL side of things and Bob's your mother's brother.

Cheers, that's pretty much what I was thinking - a combination of courses to get the theoretical background, then moving slowly into the stuff via Java.


 
Posted : 22/10/2014 12:38 pm
Posts: 0
Free Member
 

I think the best way into it for technical people without much analytics experience is to download R and try and enter some of the Kaggle competitions. You will get an idea of whether you like the work, will make some useful contacts and can put it on your CV if you do well in a few of them.

To me 'Big Data' means machine learning and not just standard BI on big datasets.


 
Posted : 22/10/2014 3:34 pm
Posts: 23296
Free Member
 

I think the best way into it for technical people without much analytics experience is to download R and try and enter some of the Kaggle competitions. You will get an idea of whether you like the work, will make some useful contacts and can put it on your CV if you do well in a few of them.

That's kinda what I had in mind.


 
Posted : 22/10/2014 3:36 pm
Posts: 0
Free Member
 

For those getting into Data and Analytics I'd recommend brushing up on your maths and stats. A lot of what we do involves regression analysis, stochastic modelling, cluster analysis and sentiment analysis (for customer data), and the mathematical modelling required to deliver the insights is often of greater importance to a client than the toolset. Prototyping in Excel is still accepted, but real understanding of SAS and R is in demand, particularly if you can couple it with a visualisation tool like Tableau, QlikView or SAS VA.

I'm hiring if anyone wants a job - mail in profile. Serious contenders only. Looking for someone who can deliver, evangelise and help me further develop our propositions... £60k to £80k depending on experience, and Leeds-based.


 
Posted : 22/10/2014 3:44 pm
Posts: 91097
Free Member
Topic starter
 

Hah, this tells a story. Looking for big data sets online I found US census data.

1980 - 5GB
1990 - 50GB
2000 - 200GB

🙂


 
Posted : 23/10/2014 1:02 pm
Posts: 12079
Full Member
 

For those getting into Data and Analytics I'd recommend brushing up on your maths and stats. A lot of what we do involves regression analysis, stochastic modelling, cluster analysis and sentiment analysis (for customer data), and the mathematical modelling required to deliver the insights is often of greater importance to a client than the toolset. Prototyping in Excel is still accepted, but real understanding of SAS and R is in demand, particularly if you can couple it with a visualisation tool like Tableau, QlikView or SAS VA.

I've been doing Coursera courses like mad - definitely need to brush up on stats... R itself, however, is very easy when you've got a programming background.


 
Posted : 24/10/2014 6:29 am
Posts: 0
Free Member
 

Hah, this tells a story. Looking for big data sets online I found US census data.

I suppose you could also factor processing time requirements into the equation - smaller data sets become 'big' when you have time restrictions on analysing them.

Processing seems mostly to be done using MapReduce-type technology (simple key-value databases and eventually consistent technology), with some ETL then needed to perform BI analysis from a relational data source - or at least that is my impression.

O'Reilly have several 'Big Data' and data analysis books - if you have an account or are on certain of their email lists, they offer 50% off their ebooks at fairly regular intervals.


 
Posted : 24/10/2014 7:20 am
Posts: 3317
Full Member
 

This:

get clients to think about the outcomes they want from their data

And like some have said or suggested, big data remains just that without the right business thinking and mathematical insight.

Something I hadn't realised is that I've been around data long enough for the things we used to do out of necessity to have turned into an 'exciting' industry. Looks like there might be more varied employment opportunities in the future than I'd reckoned on.

Anyone playing with 'big data' in Neo4j & Cypher rather than the usual NoSQL stores?


 
Posted : 24/10/2014 7:46 am
Posts: 3317
Full Member
 

@mogrim yeees, R is 'easy' if you've a programming background but its syntax sometimes seems more cryptic than, say, SAS, C, or Python. Perhaps 'cryptic' is harsh, maybe I mean 'concise'?


 
Posted : 24/10/2014 7:49 am
Posts: 13594
Free Member
 

1980 - 5GB
1990 - 50GB
2000 - 200GB

Is 200GB really big these days, especially census stuff, where you have all the time in the world to mine it?


 
Posted : 24/10/2014 7:58 am
Posts: 91097
Free Member
Topic starter
 

No it's not - I was just commenting on how it's grown over the years. Of course there aren't that many more people in the US; they've just stored more data because they can.

I've been working on a fairly academic proof of concept to demonstrate an idea I've had to show off our products, and in order to do this I need a sample data set. But in attempting to find the right one I've actually demonstrated to myself all the things that have been mentioned in this thread by the analyst folk, but in a sort of backwards way - rather than choosing the questions carefully, I've already got a style of question and I need to find the data set that will contain non-trivial answers 🙂


 
Posted : 24/10/2014 8:04 am
Posts: 3317
Full Member
 

@molgrips unsure what your PoC idea is, but the UK [url= http://data.gov.uk/data/search ]open.gov[/url] initiative, its [url= http://www.opengov.se/sidor/english/ ]Swedish[/url] analogues, and the [url= https://open.fda.gov ]FDA[/url]'s data sources are 'big' in that there's a fair volume of data. What I find exciting about them, though, is the ontological challenges even in allegedly the same domain. I'd consider that the biggest 'big data' challenge.


 
Posted : 24/10/2014 8:15 am
Posts: 91097
Free Member
Topic starter
 

The idea is business rules with conditions that match items in large data sets. So if it were, say, social media data, I'd want to say 'collect all the people who've mentioned Acme Corp in the last week, and then if any of these people have also complained about CRC then send them an email', or similar. Or if it were, say, financial transactions, I could say 'if there are any accounts that have had three transactions under £1 from online retailers on this list then raise a flag' - that kind of thing.

It needs to be multiple levels of searching to whittle down a big data set into a non-big subset that I can then work with. I had a look at US Census data but it's going to take far too long to figure it out. I also want it in some kind of flat file, CSV or something like that.
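
To give a flavour, something like this is all it needs to be at heart - a throwaway sketch in plain Java with made-up CSV columns (user, subject, sentiment), and with the date filter left out for brevity:

[code]
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RuleSketch {

    // One row of a hypothetical flat file: user,subject,sentiment
    record Mention(String user, String subject, String sentiment) {
        static Mention parse(String line) {
            String[] f = line.split(",");
            return new Mention(f[0], f[1], f[2]);
        }
    }

    public static void main(String[] args) throws Exception {
        List<Mention> mentions;
        try (var lines = Files.lines(Path.of("mentions.csv"))) {
            mentions = lines.map(Mention::parse).collect(Collectors.toList());
        }

        // Level 1: everyone who has mentioned Acme Corp
        Set<String> mentionedAcme = mentions.stream()
            .filter(m -> m.subject().equals("Acme Corp"))
            .map(Mention::user)
            .collect(Collectors.toSet());

        // Level 2: of those, anyone who has also complained about CRC
        Set<String> toEmail = mentions.stream()
            .filter(m -> m.subject().equals("CRC"))
            .filter(m -> m.sentiment().equals("complaint"))
            .map(Mention::user)
            .filter(mentionedAcme::contains)
            .collect(Collectors.toSet());

        toEmail.forEach(u -> System.out.println("send email to " + u));
    }
}
[/code]

Trivial in memory, obviously - the interesting bit is making each level of whittling run against a data set that won't fit on one machine.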


 
Posted : 24/10/2014 9:01 am
Posts: 0
Free Member
 

Folks seem to be conflating Big Data and big data sets. Having lots of data which you set out to collect is not Big Data. What separates Big Data from most of the descriptions here is that Big Data is data that is collected incidentally, as a by-product of other business processes.


 
Posted : 24/10/2014 9:22 am
Posts: 91097
Free Member
Topic starter
 

What separates Big Data from most of the descriptions here is that Big Data is data that is collected incidentally, as a by-product of other business processes.

Don't agree. Otherwise it'd be called Incidental Data not Big Data.

If it's too big for a relational DB, it's big and it's data so it's Big Data.


 
Posted : 24/10/2014 9:23 am
Posts: 0
Free Member
 

[url= http://www.adambarker.org/papers/bigdata_definition.pdf ]Here's a review paper by a couple of my colleagues that reviews some of the differing interpretations of "Big Data".[/url] Their conclusion:


Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:

Size: the volume of the datasets is a critical factor.

Complexity: the structure, behaviour and permutations of the datasets is a critical factor.

Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.

An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.


 
Posted : 24/10/2014 9:26 am
 dazh
Posts: 13302
Full Member
 

If it's too big for a relational DB, it's big and it's data so it's Big Data.

Where I work a lot of the engineers think that relational DBs have the same limitations as Excel spreadsheets. I actually had an engineer tell me I should be using big data tools for a dataset which contains a few million records because it's too big for a database.


 
Posted : 24/10/2014 9:33 am
Posts: 0
Free Member
 

Don't agree. Otherwise it'd be called Incidental Data not Big Data.

Yeah, very good point, I hadn't thought of that!

But then, following your logic, it would just be called too much data for a relational DB.


 
Posted : 24/10/2014 9:38 am
Posts: 91097
Free Member
Topic starter
 

Yeah, or Big for short 🙂


 
Posted : 24/10/2014 9:48 am
Posts: 0
Free Member
 

But then, following your logic, it would just be called too much data for a relational DB.

But what if you set up the data on a relational DB such that it is structured as just key/value pairs, with the key as the clustering index?

What is the difference between that and a NoSQL database? Not much - just the way the data is queried, not the amount of data that the database can handle.
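
Roughly what I mean, as a toy sketch - H2 in-memory here purely because it's easy to run (the MERGE upsert syntax is H2's and other engines spell it differently; needs the H2 jar on the classpath, and the table name is made up):

[code]
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KvOnRelational {
    public static void main(String[] args) throws Exception {
        // Any relational engine will do; H2 in-memory keeps the sketch runnable
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:kv");

        // A NoSQL-style schema: one key column (the clustering/primary index)
        // and one opaque value column
        conn.createStatement().execute(
            "CREATE TABLE kv (k VARCHAR(255) PRIMARY KEY, v CLOB)");

        // put(key, value)
        PreparedStatement put = conn.prepareStatement(
            "MERGE INTO kv (k, v) VALUES (?, ?)");
        put.setString(1, "user:42");
        put.setString(2, "{\"mentions\":[\"Acme Corp\"]}");
        put.executeUpdate();

        // get(key): a single lookup on the clustering index, which is
        // exactly the access pattern of a key-value store
        PreparedStatement get = conn.prepareStatement(
            "SELECT v FROM kv WHERE k = ?");
        get.setString(1, "user:42");
        try (ResultSet rs = get.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
[/code]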


 
Posted : 24/10/2014 9:57 am
Posts: 91097
Free Member
Topic starter
 

What is the difference between that and a NoSql database

Well a NoSQL database is backed by a distributed file system, and work on it is done as map/reduce jobs. So you can increase the number of nodes more or less indefinitely and store any amount of data.

Can you put a petabyte of data in a relational DB?

With Hadoop, the data lives on the nodes, and the code is run where the data is. That's different to a relational DB afaik.
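
For a concrete picture, this is roughly the standard Hadoop word-count tutorial job - the map and reduce methods run out on the nodes that hold the data blocks, and only the small per-key results travel back:

[code]
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Runs on each node, over the HDFS blocks stored locally there
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }

    // Receives all the counts for one word and sums them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
[/code]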


 
Posted : 24/10/2014 10:02 am
Posts: 13594
Free Member
 

Have to agree that size is not all. If you work with GIS tools, high-res maps can be hundreds of GB, but the complexity is trivial, so it's not hard at all to process them...


 
Posted : 24/10/2014 10:37 am
Posts: 0
Free Member
 

Well a NoSQL database is backed by a distributed file system, and work on it is done as map/reduce jobs.

That's an implementation detail of the database - a relational database could store its data in a distributed manner if it were written to do so.

A relational database with a key and a single piece of data is fundamentally the same as most of these 'NoSQL' databases.

Map/reduce is just a technique - you could use this same technique on key/value data stored in a relational database if you wanted to.
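
For instance, plain Java streams already express the same map and reduce over key/value rows - a toy in-memory sketch with made-up data, but the technique is identical whatever store the rows came out of:

[code]
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceAsTechnique {

    // A row as it might come back from "SELECT k, v FROM kv" - or from
    // any key/value store; the source doesn't matter to the technique
    record Row(String key, long value) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("acme", 3), new Row("crc", 1),
            new Row("acme", 2), new Row("crc", 5));

        // map: extract the key; reduce: sum the values per key
        Map<String, Long> totals = rows.stream()
            .collect(Collectors.groupingBy(
                Row::key,
                Collectors.summingLong(Row::value)));

        System.out.println(totals); // e.g. {acme=5, crc=6}
    }
}
[/code]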

At the end of the map/reduce stages you will also have a lot less data to analyse, and many of the BI tools still run against a relational database, so you will need to ETL into them.


 
Posted : 24/10/2014 10:59 am
Posts: 91097
Free Member
Topic starter
 

Well that's fine then. Tools for jobs 🙂

Incidentally IBM have an SQL database that's backed by Hadoop.


 
Posted : 24/10/2014 11:03 am
Posts: 0
Free Member
 

A problem with all this Big Data stuff, and the consequent explosion in cloud usage which has accompanied it, is the eventual-consistency aspect: it can make writing transactional systems challenging, because per the CAP theorem a distributed system can't give you consistency, availability and partition tolerance all at once, and cloud platforms generally trade consistency away.

I have also seen cases where people have lost data for 24 hours (http://www.stackdriver.com/eventual-consistency-really-eventual/), which doesn't matter so much when that data is 'Big' and somewhat redundant, but does matter in many situations if you 'lose' a queued message for a long period.

Microsoft's Azure seems to me to be a better infrastructure for cloud usage than AWS, as they seem to have put some serious thought into this problem.


 
Posted : 24/10/2014 11:18 am
Posts: 8024
Full Member
 

Incidentally IBM have an SQL database that's backed by Hadoop.

Not sure they do - isn't it an SQL engine that can query Hadoop?


 
Posted : 24/10/2014 1:33 pm
Posts: 0
Free Member
 

Microsoft's Azure seems to me to be a better infrastructure for cloud usage than AWS, as they seem to have put some serious thought into this problem.

The Azure platform does look pretty impressive. I've not played with it, but the infrastructure looks sound and their front-end tools for making use of your data look super easy to use.


 
Posted : 24/10/2014 2:07 pm
Posts: 91097
Free Member
Topic starter
 

Big SQL, for Version 3.0 and later, is a massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) cluster.

This SQL engine pushes processing down to the same nodes that hold the data. Big SQL uses a low-latency parallel execution infrastructure that accesses Hadoop data natively for reading and writing.


 
Posted : 24/10/2014 2:13 pm
Posts: 8024
Full Member
 

Exactly - an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.


 
Posted : 24/10/2014 3:34 pm
Posts: 0
Free Member
 

Exactly - an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.

And what is the difference between that and a proper SQL database that just stores data as a key plus one big column holding the data?

At the lowest level of a database you have the leaf nodes of something like a B-tree with a clustering index, which is going to be pretty similar between a relational database and a KV NoSQL database - so the difference comes down to how the management of the leaf-node storage is implemented.
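
As a toy way to see it, a java.util.TreeMap (an ordered tree standing in for the B-tree leaf level - real engines obviously page to disk) supports exactly the operations both kinds of database are built on:

[code]
import java.util.SortedMap;
import java.util.TreeMap;

public class LeafNodeView {
    public static void main(String[] args) {
        // An ordered map over the clustering key - conceptually the leaf
        // level of a B-tree, whether the engine calls itself relational
        // or key-value
        TreeMap<String, String> index = new TreeMap<>();
        index.put("acct:001", "...row or value bytes...");
        index.put("acct:002", "...row or value bytes...");
        index.put("acct:117", "...row or value bytes...");

        // Point lookup: a KV store's get(), or "WHERE k = ?" against a
        // clustered index
        System.out.println(index.get("acct:002"));

        // Range scan in key order: equally natural in both worlds
        SortedMap<String, String> range = index.subMap("acct:000", "acct:100");
        range.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
[/code]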


 
Posted : 24/10/2014 3:50 pm
Posts: 0
Free Member
 

Interesting paper on how Azure offers strong consistency for storage writes, which Amazon can't do.


 
Posted : 24/10/2014 4:07 pm
Posts: 91097
Free Member
Topic starter
 

And what is the difference between that and a proper SQL database that just stores data as a key plus one big column holding the data?

How many nodes can your DB cluster have? How much data can you manage? When you want to transform data rather than simply retrieve it, how can you do that on all the nodes at once?

Genuine questions btw, not arguing with you.


 
Posted : 24/10/2014 4:13 pm
Posts: 91097
Free Member
Topic starter
 

Exactly - an SQL engine that queries Hadoop, not an SQL DB with Hadoop behind it.

What's the difference then?


 
Posted : 24/10/2014 4:16 pm
Posts: 0
Free Member
 

Genuine questions btw, not arguing with you.

Just trying to say that Big Data is not a 'revolution', just an evolution.

The problems of the size of the data and distributed computing with commodity hardware have meant that the NoSQL databases have all the distributed-filesystem and tool support, but there is little technically to say that normal relational databases couldn't also have this, if the problem domain demanded it.

But with a relational database you normally have strong consistency, which the whole NoSQL stack and its toolsets fail badly at.

I don't do Big Data but was just looking at cloud technology recently when we were considering how/when/if we might move our system to it.

Azure looks a lot more viable partly due to the above paper.

And you can run Azure in your own data centre, so if your system is transactional/strongly consistent/performance-critical then it looks best to me to architect it to run on Azure in-house and then do that 'cloud-bursting' thing when you need extra capacity.

If you go solely cloud then you have more issues with keeping failover sites/clouds up to date due to latencies across the clouds.


 
Posted : 24/10/2014 4:25 pm
Posts: 0
Free Member
 

When you want to transform data rather than simply retrieve it, how can you do that on all the nodes at once?

You can cluster relational databases now anyway, and they will apply the same changes to all nodes - even MySQL will do it.


 
Posted : 24/10/2014 4:26 pm
Posts: 91097
Free Member
Topic starter
 

OK... but is a clustered DB the same as a Hadoop cluster? I'm not so sure. Does a cluster have the same data on every node?

But with a relational database you normally have strong consistency, which the whole NoSQL stack and its toolsets fail badly at.

HBase, the NoSQL DB you are probably thinking of, is in no way a replacement for a relational DB. I assume you're a DBA - you are talking like someone who feels the market is overlooking the area in which you are skilled.. 🙂

Just trying to say that Big Data is not a 'revolution', just an evolution.

You'll not hear anything from me to say otherwise. But Hadoop doesn't evolve from relational DBs, it comes from what used to be called parallel computing in the scientific world.


 
Posted : 24/10/2014 4:40 pm
Posts: 0
Free Member
 

I assume you're a DBA

No - programmer: 10 years C, 12 years C++, 5 years C#.

I like relational DBs but I'm not tied to them at all.

Just wary of the waves of hype that come along at regular intervals in the computing industry...


 
Posted : 24/10/2014 4:49 pm
Posts: 91097
Free Member
Topic starter
 

Well never mind the hype, I'm only interested in the technology 🙂 Tools for jobs.

Does a clustered DB replicate the same data on all nodes? Cos if it does, that's a major difference.


 
Posted : 24/10/2014 5:06 pm
Posts: 0
Free Member
 

A clustered DB would, but with some relational DBs you can partition on the clustering key so the data is on different storage devices - which, on a network file system, could mean the different devices are on different machines.

So the same idea probably.


 
Posted : 26/10/2014 11:49 am
Posts: 91097
Free Member
Topic starter
 

Not really. With Hadoop the data is distributed around all nodes, with some duplication for redundancy. The Java code is run on the nodes themselves and then the results are combined.


 
Posted : 26/10/2014 6:05 pm
Posts: 0
Free Member
 

SQL-alike systems achieving parallel ACID transactions?

I find these debates start up frequently with a cloud/big data/FOSS themewash - but personally I have never seen one outrun the constraints of ACID + CAP. Very keen to hear from anyone 'doing it' though.

I guess it all depends if 'close' is good enough though - normally I find that for genuinely transactional DBs it is not, but for decision support, research and analytics it often is.

http://en.wikipedia.org/wiki/CAP_theorem

(The strong/light/cheap trade-off, as ever?)

It's interesting that some of these techniques/techs are used in health informatics - but I also understand this is exploratory vs diagnostic/critical.

Edit: Hive, Cloudera and a few others give an SQL-like API to Hadoop. Does anyone think that means it's now an SQL-alike system, though? (I wonder.)

Interesting topic and some great comments!


 
Posted : 26/10/2014 6:59 pm
Posts: 91097
Free Member
Topic starter
 

I don't think you can compare Hadoop etc. to an RDBMS. Not really the same thing at all.


 
Posted : 26/10/2014 7:11 pm
Posts: 0
Free Member
 

That's not what the Hadoop vendors will have you believe, molgrips. The new technology opens up some great opportunities to provide business value, but it's also disruptive, with vendors making unrealistic claims about what their technology can do and techies wanting to kick tyres and get all the latest buzzwords on their LinkedIn profiles.


 
Posted : 26/10/2014 7:31 pm
Posts: 0
Free Member
 

I agree.

To me it's = sht and shinola - to others it may be labelled a mere quibble from the Jurassic specialists though...

After the dust settles, the requests to ACID-ify the cloudwashed solution - using infrastructure magic (TM) - might even start a whole new round of 'emerging' ACID-compliant RDBMS tech solutions. Anyone care to coin some phrases/TLAs/buzzwords at this early alpha phase? Whilst we go round the buoy again..


 
Posted : 26/10/2014 7:39 pm
Posts: 0
Free Member
 

Data wrangling seems to be a current buzzword doing the rounds, bonchance.

I mentioned @BigDataBorat (Learnings of Big Data for Make Nation of Kazakhstan #1 Leading Data Scientist Nation) earlier in the thread, and I particularly like this tweet:

Vendor tell me they are ANSI SQL compliant: They comply with every part that was easy to implement.


 
Posted : 26/10/2014 7:53 pm
Posts: 91097
Free Member
Topic starter
 

That's not what the Hadoop vendors will have you believe, molgrips.

Lol quite.. Salesmen in bullshit shocker.. Glad I don't deal with them 🙂


 
Posted : 26/10/2014 10:49 pm