Data science
 


[Closed] Data science

12 Posts
9 Users
0 Reactions
81 Views
Posts: 363
Free Member
Topic starter
 

Anyone on here work in Data Science, specifically with R?

I'm working on a project to get data science going, and we are running into some issues. My background is in data warehousing, working with MS SQL, SSIS and SSAS. The data scientist we have is not great technically, so there is a disconnect.

Basically he is running out of RAM when running his queries in R, even on a server with 140GB of RAM. From what I understand, some stats functions end up duplicating the data, meaning you need a huge amount more RAM than the data set? This seems illogical to me as a database person.

We are looking at Revolution Analytics (now MS) or Spark, but I don't think we actually have that big a data set compared to what people using Hadoop and Spark are talking about, so I don't understand why this is so difficult.
Any pointers?
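
For anyone curious to see the duplication in action, here is a minimal base-R sketch (not our actual code; the object names are made up). R copies objects when you modify them, and model-fitting functions keep extra copies of the data inside the fitted result, so peak RAM can be several times the raw data size:

# tracemem() needs an R build with memory profiling (the CRAN binaries have it)
x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
print(object.size(x), units = "MB")   # size of the original data frame

tracemem(x)                           # report whenever R copies this object
x$a <- x$a * 2                        # modifying a column triggers a copy

# lm() builds a model frame and keeps a copy of the data inside the result,
# which is one reason a job can need far more RAM than the table itself
fit <- lm(b ~ a, data = x)
print(object.size(fit), units = "MB")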


 
Posted : 03/03/2016 10:13 pm
Posts: 17
Free Member
 

Damn, can't find the good "What is R?" slide...

(Something Pirates Say)

Nothing as useful as that, but have you tried some light reading?
https://cran.r-project.org/manuals.html


 
Posted : 03/03/2016 10:19 pm
Posts: 363
Free Member
Topic starter
 

I have; the problem is I don't really get along with stats, so when they start talking about the functions and examples it just makes my head hurt.
I work very well with data, and databases, but this area gets into the stats so quickly that I get lost.

Plus, there isn't a section titled "R memory management"


 
Posted : 03/03/2016 10:24 pm
Posts: 17
Free Member
 

Just having my getting-going-at-work cup of tea, so really just looking out of curiosity (and in case I have to deal with it), but is
http://adv-r.had.co.nz/memory.html
any use? I'd probably try an R wiki/forum for specifics, or tell the user to sort it out 😉


 
Posted : 03/03/2016 10:31 pm
Posts: 0
Free Member
 

I hate R so very much.

I am of no help to you.


 
Posted : 03/03/2016 10:36 pm
Posts: 11
Free Member
 

The only thing I am aware of in R is that you should have approximately three times as much RAM as the dataset. Since I've got to start writing some infrastructure architect docs on this soon, could you confirm whether that is true 😀

I suppose you could always create a temporary monster server in AWS or Azure to see just how much RAM his query needs - it might be an expensive way to find out...
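
Before provisioning, one cheap sanity check is to measure the peak memory an actual run uses. A rough base-R sketch (run_the_query() is just a hypothetical stand-in for the real script):

gc(reset = TRUE)            # reset the "max used" counters
result <- run_the_query()   # hypothetical placeholder for the heavy R job
gc()                        # the "max used" column now shows peak memory (MB) since the reset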


 
Posted : 03/03/2016 10:51 pm
Posts: 17254
Full Member
 

SAS. I hate R. You get what you pay for. I know that is no help. If it is just descriptive stats, I might write my own code instead.


 
Posted : 04/03/2016 12:47 am
Posts: 4954
Free Member
 

You can rewrite the computationally heavy part in C/C++ and call it from R. Simples.

I bet, though, that in R there are options to define whether you do deep copies or reference copies, work through the data in sequence, or load it all at once and work on the entire data set, etc.

Sounds more like a knowledge issue with the R guy than an R limitation, as I've heard of R being used on very big data sets.
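
On the deep-copy vs reference-copy point, one concrete option (my suggestion, not something the project is necessarily using) is the data.table package, which updates columns in place rather than copying the whole table. A sketch with invented column names:

library(data.table)

dt <- data.table(country_key = sample(1:200, 1e7, replace = TRUE),
                 value       = rnorm(1e7))

# ':=' modifies the column by reference - no copy of the 10-million-row table
dt[, value := value * 2]

# grouping/aggregation also works without copying the input
by_country <- dt[, .(mean_value = mean(value)), by = country_key]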


 
Posted : 04/03/2016 6:44 am
Posts: 363
Free Member
Topic starter
 

It's not just descriptive stats; he is doing some pretty heavy analysis of our data.

C/C++ is not going to happen, and I'm sure you can code R better to avoid deep copies, but from what I understand, if you are doing certain things it will copy the data.

Another question: one way to reduce the data size would be to use IDs instead of names (e.g. CountryKey = 1 rather than CountryName = 'United Kingdom'). This would make the table of values much smaller. Then, once you have the results, join them to the country table to get the descriptive name. Is something like this possible in R?
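
For illustration, here is roughly the pattern I have in mind as a base-R sketch (invented table and column names), assuming base R's aggregate() and merge() are the right tools:

# small lookup table: CountryKey -> CountryName
countries <- data.frame(CountryKey  = 1:3,
                        CountryName = c("United Kingdom", "France", "Germany"))

# big fact table keeps only the integer key
facts <- data.frame(CountryKey = sample(countries$CountryKey, 1e6, replace = TRUE),
                    Sales      = runif(1e6))

# do the heavy work on the compact table...
totals <- aggregate(Sales ~ CountryKey, data = facts, FUN = sum)

# ...then join the names back on at the end
totals <- merge(totals, countries, by = "CountryKey")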


 
Posted : 04/03/2016 8:31 am
Posts: 34055
Full Member
 

I know f all about programming, but speaking as a scientist: bioinformatics guys use R to do all the analyses on big datasets from genome sequencing, so is it worth looking to see how that's done?


 
Posted : 04/03/2016 8:57 am
Posts: 363
Free Member
Topic starter
 

I found this if anyone is interested:
http://theodi.org/blog/fig-data-11-tips-how-handle-big-data-r-and-1-bad-pun


 
Posted : 04/03/2016 11:22 am
Posts: 12077
Full Member
 

AFAIK from the couple of R courses I've done: R just sucks everything into memory and works on that. So if you've got a big data structure, then copy it, then slice it and dice it... it all sits in memory.

If you can, set up a pipeline: use a Perl/Python/Ruby script to clean up the data, then do process 1 and output to disk, then process 2, etc. It also makes it easier to debug.
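
Something like this base-R sketch of the chunk-and-write-to-disk idea (file names, column names and chunk size all invented):

con <- file("big_input.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # keep the header row

chunk_size <- 100000
part <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
    error = function(e) NULL                          # no rows left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # "process 1": reduce each chunk and write the partial result to disk
  partial <- aggregate(value ~ group, data = chunk, FUN = sum)
  part <- part + 1
  write.csv(partial, sprintf("partial_%03d.csv", part), row.names = FALSE)
}
close(con)

# "process 2" then only has to combine the small partial files
parts <- lapply(list.files(pattern = "^partial_.*\\.csv$"), read.csv)
final <- aggregate(value ~ group, data = do.call(rbind, parts), FUN = sum)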


 
Posted : 04/03/2016 11:28 am
Posts: 300
Full Member
 

We use Perl/Python for all our NGS bioinformatics rather than R. It's a painful language.

Sounds like you need to set some hard memory limits or stream the data?
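
On the hard-limits side, a minimal sketch (memory.limit() only does anything on Windows; on Linux you'd cap the process with ulimit instead; the numbers are invented):

memory.limit()                 # report the current cap, in MB (Windows only)
memory.limit(size = 131072)    # ask for a 128 GB cap

# On Linux/Mac, limit the R process from the shell before starting it, e.g.
#   ulimit -v 134217728        # cap virtual memory at 128 GB (value is in KB)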

This looked like a useful link from the [url=https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html]R docs[/url].

I assume they've looked on the font of all knowledge that is StackOverflow as well?


 
Posted : 04/03/2016 11:38 am