Viewing 13 posts - 1 through 13 (of 13 total)
  • Data science
  • Shred
    Free Member

    Anyone on here work in Data Science, specifically with R?

    I’m working on a project to get data science going, and we are running into some issues. My background is in data warehousing, working with MS SQL, SSIS and SSAS. The data scientist we have is not great technically, so there is a disconnect.

    Basically he is running out of RAM when running his queries in R, even on a server with 140GB of RAM. From what I understand, some stats functions end up duplicating the data, meaning you need hugely more RAM than the size of the data set? This seems illogical to me as a database person.
    We are looking at Revolution Analytics (now part of MS) and Spark, but I don’t think we actually have that big a data set compared to what people using Hadoop and Spark are talking about, so I don’t understand why this is so difficult.
    Any pointers?
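    For what it’s worth, the duplication described above is real: R uses copy-on-modify semantics, so an innocent-looking assignment can silently double memory use. A minimal sketch in base R (variable names are just for illustration):

```r
# Copy-on-modify: 'y' is a second name for the same memory as 'x'
# until one of them is modified, at which point R duplicates the
# whole vector behind the scenes.
x <- runif(1e6)                       # ~8 MB of doubles
y <- x                                # no copy yet, just another name
y[1] <- 0                             # now the full vector is duplicated

print(object.size(x), units = "MB")   # each live copy costs this much
```

    In an interactive session, `tracemem(x)` will print a message at the moment the copy actually happens, which is a handy way to find out which steps of an analysis are duplicating data.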

    mikewsmith
    Free Member

    Damn, can’t find the good “What is R” slide…

    (Something Pirates Say)

    Nothing as useful as have you tried some light reading?
    https://cran.r-project.org/manuals.html

    Shred
    Free Member

    I have; the problem is I don’t really get along with stats, so when they start talking about the functions and examples it just makes my head hurt.
    I work very well with data and databases, but this area gets into the stats so quickly that I get lost.

    Plus, there isn’t a section titled “R memory management”

    mikewsmith
    Free Member

    Just having my getting-going-at-work cup of tea, so really just looking out of curiosity (and in case I have to deal with it), but
    http://adv-r.had.co.nz/memory.html
    any use? I’d probably try an R Wiki/forum for specifics or tell the user to sort it out 😉

    GregMay
    Free Member

    I hate R so very much.

    I am of no help to you.

    BigEaredBiker
    Free Member

    The only thing I am aware of in R is that you should have approximately three times as much RAM as the dataset. Since I’ve got to start writing some infrastructure architect docs on this soon could you confirm if that is true 😀

    I suppose you could always create a temporary monster server in AWS or Azure to find out just how much RAM his query needs – it might be an expensive way to find out…

    TiRed
    Full Member

    SAS. I hate R. You get what you pay for. I know that is no help. If it is just descriptive stats I might write my own code instead.

    TheBrick
    Free Member

    You can rewrite the computationally heavy parts in C/C++ and call them from R. Simples.

    I bet, though, that R has options to control whether you make deep copies or reference copies, and whether you work through the data in sequence or load it all at once and work on the entire data set, etc.

    Sounds more like a knowledge gap on the R guy’s part than an R limitation, as I’ve heard of R being used on very big data sets.
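    A sketch of the C++ route, assuming the Rcpp package is installed — `cppFunction()` compiles the snippet on the fly and exposes it as an ordinary R function (the function name here is made up):

```r
library(Rcpp)

# Compile a small C++ loop and make it callable from R. Because the
# loop works in place on the vector it receives, no intermediate
# copies of the data are created.
cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i];
  return total;
}')

sumC(c(1, 2, 3))   # equivalent to sum(c(1, 2, 3))
```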

    Shred
    Free Member

    It is not just descriptive stats, he is doing some pretty heavy analysis of our data.

    C/C++ is not going to happen, and I’m sure you can write R code that avoids deep copies, but from what I understand, if you are doing certain things it will copy the data anyway.

    Another question: one way to reduce the data size would be to use IDs instead of names (e.g. CountryKey = 1 rather than CountryName = ‘United Kingdom’). This would make the table of values much smaller. Then, once you have the results, join them to the country table to get the descriptive name. Is something like this possible in R?
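    On the IDs-instead-of-names question: R’s factor type does more or less exactly this automatically — it stores one small integer code per row plus a lookup table of the unique labels (the “levels”). A quick sketch:

```r
# A factor stores integer codes per row plus a table of unique labels.
country <- factor(c("United Kingdom", "France", "United Kingdom"))

as.integer(country)   # compact integer codes (levels sort alphabetically)
levels(country)       # the small lookup table of unique names

# "Joining" the codes back to their labels is just indexing:
levels(country)[as.integer(country)]
```

    For a long column with few unique values, the factor version is much smaller than a plain character vector, which is the same space saving as the CountryKey/CountryName split.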

    kimbers
    Full Member

    I know f-all about programming, but speaking as a scientist: our bioinformatics guys use R to do all the analyses on big data sets from genome sequencing. Is it worth looking to see how that’s done?

    Shred
    Free Member
    mogrim
    Full Member

    AFAIK from the couple of R courses I’ve done: R just sucks everything into memory and works on that. So if you’ve got a big data structure, then copy it, then slice it and dice it… it all stays in memory.

    If you can, set up a pipeline: use a Perl/Python/Ruby script to clean up the data, then run process 1 and output to disk, then process 2, etc. It also makes it easier to debug.
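    The staged approach can also be done entirely within R: write each intermediate result to disk and free memory between steps, so only one stage’s data is resident at a time. A sketch with made-up file and column names:

```r
# Stage 1: clean the raw data, write the result to disk, free memory.
stage1 <- function() {
  raw <- read.csv("raw_data.csv")
  clean <- raw[complete.cases(raw), ]   # drop rows with missing values
  saveRDS(clean, "clean.rds")
  rm(raw, clean); gc()                  # release memory before the next stage
}

# Stage 2: read the cleaned data back and aggregate it.
stage2 <- function() {
  clean <- readRDS("clean.rds")
  result <- aggregate(value ~ group, data = clean, FUN = mean)
  saveRDS(result, "result.rds")
}

# Demo with a tiny made-up data set:
write.csv(data.frame(group = c("a", "a", "b"), value = c(1, 2, 3)),
          "raw_data.csv", row.names = FALSE)
stage1()
stage2()
readRDS("result.rds")   # group means: a = 1.5, b = 3
```

    The intermediate `.rds` files double as debugging checkpoints — if stage 2 misbehaves you can inspect `clean.rds` without re-running stage 1.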

    sboardman
    Full Member

    We use Perl/Python for all our NGS bioinformatics rather than R. It’s a painful language.

    Sounds like you need to set some hard memory limits or stream the data?

    This looked like a useful link from the R docs.

    I assume they’ve looked on the font of all knowledge that is StackOverflow as well?
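    On the streaming point: base R can read a file in fixed-size chunks through an open connection, so only one chunk is ever in memory. A sketch (the function and file names are made up):

```r
# Read a CSV in chunks, keeping a running total of its first column,
# so at most 'chunk_rows' rows are in memory at any time.
process_in_chunks <- function(path, chunk_rows = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  first <- TRUE                        # only the first read consumes the header
  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = first),
      error = function(e) NULL)        # read.csv errors at end of file
    first <- FALSE
    if (is.null(chunk) || nrow(chunk) == 0) break
    total <- total + sum(chunk[[1]])   # chunk is discarded on the next loop
  }
  total
}

# Demo on a tiny file:
write.csv(data.frame(x = 1:5), "nums.csv", row.names = FALSE)
process_in_chunks("nums.csv", chunk_rows = 2)   # 1+2+3+4+5 = 15
```

    This only works for analyses that can be expressed as a running computation over chunks (sums, counts, per-group accumulators); a model fit that needs the whole matrix at once still needs the RAM.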


The topic ‘Data science’ is closed to new replies.