Viewing 13 posts - 1 through 13 (of 13 total)
  • Data science
  • Shred
    Free Member

    Anyone on here work in Data Science, specifically with R?

    I’m working on a project to get data science going, and we are running into some issues. My background is in data warehousing, working with MS SQL, SSIS and SSAS. The data scientist we have is not great technically, so there is a disconnect.

    Basically he is running out of RAM when running his queries in R, even on a server with 140GB of RAM. From what I understand, some stats functions end up duplicating the data, meaning you need hugely more RAM than the size of the data set? This seems illogical to me as a database person.
    We are looking at Revolution Analytics (now part of MS) and Spark, but I don’t think we actually have that big a data set compared to what people using Hadoop and Spark are talking about, so I don’t understand why this is so difficult.
    Any pointers?
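    For what it’s worth, the duplication described above is real: R uses copy-on-modify semantics, so an innocent-looking assignment can silently double memory use. A minimal sketch in base R (variable names are just for illustration):

```r
# Copy-on-modify: 'y' is a second name for the same memory as 'x'
# until one of them is modified, at which point R duplicates the
# whole vector behind the scenes.
x <- runif(1e6)                       # ~8 MB of doubles
y <- x                                # no copy yet, just another name
y[1] <- 0                             # now the full vector is duplicated

print(object.size(x), units = "MB")   # each live copy costs this much
```

    In an interactive session, `tracemem(x)` will print a message at the moment the copy actually happens, which is a handy way to find out which steps of an analysis are duplicating data.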

    mikewsmith
    Free Member

    Damn, can’t find the good “What is R” slide…

    (Something Pirates Say)

    Nothing as useful as have you tried some light reading?
    https://cran.r-project.org/manuals.html

    Shred
    Free Member

    I have; the problem is I don’t really get along with stats, so when they start talking about the functions and examples it just makes my head hurt.
    I work very well with data and databases, but this area gets into the stats so quickly that I get lost.

    Plus, there isn’t a section titled “R memory management”

    mikewsmith
    Free Member

    Just having my getting-going-at-work cup of tea, so really just looking out of curiosity (and in case I have to deal with it), but
    http://adv-r.had.co.nz/memory.html
    any use? I’d probably try an R Wiki/forum for specifics or tell the user to sort it out 😉

    GregMay
    Free Member

    I hate R so very much.

    I am of no help to you.

    BigEaredBiker
    Free Member

    The only thing I am aware of in R is that you should have approximately three times as much RAM as the dataset. Since I’ve got to start writing some infrastructure architect docs on this soon could you confirm if that is true 😀

    I suppose you could always create a temporary monster server in AWS or Azure to find out just how much RAM his query needs – it might be an expensive way to find out…

    TiRed
    Full Member

    SAS. I hate R. You get what you pay for. I know that is no help. If it is just descriptive stats I might write my own code instead.

    TheBrick
    Free Member

    You can rewrite the computationally heavy parts in C/C++ and call them from R. Simples.

    I bet, though, that R has options to control whether you make deep copies or reference copies, and whether you work through the data in sequence or load it all at once and work on the entire data set, etc.

    Sounds more like a knowledge gap on the R guy’s part than an R limitation, as I’ve heard of R being used on very big data sets.
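    A sketch of the C++ route, assuming the Rcpp package is installed — `cppFunction()` compiles the snippet on the fly and exposes it as an ordinary R function (the function name here is made up):

```r
library(Rcpp)

# Compile a small C++ loop and make it callable from R. Because the
# loop works in place on the vector it receives, no intermediate
# copies of the data are created.
cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i];
  return total;
}')

sumC(c(1, 2, 3))   # equivalent to sum(c(1, 2, 3))
```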

    Shred
    Free Member

    It is not just descriptive stats, he is doing some pretty heavy analysis of our data.

    C/C++ is not going to happen, and I’m sure you can write R code that avoids deep copies, but from what I understand, if you are doing certain things it will copy the data anyway.

    Another question: one way to reduce the data size would be to use IDs instead of names (e.g. CountryKey = 1 rather than CountryName = ‘United Kingdom’). This would make the table of values much smaller. Then, once you have the results, join them to the country table to get the descriptive name. Is something like this possible in R?
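    On the IDs-instead-of-names question: R’s factor type does more or less exactly this automatically — it stores one small integer code per row plus a lookup table of the unique labels (the “levels”). A quick sketch:

```r
# A factor stores integer codes per row plus a table of unique labels.
country <- factor(c("United Kingdom", "France", "United Kingdom"))

as.integer(country)   # compact integer codes (levels sort alphabetically)
levels(country)       # the small lookup table of unique names

# "Joining" the codes back to their labels is just indexing:
levels(country)[as.integer(country)]
```

    For a long column with few unique values, the factor version is much smaller than a plain character vector, which is the same space saving as the CountryKey/CountryName split.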

    kimbers
    Full Member

    I know f-all about programming, but speaking as a scientist: our bioinformatics guys use R to do all the analyses on big data sets from genome sequencing. Is it worth looking to see how that’s done?

    Shred
    Free Member
    mogrim
    Full Member

    AFAIK from the couple of R courses I’ve done: R just sucks everything into memory and works on that. So if you’ve got a big data structure, then copy it, then slice it and dice it… it all stays in memory.

    If you can, set up a pipeline: use a Perl/Python/Ruby script to clean up the data, then run process 1 and output to disk, then process 2, etc. It also makes it easier to debug.
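    The staged approach can also be done entirely within R: write each intermediate result to disk and free memory between steps, so only one stage’s data is resident at a time. A sketch with made-up file and column names:

```r
# Stage 1: clean the raw data, write the result to disk, free memory.
stage1 <- function() {
  raw <- read.csv("raw_data.csv")
  clean <- raw[complete.cases(raw), ]   # drop rows with missing values
  saveRDS(clean, "clean.rds")
  rm(raw, clean); gc()                  # release memory before the next stage
}

# Stage 2: read the cleaned data back and aggregate it.
stage2 <- function() {
  clean <- readRDS("clean.rds")
  result <- aggregate(value ~ group, data = clean, FUN = mean)
  saveRDS(result, "result.rds")
}

# Demo with a tiny made-up data set:
write.csv(data.frame(group = c("a", "a", "b"), value = c(1, 2, 3)),
          "raw_data.csv", row.names = FALSE)
stage1()
stage2()
readRDS("result.rds")   # group means: a = 1.5, b = 3
```

    The intermediate `.rds` files double as debugging checkpoints — if stage 2 misbehaves you can inspect `clean.rds` without re-running stage 1.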

    sboardman
    Full Member

    We use Perl/Python for all our NGS bioinformatics rather than R. It’s a painful language.

    Sounds like you need to set some hard memory limits or stream the data?

    This looked like a useful link from the R docs.

    I assume they’ve looked on the font of all knowledge that is StackOverflow as well?
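    On the streaming point: base R can read a file in fixed-size chunks through an open connection, so only one chunk is ever in memory. A sketch (the function and file names are made up):

```r
# Read a CSV in chunks, keeping a running total of its first column,
# so at most 'chunk_rows' rows are in memory at any time.
process_in_chunks <- function(path, chunk_rows = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  first <- TRUE                        # only the first read consumes the header
  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = first),
      error = function(e) NULL)        # read.csv errors at end of file
    first <- FALSE
    if (is.null(chunk) || nrow(chunk) == 0) break
    total <- total + sum(chunk[[1]])   # chunk is discarded on the next loop
  }
  total
}

# Demo on a tiny file:
write.csv(data.frame(x = 1:5), "nums.csv", row.names = FALSE)
process_in_chunks("nums.csv", chunk_rows = 2)   # 1+2+3+4+5 = 15
```

    This only works for analyses that can be expressed as a running computation over chunks (sums, counts, per-group accumulators); a model fit that needs the whole matrix at once still needs the RAM.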


The topic ‘Data science’ is closed to new replies.