
[Closed] Website data scraping

13 Posts
10 Users
0 Reactions
126 Views
Posts: 166
Free Member
Topic starter
 

So supposing I wanted to gather all the results data sets that are freely provided by the UCI here:
[url= http://www.uci.ch/templates/BUILTIN-NOFRAMES/Template3/layout.asp?MenuId=MjExMg ]UCI[/url]

and download whole sections of the website data at once (i.e. the whole 2013 Pro Road Season).

Is there any simple way I could do it?
Do I need to start writing code?
If so, what code? Could I do it in C#?

Cheers


 
Posted : 09/04/2013 9:12 am
 nbt
Posts: 12403
Full Member
 

I'd start by

1) ensuring that you are allowed to redistribute the results - just because you can access them doesn't mean you are allowed to republish them elsewhere

2) learning about coding and deciding what language to learn, then working out how to do a job in that language, rather than working out what language to learn to do a job

3) using the RSS feed rather than the website


 
Posted : 09/04/2013 9:21 am
Posts: 0
Free Member
 

http://www.scrapy.org/


 
Posted : 09/04/2013 9:22 am
Posts: 166
Free Member
Topic starter
 

thx, NBT

Not planning on redistributing, just playing with some data vis software.

I am an IT pro who mostly works in SQL with a smattering of C#, but web is not my forte. I was just wondering if my C# could be used to do this job (I'm not gonna teach myself a new language just to gather some cool demo data).

An RSS feed would be amazing as I can take those in no problem, but as far as I can find they only have RSS for news, not results.


 
Posted : 09/04/2013 9:28 am
Posts: 14050
Free Member
 

There is a scraper plugin for Firefox that works pretty well.


 
Posted : 09/04/2013 10:17 am
Posts: 12079
Full Member
 

If all you want to do is play with the data, the easiest way is probably just to copy the text you want into a (decent) text editor, then use regexps to change the lines into SQL INSERT statements. For tabular data like this, Excel is often an easy tool to use: just insert extra columns between the original table columns, and add the SQL you need (i.e. the apostrophes, commas, etc.).
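For anyone who'd rather script the regexp step than do it in an editor, here is a minimal sketch in Python. It assumes the pasted rows are tab-separated with four fields; the sample rows, the table name `results` and the column names are all made up for illustration, so adjust the pattern to whatever the copied text really looks like.

```python
import re

# Hypothetical sample: rows pasted from a results table, tab-separated
# (place, rider, team, finishing time). The values are invented.
PASTED = """1\tRider One\tTeam A\t4:31:12
2\tRider Two\tTeam B\t4:31:15
3\tSean O'Example\tTeam C\t4:31:20"""

# One data row = a number followed by three more tab-separated fields.
ROW = re.compile(r"^(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$")


def sql_quote(value):
    """Wrap a value in single quotes, doubling any embedded apostrophes."""
    return "'" + value.replace("'", "''") + "'"


def to_inserts(text, table="results"):
    """Turn each matching line into an INSERT statement; skip other lines."""
    statements = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # headers, blank lines, anything that isn't a data row
        place, rider, team, finish = match.groups()
        statements.append(
            f"INSERT INTO {table} (place, rider, team, finish_time) "
            f"VALUES ({place}, {sql_quote(rider)}, {sql_quote(team)}, {sql_quote(finish)});"
        )
    return statements


for statement in to_inserts(PASTED):
    print(statement)
```

Doubling the apostrophes matters for rider names like O'Something, which is the bit that usually bites when doing this by hand.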

If you want to learn screen scraping I'd use footie results - they're weekly, and always have the same format. There are also more data to play with, and it's a lot easier to work out what the next match is.


 
Posted : 09/04/2013 10:18 am
Posts: 12079
Full Member
 

There is a scraper plugin for Firefox that works pretty well.

What's it called?


 
Posted : 09/04/2013 10:19 am
Posts: 349
Free Member
 

You could do it in C#, but I suspect that Python would be a lot easier. I wrote a script that scraped all the YouTube URLs for tracks posted in a topic on here, and it was quite easy in Python. I don't have the script anymore though, or I'd send it to you!


 
Posted : 09/04/2013 10:20 am
Posts: 14050
Free Member
 

What's it called?

Outwit Hub

Fairly powerful.


 
Posted : 09/04/2013 11:53 am
Posts: 2086
Free Member
 

Without an API to plug into, scraping would become a laborious task. It would be easy enough (with PHP) to grab the contents of the page, and then use XPath to go through and scrape the data you need.
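To give a rough idea of the XPath step, here is a minimal sketch in Python rather than PHP, using only the standard library. The sample markup, the `results` class name and the column layout are assumptions, not the real UCI page. Note that `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset; real-world HTML usually needs a more forgiving parser (the third-party lxml package is one option).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page layout is not known here.
SAMPLE = """<html><body>
<table class="results">
  <tr><th>Date</th><th>Race</th><th>Winner</th></tr>
  <tr><td>2013-01-01</td><td>Example Classic</td><td>Rider One</td></tr>
  <tr><td>2013-01-08</td><td>Sample Tour</td><td>Rider Two</td></tr>
</table>
</body></html>"""


def extract_rows(markup):
    """Return one list of cell texts per data row of the results table."""
    root = ET.fromstring(markup)
    rows = []
    # Limited XPath: any <table> with class="results", then its direct <tr> children.
    for tr in root.findall(".//table[@class='results']/tr"):
        cells = [td.text or "" for td in tr.findall("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```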

Although you'd be better off targeting this page and modifying the query vars to suit your requirements:

[url= http://www.uci.infostradasports.com/asp/lib/TheASP.asp?PageID=19004&TaalCode=2&StyleID=0&SportID=102&CompetitionID=-1&EditionID=-1&EventID=-1&GenderID=1&ClassID=1&EventPhaseID=0&Phase1ID=0&Phase2ID=0&CompetitionCodeInv=1&PhaseStatusCode=262280&DerivedEventPhaseID=-1&SeasonID=484&StartDateSort=20121004&EndDateSort=20131020&Detail=1&DerivedCompetitionID=-1&S00=-3&S01=2&S02=1&PageNr0=-1&Cache=8 ]http://www.uci.infostradasports.com/asp/lib/TheASP.asp?PageID=19004&TaalCode=2&StyleID=0&SportID=102&CompetitionID=-1&EditionID=-1&EventID=-1&GenderID=1&ClassID=1&EventPhaseID=0&Phase1ID=0&Phase2ID=0&CompetitionCodeInv=1&PhaseStatusCode=262280&DerivedEventPhaseID=-1&SeasonID=484&StartDateSort=20121004&EndDateSort=20131020&Detail=1&DerivedCompetitionID=-1&S00=-3&S01=2&S02=1&PageNr0=-1&Cache=8[/url]
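Modifying the query vars can be scripted too. Below is a minimal sketch in Python using the standard library's `urllib.parse`. The `BASE_URL` is a shortened copy of the link above (whether the server accepts the shortened form is untested), and the override values are purely illustrative: what each ID actually means would need checking against the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Shortened version of the URL quoted above; only a few parameters are kept,
# and it is an assumption that the server would accept this reduced form.
BASE_URL = (
    "http://www.uci.infostradasports.com/asp/lib/TheASP.asp"
    "?PageID=19004&SportID=102&GenderID=1&ClassID=1&SeasonID=484"
    "&StartDateSort=20121004&EndDateSort=20131020"
)


def with_params(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # Keep the existing parameters in order, then apply the overrides.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


# Illustrative values only: the meaning of GenderID=2 and PageNr0=3 is a guess.
print(with_params(BASE_URL, GenderID=2, PageNr0=3))
```

Looping over a list of overrides and fetching each resulting URL would then cover a whole section of the results, ideally with a pause between requests.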


 
Posted : 09/04/2013 12:51 pm
Posts: 3292
Full Member
 

oooh good a programming thread

If you are already in the Microsoft world then use C#.

Ignore the 'xxxx is much better for this' and go with what you know. xxxx might be nice, but C# is just as productive, if not more so. Nowadays there are loads of free-to-use libraries for this sort of stuff. Install the NuGet package manager in your Visual Studio, and use that to search for a NuGet package that can do HTML parsing.


 
Posted : 09/04/2013 1:36 pm
Posts: 12079
Full Member
 

sharkbait - Member
Outwit Hub
Fairly powerful.

Cheers, will take a look!

Edit: not compatible with my version of Firefox 🙁


 
Posted : 09/04/2013 1:59 pm
Posts: 0
Free Member
 

I have a bit of code written in C# which downloads mapping data, so it's certainly possible to do it that way, and once you've downloaded the pages there are other classes available which will help parse the data.


 
Posted : 09/04/2013 2:06 pm
Posts: 0
Free Member
 

Also check out iMacros. It's a plugin for Chrome, Firefox and I think IE too. The syntax is very easy, but it helps to know a little bit about the structure of webpages (i.e. what a div is).


 
Posted : 09/04/2013 2:11 pm