
[Closed] Website data scraping

Chat Forum

Last Post by Steve77 13 years ago

13 Posts

10 Users

0 Reactions

127 Views

titusrider

Posts: 166

Free Member

Topic starter

So supposing I wanted to gather all the results data sets that are freely provided by the UCI here:
[url= http://www.uci.ch/templates/BUILTIN-NOFRAMES/Template3/layout.asp?MenuId=MjExMg ]UCI[/url]

and download whole sections of the website data at once (ie the whole 2013 Pro Road Season)

Is there any simple way I could do it?
do I need to start writing code?
What code? could I do it in C#?

Cheers

Posted : 09/04/2013 9:12 am

nbt

Posts: 12404

Full Member

I'd start by

1) ensuring that you are allowed to redistribute the results - just because you can access them doesn't mean you are allowed to republish them elsewhere

2) learning about coding and deciding what language to learn, then working out how to do a job in that language, rather than working out what language to learn to do a job

3) use the RSS feed rather than the website

Posted : 09/04/2013 9:21 am

damo2576

Posts: 0

Free Member

http://www.scrapy.org/

Posted : 09/04/2013 9:22 am

titusrider

Posts: 166

Free Member

Topic starter

thx, NBT

not planning on re-distributing just playing with some data vis software

I am an IT pro who mostly works in SQL and a smattering of C#, but web is not my forte. I was just wondering if my C# could be used to do this job (I'm not gonna teach myself a new language just to gather some cool demo data)

An RSS feed would be amazing as I can take those in no problem, but as far as I can find they only have RSS for news, not results

Posted : 09/04/2013 9:28 am

sharkbait

Posts: 14057

Free Member

There is a scraper plugin for Firefox that works pretty well.

Posted : 09/04/2013 10:17 am

mogrim

Posts: 12079

Full Member

If all you want to do is play with the data the easiest way is probably just to copy the text you want into a (decent) text editor, then using regexp change the lines to SQL insert statements. For tabular data like this Excel is often an easy tool to use, just insert extra columns between the original table columns, and add the SQL you need (ie the apostrophes, commas, etc.).
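As a rough illustration of the copy-and-convert approach, here is a minimal Python sketch that turns pasted rows into SQL insert statements. It assumes the copied text comes out tab-separated, one result per line, and it uses a plain split on tabs rather than a regexp; the table name and column names are made up for the example.

```python
def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded apostrophes."""
    return "'" + value.replace("'", "''") + "'"


def rows_to_inserts(text: str, table: str, columns: list[str]) -> list[str]:
    """Build one INSERT statement per tab-separated line of pasted text.

    Blank lines and lines with the wrong number of fields are skipped.
    """
    statements = []
    column_list = ", ".join(columns)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != len(columns):
            continue  # not a data row (heading, footer, stray text)
        values = ", ".join(sql_quote(field) for field in fields)
        statements.append(
            f"INSERT INTO {table} ({column_list}) VALUES ({values});"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical pasted rows: rank, rider, team.
    pasted = "1\tJoe O'Neil\tTeam X\n2\tAnn Lee\tTeam Y"
    for statement in rows_to_inserts(pasted, "results", ["rank", "rider", "team"]):
        print(statement)
```

Every value is quoted as text here, so numeric columns would need casting on the SQL side or a small tweak to the quoting.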

If you want to learn screen scraping I'd use footie results - they're weekly, and always have the same format. There are also more data to play with, and it's a lot easier to work out what the next match is.

Posted : 09/04/2013 10:18 am

mogrim

Posts: 12079

Full Member

There is a scraper plugin for Firefox that works pretty well.

What's it called?

Posted : 09/04/2013 10:19 am

chvck

Posts: 349

Free Member

You could do it in c# but I suspect that python would be a lot easier. I wrote a script that scraped all the youtube urls for tracks posted in a topic on here, was quite easy in python. I don't have the script anymore though or I'd send it to you!

Posted : 09/04/2013 10:20 am

sharkbait

Posts: 14057

Free Member

What's it called?

Outwit Hub

Fairly powerful.

Posted : 09/04/2013 11:53 am

prezet

Posts: 2086

Free Member

Without an API to plug into, scraping would become a laborious task. It would be easy enough (with PHP) to grab the contents of the page and then use XPath to go through and scrape the data you need.

Although you'd be better off targeting this page and modifying the query vars to suit your requirements:

[url= http://www.uci.infostradasports.com/asp/lib/TheASP.asp?PageID=19004&TaalCode=2&StyleID=0&SportID=102&CompetitionID=-1&EditionID=-1&EventID=-1&GenderID=1&ClassID=1&EventPhaseID=0&Phase1ID=0&Phase2ID=0&CompetitionCodeInv=1&PhaseStatusCode=262280&DerivedEventPhaseID=-1&SeasonID=484&StartDateSort=20121004&EndDateSort=20131020&Detail=1&DerivedCompetitionID=-1&S00=-3&S01=2&S02=1&PageNr0=-1&Cache=8 ]http://www.uci.infostradasports.com/asp/lib/TheASP.asp?PageID=19004&TaalCode=2&StyleID=0&SportID=102&CompetitionID=-1&EditionID=-1&EventID=-1&GenderID=1&ClassID=1&EventPhaseID=0&Phase1ID=0&Phase2ID=0&CompetitionCodeInv=1&PhaseStatusCode=262280&DerivedEventPhaseID=-1&SeasonID=484&StartDateSort=20121004&EndDateSort=20131020&Detail=1&DerivedCompetitionID=-1&S00=-3&S01=2&S02=1&PageNr0=-1&Cache=8[/url]
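Modifying the query vars can be scripted rather than edited by hand. The sketch below is a minimal Python example using only the standard library; the parameter names (SeasonID, StartDateSort, EndDateSort) are taken from the link above, but what the server does with other values is an assumption, and the shortened URL in the usage part is only for readability.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url: str, **overrides) -> str:
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # Keep the original parameter order; dicts preserve insertion order.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({name: str(value) for name, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Shortened form of the link above; the real page may need every parameter.
    base = (
        "http://www.uci.infostradasports.com/asp/lib/TheASP.asp"
        "?PageID=19004&SportID=102&SeasonID=484"
        "&StartDateSort=20121004&EndDateSort=20131020"
    )
    # Hypothetical change: narrow the date range within the same season.
    print(with_query(base, StartDateSort=20130101, EndDateSort=20130331))
```

Looping over a list of values and fetching each resulting URL would then give one page per season or date range, assuming the site permits that kind of access.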

Posted : 09/04/2013 12:51 pm

llama

Posts: 3293

Full Member

oooh good a programming thread

If you are already in the msoft world then use C#

Ignore the 'xxxx is much better for this' and go with what you know. xxxxx might be nice, but C# is just as productive, if not more so. Nowadays there are loads of free-to-use libraries for this sort of stuff. Install the NuGet package manager in your Visual Studio, and use that to search for a NuGet package that can do HTML parsing.

Posted : 09/04/2013 1:36 pm

mogrim

Posts: 12079

Full Member

sharkbait - Member
Outwit Hub
Fairly powerful.

Cheers, will take a look!

Edit: not compatible with my version of Firefox 🙁

Posted : 09/04/2013 1:59 pm

aracer

Posts: 0

Free Member

I have a bit of code written in C# which downloads mapping data, so it's certainly possible to do that way, and once you've downloaded there are other classes available which will help parse the data.
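The C# code mentioned here isn't shown, so the following is only a rough Python illustration of the "download, then parse" split, using the standard library's html.parser. It assumes the results sit in an ordinary HTML table (tr/td/th tags), which has not been checked against the actual UCI pages.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html: str) -> list[list[str]]:
    """Parse an HTML string and return its table rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Made-up sample markup standing in for a downloaded page.
    sample = (
        "<table><tr><th>Rank</th><th>Rider</th></tr>"
        "<tr><td>1</td><td><a href='#'>A. Rider</a></td></tr></table>"
    )
    print(extract_rows(sample))
```

The download step could be done with urllib.request.urlopen, passing the decoded page text to extract_rows.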

Posted : 09/04/2013 2:06 pm

Steve77

Posts: 0

Free Member

Also check out iMacros. It's a plugin for Chrome, Firefox and I think IE too. The syntax is very easy, but it helps to know a little bit about the structure of webpages (i.e. what a div is)

Posted : 09/04/2013 2:11 pm
