Monday, 29 December 2014

How to scrape address from Google Maps

If you want to build a new online directory based website and want it to be popular with latest web contents, then you need the help of web scraping services from iWeb scraping. If you want to scrape address from maps.google.com, there is a specialized web scraping tool developed by iWeb scraping which can do the job for you. There are plenty of benefits with web scraping which includes market research, gathering customer information, managing product catalogs, compare prices, gather real estate data, gather job posting information etc. Web scraping technology is very popular nowadays and it saves lot of time and effort involved in manual extraction of data from websites.

The web scraping tools developed iWeb Scraping is very user-friendly and can extract specific information from targeted websites. It converts data from HTML web pages to useful formats like Excel spread sheets or Access database. Whatever web scraping requirements you have, you can contact iWeb Scraping as they have more than 3.5 years of web data extraction experience and offer the best prices in the industry. Also their services are available in 24x7 basis and free pilot projects will be done based on request.

Companies which require specific web data and look for an application which can automate the process and export the HTML data in structured format could benefit greatly from web scraping applications of iWeb scraping. You can easily extract data from multiple target websites, parse and re-assemble the information in HTML format to database or spread sheets as you wish. The application has simple point-and-click user-interface and any beginner can use it scrape address from Google Maps. If you want to gather address of people in particular region from Google maps, you can do it with help of web scraping application developed by iWebscraping.

Web Scraping is a technology that able to digest target website databases that are visible only as HTML web pages, and create a local, identical replica of those databases as a information or result. With our web scraping & web data extraction service we can capture web pages, then pin-point specific pieces of data/information you'd like to extract from web pages. What is needed in this process is much more than a Website crawler and set of Website wrappers. The time required to do web data extraction goes down in comparison to manually data copying and pasting job.

Source:http://www.articlesbase.com/information-technology-articles/how-to-scrape-address-from-google-maps-4683906.html

Saturday, 27 December 2014

Damaged Or Affected Information Providers By Web Scraping Service

Data Scraping Services and computer hardware to grow. How is this possible? It's really simple. Computer systems installed and set in metal boxes and cabinets are a combination of electronic circuit cards. Conductive metal of choice because steel is very strong and affordable. Steel is often plated to prevent oxidation and corrosion.

Galvanizing material of choice because it is still relatively cheap, conductive, and provides a well finished appearance. Many computer enclosures are galvanized rack shelf supports, rails and other structural elements. Data Scraping Services are everywhere, they are not visible? Remember that Data Scraping Services thinner than a human hair and about You are looking for them to find them. Look for them to grow together.

Data Scraping Services exposed bridges and shorts of the circuit is still the potential to wreak havoc on a system. Remain important clues about what happens when the memory bus clock cycles during the installation of the latch is shorted? Maybe the data is corrupted. Perhaps the corruption will be detected and corrected by the error correction algorithms. Affect the data processor is actually an instruction

He logged on to various system disorders - are not logged in or track. If a reset clears the event, problem quickly annoying, but not - as significant is rejected. Often this is not the floor fixed management visibility. If the device must be set and they'll say: "Ask an IT manager ... No, why questions" Ask the operator to reset the equipment needs to be done and they will respond "... Of course, all the time why ask "

So if the Data Scraping Services are everywhere and are instruments to influence how it is not common knowledge? Most users of personal experience or get their information from reliable sources. If personal experience is unforgettable, it's human nature to discount and discard. If a jammed machine reset by filling a cup of coffee is memorable, it is not missed. Popping a diet is unusual and unforgettable. Clicking on the button is not. Data Scraping Services affected or influenced almost all providers.

If the  Services are plentiful, there are no problems?

Research has shown that Data Scraping Services to be reasonably attached to the host surface. Until a certain length, Data Scraping Services rub and rub until they are released by mechanical means such as related. After reaching a certain length, not only freedom from direct mechanical means is possible, but also as a more passive mode of vibration or air flow. Once expelled, Data Scraping Services are free to migrate within the environment.

Data Scraping Services need not be catastrophic failures. Bit errors, soft faults and other defects can be attributed to Data Scraping Services.

What is the treatment for Data Scraping Services?

In general, the accepted treatment to remove Data Scraping Services and is a pure version of the original source material. This tool is not suitable for every bad piece of the place, either a logistical or financial perspective. Does not mean that the problem should be ignored. . Will continue to grow Data Scraping Services. As they are today, they are potentially harmful.

Data Scraping Services through management training, all employees and visitors to the zinc whisker behavior are needed to sign the pledge. The promise Data Scraping Services staff and visitors are forced to treat seriously and will take no action that would aggravate the problem take. Their actions will reflect the best interests of users and reliable computing.

Conclusion

Data Scraping Services are more common than previously believed and accepted. At the same time we can keep up with Data Scraping Services can enjoy fairly reliable operation. But it is important to recognize and manage the situation - not ignore. Living with a chronic infectious disease is a useful model for operations.

Once a surface is the source of zinc whisker, it will always be a source of zinc whisker. Left alone, reliable operation can continue. When the need to interact with the surface, the material does not reveal the need for zinc whisker position.

Source:http://www.articlesbase.com/outsourcing-articles/damaged-or-affected-information-providers-by-web-scraping-service-5549982.html

Monday, 22 December 2014

Scraping table from html web with CloudStat

You need to use the data from internet, but don’t type, you can just extract or scrape them if you know the web URL.

Thanks to XML package from R. It provides amazing readHTMLtable() function.

For a study case,

I want to scrape data:

    US Airline Customer Score.
    World Top Chess Players (Men).

A. Scraping US Airline Customer Score table from

http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

airline = ‘http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines’

airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)

Result:

B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = ‘http://ratings.fide.com/top.phtml?list=men’

chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)

Result:

Done. You had successfully scraping data from any web page with CloudStat.

You can get the full version of this study case (code and result) at Scraping table from html web.

Then, you can analyze as usual! Great! No more retype the data. Enjoy!

Source:http://www.r-bloggers.com/scraping-table-from-html-web-with-cloudstat/

Thursday, 18 December 2014

Extracting Wisdom Teeth Tips

It is believed that due to evolution, our jaws are now smaller than our ancient ancestors'. For this reason, our mouths often do not have adequate room to accommodate the third molars, making them basically useless and in some cases detrimental. Even if they are not impacted, wisdom teeth may be hard to clean, and therefore require removal to reduce the probability of caries and infection.

As part of your routine dental visits, your dentist will likely take X-rays to monitor the development of your third molars. Your dentist will likely recommend removing them as soon as possible to avoid any complications. The extraction of wisdom teeth can sometimes be a costly and daunting procedure; for these reasons many patients delay having them extracted. However, if the impacted teeth become infected, it is important to see your dental professional at once. Symptoms of infection due to impacted wisdom teeth include;

•    Pain in the gums and surrounding areas
•    Red or inflamed gums
•    Tender or bleeding gums
•    Inflammation around the face and jaw
•    Bad breath (halitosis)
•    Frequent headaches

If a single molar needs to be extracted, local anesthetic will be used. In the case where several or all the teeth need extraction, the patient will usually be "put under" using a general anesthetic. If you have an infection or medical complications that put you at a higher than normal risk, the surgery may be performed at a hospital. Extraction of the wisdom teeth is a day surgery, and patients are usually able to return to normal activities in a day or so. You may be prescribed antibiotics prior to the surgery, and you will likely be asked not to eat or drink the night before the surgery.

During the surgery, your dentist makes an incision in the gum tissue covering the tooth. Once the tooth is exposed, the dentist may cut the tooth into smaller pieces to make extraction easier. After the extraction you will be given stitches to mend the gum tissue. You may need to return a few days later to have the stitches removed. You will be monitored after the surgery to ensure that you are not bleeding excessively.

The best time for extraction is when the patient is in their late teens to avoid unnecessary complications. Wisdom teeth extractions performed later in life are still beneficial, but the removal may be more difficult and healing may take longer. Therefore it is wise to have a conversation with your dentist regarding your wisdom teeth as early as possible.

Most people will experience the emergence of their wisdom teeth at some point in their life, and extraction is sometimes necessary as a preventative measure or to fix an actual problem or to prevent problem. It is best to deal with any problems regarding your wisdom teeth as soon as possible to avoid unnecessary difficulties.

Source:http://ezinearticles.com/?Extracting-Wisdom-Teeth-Tips&id=7788863

Monday, 15 December 2014

Autoscraping casts a wider net

We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our capacity to accommodate these users.

Natalia made a screencast to help our new users get started:

It’s also a great introduction to what this service can do.

We released slybot as an open source integration of the scrapely extraction library and the scrapy framework. This is the core technology behind the autoscraping service and we will make it easy to export autoscraping spiders from Scrapinghub  and run them completely with slybot – allowing our users to have the flexibility and freedom provided by open source.

Source:http://blog.scrapinghub.com/2012/02/27/autoscraping-casts-a-wider-net/

Saturday, 13 December 2014

Local ScraperWiki Library

It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use
pip install scraperwiki_local
A dump truck dumping its payload

You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck
Differences

DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values
What happens if you do this?
import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw']
scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:
[u'carrots', u'orange juice', u'chainsaw']
In the local version, it is encoded to JSON, so it looks like this:
["carrots","orange juice","chainsaw"]


And if it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names
SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]: scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
In [3]: scraperwiki.sqlite.save([], {'sHOpPiNg_liST': shopping_list})
In [4]: scraperwiki.sqlite.select('* from swdata')

Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data = scraperwiki.sqlite.select('* from swdata')
In [6]: print([row['shopping_list'] for row in data])
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: print(data[0]['Shopping_List'])
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, scraperwiki.sqlite.select returns a list of ordinary dictionaries. In the local version, scraperwiki.sqlite.select returns a list of special dictionaries that have case-insensitive keys.

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted. For another lot of things, there will be differences and the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!

Source:https://blog.scraperwiki.com/2012/06/local-scraperwiki-library/

Sunday, 30 November 2014

Why scraping and why TheWebMiner?

If you read this blog you are one of two things: you are either interested in web scraping and you have studied this domain for quite a while, or you are just curious about this relatively new field of interest and want to know what it is, how it’s done and especially why. Either way it’s fine!

In case you haven’t googled already this I can tell you that data extraction (or scraping) is a technique in which a computer program extracts data from human-readable output coming from another program (wikipedia). Basically it can collect all the information on a certain subject from certain places. It’s sort of the equivalent of ctrl+f, at the scale of the whole internet. It’s nothing like the search engines that we currently use because it can extract the data in a certain file, as excel, csv (coma separated values) or any other that the buyer wants, and also extracts only the relevant data, only the values that you are interested in.

I hope now that you understand the concept and you are wondering just why would you need such data. Well let’s take the example of an online store, pretty common nowadays, and of course the manager just like any manager wants his business to thrive, so, for that he has to keep up with the other online stores. Now the web scraping takes place: it is very useful for him to have, saved as excels all the competitor’s prices of certain products if not all of them. By this he can maintain a fair pricing policy and always be ahead of his competitors by knowing all of their prices and fluctuations.  Of course the data collecting can also be done manually but this is not effective because we are talking of thousand of products each one having its own page and so on. This is only one example of situation in which scrapping is useful but there are hundreds and each one of them it’s profitable for the company.

By now I’ve talked about what it is and why you should be interested in it, from now on I’m going to explain why you should use thewebminer.com. First of all, it’s easy: you only have to specify what type of data you want and from where and we’ll manage the rest. Throughout the project you will receive first of all an approximation of price, followed by a time approximation. All the time you will be in contact with us so you can find out at any point what is the state of your project. The pricing policy is reasonable and depends on factors like the project size or complexity. For very big projects a discount may be applicable so the total cost be within reason.

Now I believe that thewebminer.com is able to manage with any kind of situation or requirement from users all over the world and to convince you, free samples are available at any project you may have or any uncertainty or doubt.

Source:http://thewebminer.com/blog/2013/07/

Wednesday, 26 November 2014

Screen scrapers: To program or to purchase?

Companies today use screen scraping tools for a variety of purposes, including collecting competitive information, capturing product specs, moving data between legacy and new systems, and keeping inventory or price lists accurate.

Because of their popularity and reputation as being extremely efficient tools for quickly gathering applicable display data, screen scraping tools or browser add-ons are a dime a dozen: some free, some low cost, and some part of a larger solution. Alternatively, you can build your own if you are (or know) a programming whiz. Each tool has its potential pros and cons, however, to keep in mind as you determine which type of tool would best fit your business need.

Program-your-own screen scraper

Pros:

    Using in-house resources doesn't require additional budget

Cons:

    Properly creating scripts to automate screen scraping can take a significant amount of time initially, and continues to take time in order to maintain the process. If, for instance, objects from which you're gathering data move on a web page, the entire process will either need to be re-automated, or someone with programming acumen will have to edit the script every time there is a change.

    It's questionable whether or not this method actually saves time and resources

Free or cheap scrapers

Pros:

    Here again, budget doesn't ever enter the picture, and you can drive the process yourself.

    Some tools take care of at least some of the programming heavy lifting required to screen scrape effectively

Cons:

    Many inexpensive screen scrapers require that you get up to speed on their programming language—a time-consuming process that negates the idea of efficiency that prompted the purchase.

Screen scraping as part of a full automation solution

Pros:

    In the amount of time it takes to perform one data extraction task, you have a completely composed script that the system writes for you

    It's the easiest to use out of all of the options

    Screen scraping is only part of the package; you can leverage automation software to automate nearly any task or process including tasks in Windows, Excel automation, IT processes like uploads, backups, and integrations, and business processes like invoice processing.

    You're likely to get buy-in for other automation projects (and visibility for the efficiency you're introducing to the organization) if you pick a solution with a clear and scalable business purpose, not simply a tool to accomplish a single task.

Cons:

    This option has the highest price tag because of its comprehensive capabilities.

Looking for more information?

Here are some options to dig deeper into screen scraping, and deciding on the right tool for you:

 Watch a couple demos of what screen scraping looks like with an automation solution driving the process.

 Read our web data extraction guide for a complete overview.

 Try screen scraping today by downloading a free trial.

Source: https://www.automationanywhere.com/screen-scrapers

Wednesday, 19 November 2014

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data, our campaigns, and reports for our clients. At the lowest level we utilize scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, even keep a track of links on websites to know when it’s completed its lifespan. Then we’ve used them to help us aggregate data from APIs, RSS feeds, and websites to conduct some of our data mining to find patterns to help us become more competitive. 

So scraping is a function majority of companies (SEOmoz, Raventools, and Google) have to do to either save money, protect intellectual property, track trends, etc… Businesses can find infinite uses with scraping tools, it just depends if you’re an printed circuit board manufacturer looking for ideas on your e-mail marketing campaign or a Orange County based business trying to keep an eye out on the competition. which is why we’ve created a comprehensive list of open source scrapers out there to help all the businesses out there. Just keep in mind we haven’t used all of them!

Words of caution, web scrapers require knowledge specific to the language such as PHP & cURL. Take into considerations issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoap
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also we’re looking for guides related to each of these, so if you know of any or would be interested in guesting blogging about one, let us know!

Source:http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

Monday, 17 November 2014

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since been a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen or so years they continued their careers paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

Screen Shot 2014-02-18 at 4.29.05 PM

Screen Shot 2014-02-18 at 4.29.27 PM

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.

Screen Shot 2014-02-18 at 4.29.50 PM

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand*. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source:http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Saturday, 15 November 2014

Building Java Object Graph with Tour de France results – using screen scraping, java.util.Parser and assorted facilities

Last Saturday, the Tour de France 2011 departed. For people like myself, enjoying sports and working on Data Visualizations on the one hand and far fetched uses of SQL on the other, the Tour de France offers a wealth of data to work with: rankings for each stage in various categories, nationalities and teams to group by, distances and velocity, years to compare with one another and the like. So it has been my intention for some time to get hold of that data in a format I could work with.

Today I finally found some time to get it done. To locate the statistics for the Tour de France editions for the last few years and get them onto my laptop and into my database. This article describes the first part of that journey: how to get the stage results from some source on the internet into my locally running Java program in an appropriate object structure.

My starting point is the official Tour de France website:

Image

This website goes back to 2007 and also has the latest (2011) results. It presents the result in a format pleasing to the human eye – based on an HTML structure that is fairly pleasing to my groping Java code as well.

Analyzing the source of the Tour de France data

I start my explorations in Firefox, using the Firebug plugin. When I select the tab with the results for a particular stage, I inspect the (AJAX) call that is made to retrieve the stage results into the browser:

Image

The URL that was accessed is www.letour.fr/2010/TDF/LIVE/us/700/classement/ITE.html . When I access that URL directly, I see an HTML fragment with the individual ranking for the 7th stage in 2010. It turns out that with ITG instead of ITE in this URL, I get the overall ranking after the 7th Stage. Using IME in stead of ITE, I get the 7th stage’s climbers’ standing. And so on.

The HTML associated with the stage standing looks like this:

Image

Which is not as user friendly as the corresponding display in the browser:

Image

but still fairly well structured and programmatically interpretable.

Retrieving HTML fragments and parsing in Java

Consuming these HTML fragments with stage standings into my own Java code is very easy. Parsing the data and turning it into sensible Java Objects is slightly more work, but still quite feasible. From the Java Objects I next need to create a persistent storage for the data – that is the subject for another article.

Using the Java URL class and its openStream method to open an InputStream on whatever content can be found at the URL, it is dead easy to start reading the HTML from the Tour de France website into my Java program. I make use of the java.util.Scanner class to work my way through the HTML by Table Row (TR element). When you inspect the HTML fragments, it is clear early on that every individual rider’s entry corresponds with a TR element, so it seems only logical to have the Scanner break up the data by TR.

private static Stage processStage(int year, int stageSequence, Map<Integer, Rider> riders) throws java.io.IOException, java.net.MalformedURLException {

    String typeOfStanding = "ITE";
     URL stageStanding = new URL("http://www.letour.fr/"+year+"/TDF/LIVE/us/"
                                +(stageSequence==0?"0":stageSequence+"00") +
                                "/classement/"+typeOfStanding+".html");
    InputStream stream = stageStanding.openStream();
    Scanner scanner = new Scanner(stream);
    scanner.useDelimiter("</tr>");
    Stage stage = new Stage();
    stage.setSequence(stageSequence);
    boolean first = true;
    boolean firstStanding = true;
    while (scanner.hasNext()) {
        String entry = scanner.next();
        if (first) {
            first = false;
            Matcher regexMatcher = regexDistance.matcher(entry);
            if (regexMatcher.find()) {
                String distanceString = regexMatcher.group();
                stage.setTotalDistance(Float.parseFloat(distanceString.substring(0, distanceString.length() - 3)));
            }
        }
        if (!first) {
            String[] els = entry.split("/td>");
            if (els.length > 1) { // only the standing-entries have more than one td element
                Integer riderNumber = Integer.parseInt(extractValue(els[2]));

                Rider rider=null;
                if (riders.containsKey(riderNumber)) {
                    rider = riders.get(riderNumber);
                }
                else {
                    rider = new Rider(extractValue(els[1]),riderNumber, extractValue(els[3]));
                    riders.put(riderNumber,rider);
                }
                Standing standing =
                    new Standing(firstStanding ? 1 : (Integer.parseInt(extractValue(els[0]).replace(".", ""))),
                                  rider,extractValue(els[4]),
                                  extractValue(els[5]));
                firstStanding = false;
                stage.getStandings().add(standing);                }
        }
    } //while
    scanner.close();
    return stage;
}

Subsequently, the TR elements need to be broken up in the TD cell elements that contain the rank, rider’s name, their number, the team they ride for and the time for the stage as well as their lag with regard to the winner. I have used a simple split (on /td>) to extract the cells. The final logic for pulling the correct value from the cell is in the method extractValue. Note: this code is not very pretty, and I am not necessarily overly proud of it. On the other hand: it is one-time-use-only code and it is still fairly compact and easy to write and read.

private static String extractValue(String el) {
    String r = el.split("</")[0];
    if (r.lastIndexOf(">") > 0) {
        r = r.substring(r.lastIndexOf(">") + 1);
    }
    return r.split("<")[0];
}

I have created a few domain classes: Rider, Stage, Standing (as well as Tour) that are a business domain like representation of the Tour de France result data. Objects based on these classes are instantiated in the processStage method that is being invoked from the processTour method.

public static void processTour(Tour tour) throws IOException, MalformedURLException {
    if (tour.isPrologue())
      tour.getStages().add(processStage(tour.getYear(),0, tour.getRiders()));

    for (int i=1;i<= tour.getNumberOfStages();i++)  {
        tour.getStages().add(processStage(tour.getYear(),i, tour.getRiders()));
    }
}

When I run the TourManager class – a class that create a single Tour object for the Tour de France in 2010 –

public class TourManager {
     List<Tour> tours = new ArrayList<Tour>();
     public TourManager() {
        tours.add(new Tour(2010, 20, true));
        try {
            ProcessTourStandings.processTour(tours.get(0));
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
     public static void main(String[] args) {
        TourManager tm = new TourManager();
        for (Tour tour : tm.getTours()) {
            for (Stage stage : tour.getStages()) {
                System.out.println("================ Stage " + stage.getSequence() + "(" + stage.getTotalDistance() +
                                   " km)");
                for (Standing standing : stage.getStandings()) {
                    if (standing.getRank() < 4) {
                        System.out.println(standing.getRank() + "." + standing.getRider().getName());
                    }
                }
            }
        }
    }

it will print the top 3 in every stage:

Image

Source:http://technology.amis.nl/2011/07/04/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities/

Wednesday, 12 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.

Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to ‘leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", <strong>"href"</strong>)

The addition of the "href" part of the formula will extract the output of the href attribute of the atag. In Lehman terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2 here, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the words ‘Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:

Google+:

=XPathOnUrl(<strong>INSERTGOOGLEPROFILEURL</strong>,"//span[@class='BOfSxb']")

LinkedIn:

=XPathOnUrl(<strong>INSERTLINKEDINURL</strong>,"//dd[@class='overview-connections']/p/strong")

4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:

=XPathOnUrl(A2,"//title")

To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:

=CountWords(A2)

From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:

Twitter:

=TwitterCount(<strong>INSERTURLHERE</strong>)

Facebook:

=FacebookLikes(<strong>INSERTURLHERE</strong>)

Google+:

=GooglePlusCount(<strong>INSERTURLHERE</strong>)

Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):

=CONCATENATE("http://api.sharedcount.com/?url=",A2)

You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contaiins the formula above):

StumbleUpon:

=JsonPathOnUrl(B2,"StumbleUpon")

Reddit:

=JsonPathOnUrl(B2,"Reddit")

Delicious:

=JsonPathOnUrl(B2,"Delicious")

Digg:

=JsonPathOnUrl(B2,"Diggs")

Pinterest:

=JsonPathOnUrl(B2,"Pinterest")

LinkedIn:

=JsonPathOnUrl(B2,"Linkedin")

Facebook Shares:

=JsonPathOnUrl(B2,"Facebook.share_count")

Facebook Comments:

=JsonPathOnUrl(B2,"Facebook.comment_count")

Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:

=XPathOnUrl(<strong>INSERTARTICLEURL</strong>,"//*[@class='dateline']/text()")

Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:

    http://findmyblogway.com/scraping-communities-with-xpath/

    http://builtvisible.com/data-entry-is-a-waste-of-time/

    http://www.seotakeaways.com/data-scraping-guide-for-seo/

    http://okdork.com/2014/04/30/the-step-by-step-guide-to-10x-growth-for-any-blog/

TL;DR

    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.

Source:http://moz.com/blog/a-content-marketers-guide-to-data-scraping

Monday, 10 November 2014

Review: import.io’s New Scraping Process and Features

Web scraping Data platform import.io, announced last week that they have secured $3M in funding from investors that include the founders of Yahoo! and MySQL.

They also released a new beta version of the tool that is essentially a better version of their extraction tool, with some new features and a much cleaner and faster user experience.

First Impression
I’ve used the tool for a week and can say it is an improvement over the old version – which was a bit bulky and awkward. While still not exactly the most intuitive process, the development team at import.io has managed to slim down what was a relatively button heavy process, without sacrificing any of the functionality – they made the new workflow both simpler and more complicated at the same.

The new version features a simple tool bar across the top as opposed to the space hogging table and wizard from before, which is a large improvement on the pink and white of the previous version.

True, the loss of the wizard means there isn’t as much guidance as before (the pop-up help only appears on the first use), but the undo button means you don’t really need it. You can click around and experiment a bit with the different extraction options before settling down to do some real work.

Data Extraction

Once you’ve figured out how it works, the new version requires far fewer mouse clicks to get from the page to a table of data/API as shown in their homepage video.

All you need to do now is navigate to a website, click a single piece of data on the page – such as price, image, or URL – and their app will find all the other examples of similar data on the website, immediately creating a structured table of data.

download2

This latest version of the extractor also includes a important new feature labeled “Suggest Data”. Its important because it lets you extract all the data from a page, instantly creating a table of data that can be published as an API. This makes import.io very exciting and quick, I spent a long time playing with this and it worked on the majority of sites.

Advanced Features

Most non-programmer web scrapers struggle with complex sites that use JavaScript or iFrames, but import.io also now deals with this. In the basic mode you can toggle JavaScript and CSS on and off to help you see your data better.

If that doesn’t work, you can switch into an ‘advanced mode’ where import allows you to write your own XPath and RegExp. They’ve also added a source code view, though without the ability to click on the site and inspect element (like in Chrome) this feature isn’t particularly useful.

API Integration

Once you’ve created your scraper, there are a number of options for what you can do with it.

If you’ want you can just copy and paste the data into a spreadsheet or Download as CSV. You can also push your data directly Google Sheets, with import.io’s self generated formula.

For the rest of us, they have surfaced both the POST and GET requests for you and given you a JSON view which allows you to see how the data is returned, which is handy.

All this functionality is nice, and it’s clear they’re trying to cater to all technical levels, but it has made the API page somewhat messy and potentially confusing for newer or less technical users, but they should be able to get what they need.

Good with lots of Potential

Their new tool certainly isn’t perfect. There are still a few sites where manual row training is required and you can’t access the authentication feature (though you can still do this in the old version) or pagination.

Even if it’s not quite there yet, if import.io continue like this, they are well on its way to becoming the best data scraping platform on the market. Especially when you consider the “free for life” price tag.

Source:http://scraping.pro/review-import-ios-new-scraping-tools-features/

Wednesday, 5 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotions, testimonials, and evidences of the benefits of online data collection, still only a handful take the challenge and really gain the pay offs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Monday, 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create xpath expression of your choice to work with.


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log        
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday, 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a

(very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to

scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being

processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome

Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads

the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this

clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our

case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the

page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the

devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to

fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of

confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-

3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_commen

ts.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_com

ment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object

(looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you

need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest

that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what

happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So

now we understand how to use offset

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that

from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch

the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article

itself)

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to

/get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the

request, and try to emulate that (though you can probably figure it out yourself)

    You might need to overcome some security measures. For example, you might need a session-key from the original

article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't

trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in

case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the

html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task

either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help

you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday, 6 September 2014

A good web data extraction/screen scraper program?


I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there's also IE versions), but there are also more full featured application and "server" versions that have more features and ability to do thing in an unattended manner.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.


You can use irobot from IRobotSoft, which is totally free, and provides more functionalityies than other paid software. Watch demos here http://irobotsoft.com/help/ for how simple it is.

Questions on their forum were answered very quickly.


Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday, 5 September 2014

How to login to website and extract data using PHP [closed]

I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.


SOurce: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.



Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.


Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Thursday, 4 September 2014

Data Scraping from PDF and Excel


I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.



Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java

environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF

files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction

purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three

sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday, 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday, 2 September 2014

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the

mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use

Firebug to see how it is loaded.



You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions

(comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area

depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and

(b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is

illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story

altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my

Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act.

Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like

that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in

Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per

second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach.

With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to

chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a

scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering

Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do

something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and

penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a

defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape.

They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user

of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't

really stealing if it is a government site that offers public data especially for download but is if you download all

or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the

content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a

legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no

agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair

game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a

million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not

entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be

known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get

banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use