Friday, 3 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for their poor security, NICTA mobile systems research group leader, Aruna Seneviratne, who leads the organisation's Networks Research Group, told CSO Australia that many tracking libraries were inadvertently added when developers incorporated third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

And collect they do. Apps analysed by the team – whose paper 'Early Detection of Spam Mobile Apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information off users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural modelling.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more-clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.

Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.

“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application – which is now under development to be sold to large corporates for developer training and testing.

This article is brought to you by Enex TestLab, content directors for CSO Australia.

Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Wednesday, 24 June 2015

Data Scraping - What Are Hand-Scraped Hardwood Floors and What Are the Benefits?

If you love the look of hardwood flooring with lots of character, then you may want to check out hand-scraped hardwood flooring. Hand-scraped wood provides a warm vintage look, giving the floor instant character. These types of scraped hardwoods are suitable for living rooms, dining rooms, hallways and bedrooms. But what exactly is hand-scraped hardwood flooring?

Well, it is literally what you think it is. Hand-scraped hardwood flooring is created by hand, using specialized woodworking tools to make each board unique and to give the floor an overall "old worn" appearance.

At Innovation Builders we offer solid wood floors finished on site with an actual hand-scraping technique followed by stain and sealer. Solid wood floors are installed by an expert team of technicians who work each board with skilled craftsman-like attention to detail. Following the scraping procedure the floor is stained by hand with a customer selected stain color, and then protected with multiple coats of sealing and finishing polyurethane. This finishing process of staining, sealing and coating the wood floors contributes to providing the look and durability of an old reclaimed wood floor, but with today's tough, urethane finishes.

There are many, many benefits to hand-scraped wood flooring. Overall, these floors are extremely durable and hard wearing, providing years of trouble-free use. These wood floors remain looking newer for longer because the texture that the process provides hides the typical dents, dings and scratches that other floors can't hide so easily. That's great news for households with kids, dogs, and cats.

These types of wood flooring have another unique advantage as well. When you do scratch these floors during their lifetime, the scratches are easily repaired. As long as the scratch isn't too deep you can make them practically disappear without ever having to hire a professional. It's simple to hide the scratch by using a color-matched stain marker or repair kit that is readily available through local flooring distributors. These features make hand-scraped hardwood flooring a lot more durable and hassle-free to maintain than other types of wood flooring.

The expert processes used to create these floors provide a custom look of worn wood with deep color and subtle highlights. When the light hits the wood at different times during the day, it provides an understated but powerful effect of depth and beauty. These floors instantly give your rooms a rustic look full of character, allowing your home to become a warm and inviting environment. The rustic look of this wood provides a texture, style and appeal that cannot be matched by any other type of flooring.

Hand-Scraped Hardwood Flooring is a floor that says welcome and adds a touch of elegance to any home. If you are looking to buy a new home and you haven't had the opportunity to see or feel hand scraped hardwoods, stop in any of the model homes at Innovation Builders in Keller, North Richland Hills or Grand Prairie, Texas and check it out!

Source: http://ezinearticles.com/?What-Are-Hand-Scraped-Hardwood-Floors-and-What-Are-the-Benefits?&id=6026646

Friday, 19 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted-together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs further off the beaten track (cricket scores, anyone?).
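The request-and-response cycle described above can be sketched in a few lines. The course itself works in R; this is a Python illustration only, and it makes several assumptions: the endpoint and the response layout follow the Google Geocoding API as I understand it, `YOUR_KEY` is a placeholder, and `SAMPLE_RESPONSE` is a hand-written stand-in for a real server reply so that nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; check the provider's current documentation.
BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_query(address, api_key="YOUR_KEY"):
    """Paste together the request URL from the base endpoint and query parameters."""
    return BASE_URL + "?" + urlencode({"address": address, "key": api_key})


def extract_location(response_text):
    """Pull latitude and longitude out of a JSON response body."""
    payload = json.loads(response_text)
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample standing in for what the server would return.
SAMPLE_RESPONSE = (
    '{"results": [{"geometry": {"location": {"lat": 51.5, "lng": -0.12}}}],'
    ' "status": "OK"}'
)

if __name__ == "__main__":
    print(build_query("10 Downing St, London"))
    print(extract_location(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing in separate functions makes it easier to point the same pattern at a different provider, where usually only the base URL, the parameter names and the path into the JSON change.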

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Monday, 8 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to deny web scrapers access to their websites by using tools or methods that block certain IP addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer, using a different IP address each time, and extract as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers; that option tends to be quite pricey, but it is arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. Several companies offer this service and claim to delete all web traffic logs, which allows you to anonymously harvest the web with minimal threat of reprisal. Such companies offer large-scale anonymous proxy solutions, but often charge a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implement a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats, often before you could even finish configuring your off-the-shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993

Tuesday, 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file that I had made from the original PDF using an online converting service. Looking back at the code, it is fiendishly complicated and cluttered by the boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply: each member (or Fellow) record spans two lines, with the member name in the leftmost column on the first line and, on the second line, their dates of birth and death, the class of their Fellowship and their election date.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file in which each piece of text is located on the page using spatial coordinates. An individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.
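A minimal Python sketch of selecting columns by the “left” attribute with a small tolerance is shown here. It is not the original scraper: the column positions in `COLUMNS` (other than the 135 seen in the sample line), the `TOLERANCE` value and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical nominal column positions and tolerance; real values depend on the PDF.
COLUMNS = {"name": 135, "details": 170}
TOLERANCE = 5


def column_for(left):
    """Return the column whose nominal position is within TOLERANCE of `left`."""
    for name, position in COLUMNS.items():
        if abs(left - position) <= TOLERANCE:
            return name
    return None


def parse_line(xml_line):
    """Parse one <text> element into (column, top, text)."""
    element = ET.fromstring(xml_line)
    left = int(element.get("left"))
    text = "".join(element.itertext()).strip()
    return column_for(left), int(element.get("top")), text


SAMPLE = (
    '<text top="243" left="135" width="221" height="14" font="2">'
    "Abbot, Charles, 1st Baron Colchester </text>"
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Matching on a range rather than an exact value is what absorbs the slight shift in column positions between the two halves of the document.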

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash some are separated by an en dash which as far as the code is concerned is something completely different and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.
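Coding around quirks of this kind might look something like the following Python sketch. It is an illustration rather than the code from the actual scraper: the list of dash characters, the `CLASS_FIXES` table and the plausible year range are assumptions, seeded only with the examples mentioned above.

```python
import re

# Dash-like characters that should all be treated as a plain hyphen.
DASHES = "\u2012\u2013\u2014\u2212"
DASH_TABLE = str.maketrans({ch: "-" for ch in DASHES})

# Known misspellings mapped to their corrected form (the one example from the text).
CLASS_FIXES = {"Felllow": "Fellow"}


def normalise_dashes(text):
    """Replace en dashes and similar characters with an ordinary hyphen."""
    return text.translate(DASH_TABLE)


def fix_class(value):
    """Correct a known misspelling of a Fellowship class, otherwise pass it through."""
    return CLASS_FIXES.get(value, value)


def suspicious_years(text, earliest=1500, latest=2007):
    """Return standalone 3- or 4-digit numbers that fall outside the plausible range."""
    years = [int(match) for match in re.findall(r"\b\d{3,4}\b", text)]
    return [year for year in years if not earliest <= year <= latest]


if __name__ == "__main__":
    print(normalise_dashes("1757\u20131829"))
    print(fix_class("Felllow"))
    print(suspicious_years("16th March 197"))
```

Flagging an implausible year instead of silently correcting it leaves the decision about the real date, which needed a bit of Googling here, to a human.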

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately so someone may have a birth date described as “circa 1782” or “c 1782”, even more vaguely they may be described as having “flourished 1663-1778” or “fl. 1663-1778”. Python’s default datetime module does not capture this subtlety and if it did the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than text strings means we can query the database using date based queries.
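One way to keep both a queryable date and the original wording is sketched below in Python. The regular expression, the dictionary keys and the function name are assumptions for illustration, not the scraper's actual code; years are rendered as 1st January, as described above, and the raw text is preserved alongside.

```python
import re
from datetime import date

# Optional qualifier, a four-digit year, and an optional second year after a hyphen.
LIFESPAN = re.compile(
    r"^(?P<qualifier>circa|c\.?|flourished|fl\.?)?\s*"
    r"(?P<start>\d{4})(?:\s*-\s*(?P<end>\d{4}))?$",
    re.IGNORECASE,
)

QUALIFIERS = {"c": "circa", "circa": "circa", "fl": "flourished", "flourished": "flourished"}


def parse_lifespan(raw):
    """Return the original text, a normalised qualifier and 1st-January dates."""
    match = LIFESPAN.match(raw.strip())
    if match is None:
        return {"raw": raw, "qualifier": None, "start": None, "end": None}
    word = (match.group("qualifier") or "").lower().rstrip(".")
    start = date(int(match.group("start")), 1, 1)
    end = date(int(match.group("end")), 1, 1) if match.group("end") else None
    return {"raw": raw, "qualifier": QUALIFIERS.get(word), "start": start, "end": end}


if __name__ == "__main__":
    print(parse_lifespan("c 1782"))
    print(parse_lifespan("fl. 1663-1778"))
```

Anything the pattern does not recognise comes back with empty date fields but its raw text intact, so nothing is lost if the parsing rules need refining later.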

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished” and 450 with birth dates “circa” some year.

I can count the number of classes of fellows:

select class, count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society:

select * from `RoyalSocietyFellows` where StartPresident is not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.
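For readers who want to try queries of this shape locally, here is a small Python sketch using the standard sqlite3 module and an in-memory database. Only the table name and the `class` and `StartPresident` columns come from the queries above; the `Name` column, the “Foreign Member” class label and the three rows are placeholders invented so that the queries have something to run against.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "create table RoyalSocietyFellows (Name text, class text, StartPresident text)"
)

# Placeholder rows, invented purely for illustration; they are not real records.
connection.executemany(
    "insert into RoyalSocietyFellows values (?, ?, ?)",
    [
        ("Example A", "Fellow", None),
        ("Example B", "Fellow", "1703-01-01"),
        ("Example C", "Foreign Member", None),
    ],
)

# Count the members in each class.
class_counts = connection.execute(
    "select class, count(*) from RoyalSocietyFellows group by class order by class"
).fetchall()

# List everyone with a presidential start date, most recent first.
presidents = connection.execute(
    "select Name from RoyalSocietyFellows "
    "where StartPresident is not null order by StartPresident desc"
).fetchall()

if __name__ == "__main__":
    print(class_counts)
    print(presidents)
```

Storing the start date as ISO-formatted text keeps `order by` chronological, which is one practical reason for putting proper dates rather than free text into the database.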

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Thursday, 28 May 2015

Web Scraping Services - A trending technique in data science!!!

Web scraping is emerging as a trending technique in data science and is becoming an integral part of many businesses – sometimes whole companies are formed around it. Web scraping and the extraction of relevant data give businesses an insight into market trends, competition, potential customers, business performance and so on. So what actually is web scraping, and where is it used? Let us explore web scraping (also called web data extraction, web mining/data mining or screen scraping) in detail.

What is Web Scraping?

Web data scraping is a technique for extracting unstructured data from websites and transforming it into structured data that can be stored and analyzed in a database. Web scraping is also known as web data extraction, web data scraping, web harvesting or screen scraping.

Whatever you can see on the web can be extracted. Extracting targeted information from websites helps you make effective decisions in your business.

Web scraping is a form of data mining. The overall goal of the web scraping process is to extract information from websites and transform it into an understandable structure such as a spreadsheet, database or CSV file. Data such as item pricing, stock pricing, reports, market pricing, product details and business leads can be gathered via web scraping.
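As a small illustration of turning page markup into a structured format, the Python sketch below reads an HTML table and writes it out as CSV using only the standard library. The class and function names and the sample HTML are made up for the example; real pages are usually messier and often call for a dedicated parsing library.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Convert the table rows found in `html` into CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up sample page fragment.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```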

There are countless uses and potential scenarios, either business oriented or non-profit. Public institutions, companies and organizations, entrepreneurs, professionals etc. generate an enormous amount of information/data every day.

Uses of Web Scraping:

The following are some of the uses of web scraping:

•    Collecting data from real estate listings

•    Collecting data from retailer sites on a daily basis

•    Extracting offers and discounts from a website

•    Scraping job postings

•    Price monitoring with competitors.

•    Gathering leads from online business directories – directory scraping

•    Keywords research

•    Gathering targeted emails for email marketing – email scraping

•    And many more.

There are various techniques used for data gathering as listed below:

•    Human copy-and-paste – takes a lot of time when the data set is large

•    Programming a custom web scraper to suit your needs

•    Using web scraping software available on the market

Are you in search of a web data scraping expert or specialist? Then you are in the right place. We are a team of web scraping experts who can extract data from websites and structure the useful unstructured data to uncover patterns, helping businesses make decisions that increase sales, reach a wider customer base and ultimately lead to growth and success.

We have expertise in all web scraping techniques, including scraping data from complex AJAX-enabled websites, bypassing CAPTCHAs and forming anonymous HTTP requests.

Web scraping is generally considered legal when the data is publicly and freely available on the Web. Smart WebTech can help you achieve your scraping-based project goals. We would be more than happy to hear from you.

Source: http://webdata-scraping.com/web-scraping-trending-technique-in-data-science/

Tuesday, 26 May 2015

Data Extraction Services

Are you finding it tedious to perform your routine tasks as well as finding time to research for some information? Don't worry; all you have to do is outsource data extraction requirements to reliable service providers such as Hi-Tech BPO Services.

We can assist you in finding, extracting, gathering, processing and validating all the required data through our effective data extraction services. We can extract data from any given source such as websites, databases, printed documents, directories, etc.

With a whole plethora of data extraction solutions, we are a one-stop solution for all your data extraction requirements.

For utilizing our data extraction services, all you have to do is outsource data extraction requirements to us, and we will create effective strategies and extract the required data from all preferred sources. Then we will arrange all the extracted data in a systematic order.

Types of data extraction services provided by our data extraction India unit:

The data extraction India unit of Hi-Tech BPO Services can attend to all types of outsource data extraction requirements. Following are just some of the data extraction services we have delivered:

•    Data extraction from websites
•    Data extraction from databases
•    Extraction of data from directories
•    Extracting data from books
•    Data extraction from forms
•    Extracting data from printed materials

Features of Our Data Extraction Services:

•    Reliable collection of resources for data extraction
•    Extensive range of data extraction services
•    Data can be extracted from any available source be it a digital source or a hard copy source
•    Proper researching, extraction, gathering, processing and validation of data
•    Reasonably priced data extraction services
•    Quality and confidentiality ensured through various strict measures

Our data extraction India unit has the competency to handle any of your data extraction services requirements. Just provide us with your specific requirements and we will extract data accordingly from your preferred resources, if particularly specified. Otherwise we will completely rely on our collection of resources for extracting data for you.

Source: http://www.hitechbposervices.com/data-extraction.php

Monday, 25 May 2015

What you need to know about web scraping: How to understand, identify, and sometimes stop

NB: This is a guest article by Rami Essaid, co-founder and CEO of Distil Networks.

Here’s the thing about web scraping in the travel industry: everyone knows it exists but few know the details.

Details like how does web scraping happen and how will I know? Is web scraping just part of doing business online, or can it be stopped? And lastly, if web scraping can be stopped, should it always be stopped?

These questions and the challenge of web scraping are relevant to every player in the travel industry. Travel suppliers, OTAs and meta search sites are all being scraped. We have the data to prove it; over 30% of travel industry website visitors are web scrapers.

Google Analytics, and most other analytics tools do not automatically remove web scraper traffic, also called “bot” traffic, from your reports – so how would you know this non-human and potentially harmful traffic exists? You have to look for it.

This is a good time to note that I am CEO of a bot-blocking company called Distil Networks, and we serve the travel industry as well as digital publishers and eCommerce sites to protect against web scraping and data theft – we’re on a mission to make the web more secure.

So I am admittedly biased, but will do my best to provide an educational account of what we’ve learned to be true about web scraping in travel – and why this is an issue every travel company should at the very least be knowledgeable about.

Overall, I see an alarming lack of awareness around the prevalence of web scraping and bots in travel, and I see confusion around what to do about it. As we talk this through I’ll explain what these “bots” are, how to find them and how to manage them to better protect and leverage your travel business.

What are bots, web scrapers and site indexers? Which are good and which are bad?

The jargon around web scraping is confusing – bots, web scrapers, data extractors, price scrapers, site indexers and more – what’s the difference? Allow me to quickly clarify.

–> Bots: This is a general term that refers to non-human traffic, or robot traffic that is computer generated. Bots are essentially a line of code or a program that is created to perform specific tasks on a large scale.  Bots can include web scrapers, site indexers and fraud bots. Bots can be good or bad.

–> Web Scraper: (web harvesting or web data extraction) is a computer software technique of extracting information from websites (source, Wikipedia). Web scrapers are usually bad.

If your travel website is being scraped, it is most likely that your competitors are collecting competitive intelligence on your prices. Some companies are even built to scrape and report on competitive prices as a service. This is difficult to prove, but based on a recent Distil Networks study, prices seem to be the main target.

One case study is Ryanair. They have been particularly unhappy about web scraping and won a lawsuit against a German company in 2008, incorporated Captcha in 2011 to stop new scrapers, and when Captcha wasn’t totally effective and Cheaptickets was still scraping, they took to the courts once again.

So Ryanair is doing what seems to be a consistent job of fending off web scrapers – at least after the scraping is performed. Unfortunately, the amount of time and energy that goes into identifying and stopping web scraping after the fact is very high, and usually this means the damage has been done.

This type of web scraping is bad because:

    Your competition is likely collecting your price data for competitive intelligence.

    Other travel companies are collecting your flights for resale without your consent.

    Identifying this type of web scraping requires a lot of time and energy, and stopping them generally requires a lot more.

Web scrapers are sometimes good

Sometimes a web scraper is a potential partner in disguise.

Meta search sites like Hipmunk sometimes get their start by scraping travel site data. Once they have enough data and enough traffic to be valuable they go to suppliers and OTAs with a partnership agreement. I’m naming Hipmunk because the company is one of the few to fess up to site scraping, and one of the few who claim to have quickly stopped scraping when asked.

I’d wager that Hipmunk and others use(d) web scraping because it’s easy, and getting a decision maker at a major travel supplier on the phone is not easy, and finding legitimate channels to acquire supplier data is most definitely not easy.

I’m not saying you should allow this type of site scraping – you shouldn’t. But you should acknowledge the opportunity and create a proper channel for data sharing. And when you send your cease and desist notices to tell scrapers to stop their dirty work, also consider including a note for potential partners and indicate proper channels to request data access.

–> Site Indexer: Good.

Google, Bing and other search sites send site indexer bots all over the web to scour and prioritize content. You want to ensure your strategy includes site indexer access. Bing has long indexed travel suppliers and provided inventory links directly in search results, and recently Google has followed suit.

–> Fraud Bot: Always bad.

Fraud bots look for vulnerabilities and take advantage of your systems; these are the pesky and expensive hackers that game websites by falsely filling in forms, clicking ads, and looking for other vulnerabilities on your site. Reviews sections are a common attack vector for these types of bots.

How to identify and block bad bots and web scrapers

Now that you know the difference between good and bad web scrapers and bots, how do you identify them and how do you stop the bad ones? The first thing to do is incorporate bot-identification into your website security program. There are a number of ways to do this.

In-house

When building an in house solution, it is important to understand that fighting off bots is an arms race. Every day web scraping technology evolves and new bots are written. To have an effective solution, you need a dynamic strategy that is always adapting.

When considering in-house solutions, here are a few common tactics:

    CAPTCHAs – Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) exist to ensure that user input has not been generated by a computer. This has been the most common method deployed because it is simple to integrate and can be effective, at least at first. The problem is that CAPTCHAs can be beaten with a little work and, more importantly, they are a nuisance to end users that can lead to a loss of business.

    Rate Limiting – Advanced scraping utilities are very adept at mimicking normal browsing behavior but most hastily written scripts are not. Bots will follow links and make web requests at a much more frequent, and consistent, rate than normal human users. Limiting IPs that make several requests per second would catch basic bot behavior.

    IP Blacklists - Subscribing to lists of known botnets & anonymous proxies and uploading them to your firewall access control list will give you a baseline of protection. A good number of scrapers employ botnets and Tor nodes to hide their true location and identity. Always maintain an active blacklist that contains the IP addresses of known scrapers and botnets as well as Tor nodes.

    Add-on Modules – Many companies already own hardware that offers some layer of security. Now, many of those hardware providers are also offering additional modules to try and combat bot attacks. As many companies move more of their services off premise, leveraging cloud hosting and CDN providers, the market share for this type of solution is shrinking.

    It is also important to note that these types of solutions are a good baseline but should not be expected to stop all bots. After all, this is not the core competency of the hardware you are buying, but a mere plugin.

Some example providers are:

    Imperva SecureSphere – Imperva offers Web Application Firewalls, or WAFs. A WAF is an appliance that applies a set of rules to an HTTP connection. Generally, these rules cover common attacks such as Cross-site Scripting (XSS) and SQL Injection. By customizing the rules to your application, many attacks can be identified and blocked. The effort to perform this customization can be significant, and the rules need to be maintained as the application is modified.

    F5 – ASM – F5 offers many modules on their BIG-IP load balancers, one of which is the ASM. This module adds WAF functionality directly into the load balancer. Additionally, F5 has added policy-based web application security protection.
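
As a rough illustration of the rate limiting tactic described above, the sketch below counts requests per client IP inside a sliding time window. It is a minimal example, not a production defense: the class name `SlidingWindowLimiter`, the default of 5 requests per second, and the explicit `now` timestamp argument are all assumptions made for this example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flag clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` (seconds); return False if over the limit."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a real deployment the timestamp would typically come from a monotonic clock, and a rejected request would be answered with a block or a challenge. As noted above, a counter like this only catches basic bot behavior.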

Software-as-a-service

There are website security software options that include, and sometimes specialize in, web scraping protection. This type of solution, from my perspective, is the most effective path.

The SaaS model allows someone else to manage the problem for you and respond with more efficiency even as new threats evolve.  Again, I’m admittedly biased as I co-founded Distil Networks.

When shopping for a SaaS solution to protect against web scraping, you should consider some of the following factors:

•    Does the provider update new threats and rules in real time?

•    How does the solution block suspected non-human visitors?

•    Which types of proactive blocking techniques, such as code injections, does the provider deploy?

•    Which of the reactive techniques, such as rate limiting, are used?

•    Does the solution look at all of your traffic or a snapshot?

•    Can the solution block bots before they reach your infrastructure – and your data?

•    What kind of latency does this solution introduce?

I hope you now have a clearer understanding of web scraping and why it has become so prevalent in travel, and even more important, what you should do to protect and leverage these occurrences.

Source: http://www.tnooz.com/article/what-you-need-to-know-about-web-scraping-how-to-understand-identify-and-sometimes-stop/

Friday, 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

If you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use an HTTP GET or POST request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from step 1

Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.
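
The five steps listed above can be sketched as a small breadth-first loop. This is a rough illustration under stated assumptions: `crawl`, `fetch`, `parse_links`, and `store` are hypothetical names chosen for this example, and the fetcher is passed in as a function so the loop can be tried out without touching the network.

```python
from collections import deque


def crawl(start_url, fetch, parse_links, store, max_pages=100):
    """Breadth-first crawl: fetch, store, enqueue new links, repeat."""
    queue = deque([start_url])
    seen = {start_url}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        body = fetch(url)               # step 1: request the URL
        if body is None:                # fetch failed, move on to the next URL
            continue
        store(url, body)                # steps 2-3: hand the content off for parsing and storage
        pages += 1
        for link in parse_links(body):  # step 4: enqueue URLs found on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)      # step 5: the loop continues from the queue
    return pages
```

The `seen` set keeps the spider from fetching the same URL twice, and `max_pages` is a simple safety cap so a test run cannot grow without bound.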

Let's look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers make HTTP requests to a URL or a group of URLs, fetch the responses as HTML content, and pass this data to the next module. If you use Python, the following libraries are the most commonly used for this request/response process:

1) urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – Basic Python library, yet a high-level interface for fetching data across the World Wide Web.

2) urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – Extensible library that builds on urllib and handles basic HTTP requests, digest authentication, redirections, cookies and more.

3) requests (Requests: HTTP for Humans) – A much more advanced request library, built on top of basic request handling libraries.
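
A minimal request/response handler might look like the sketch below. It uses `urllib.request` from the Python 3 standard library (where the functionality of urllib2 now lives) so that it has no third-party dependencies. The `USER_AGENT` string and the function names `build_request` and `fetch_html` are assumptions made for this example.

```python
import urllib.request

USER_AGENT = "example-spider/0.1 (+http://example.com/bot)"  # hypothetical identifier


def build_request(url):
    """Create a GET request that identifies the spider via its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Fetch a URL and return the decoded body, or None on a network/HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None
```

Returning `None` on failure keeps the handler simple for the next module; a larger spider would probably want to log the error and retry.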

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured data during this processing. Usually a set of Regular Expressions (regexes), which perform pattern matching and text processing tasks on the HTML data, is used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You need a thorough knowledge of regular expressions so that you can design the regex patterns.
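
As a small illustration of regex-based parsing, the sketch below pulls the page title and the link targets out of raw HTML and returns them as a structured record. The patterns and the function name `parse_page` are assumptions made for this example, and they only cover simple, well-formed markup; real-world HTML often needs more forgiving patterns.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def parse_page(html):
    """Return a structured record (title + links) from raw HTML text."""
    match = TITLE_RE.search(html)
    # Collapse runs of whitespace inside the title into single spaces.
    title = " ".join(match.group(1).split()) if match else None
    links = HREF_RE.findall(html)
    return {"title": title, "links": links}
```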

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module is used to serialize the data according to the data models that you require. This is the final module, and it outputs data in a standard format that can be stored in databases or JSON/CSV files, or passed to a data warehouse for storage. These tasks are usually performed by the libraries listed below:

1) pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo), mysqldb ( on python.org), sqlite3 (sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.
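
For example, a serialization step for JSON and CSV output could be sketched with the standard library as follows. The function names `to_json` and `to_csv` and the record layout are assumptions made for this example.

```python
import csv
import io
import json


def to_json(records):
    """Serialize a list of dicts to a JSON string."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Both functions return text, so the caller can decide whether to write it to a file or pass it on to a database loader.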

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the number of requests in a second and build enough delays in the spiders so that you don’t adversely affect the site.
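
One way to follow both rules is to let the standard library's `urllib.robotparser` decide which URLs are off limits and how long to wait between requests. The sketch below parses a robots.txt supplied as text; the sample rules, the agent name `example-spider`, and the helper names `make_policy` and `polite_delay` are assumptions made for this example.

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def make_policy(robots_text):
    """Build a robots.txt policy object from the file's text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(policy, agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = policy.crawl_delay(agent)
    return default if delay is None else delay
```

A spider would call `policy.can_fetch(agent, url)` before each request and `time.sleep(polite_delay(policy, agent))` between requests.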

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Tuesday, 19 May 2015

Hard-Scraped Hardwood Flooring: Restoration of History

Throughout history, hardwood flooring has undergone dramatic changes, from the meticulous hand-scraped, polished hardwood floors of the majestic plantations of the Deep South to modern-day technology providing maintenance-free wood flooring designed for comfort and appearance. The hand-scraped hardwood floors of the South depicted charm, with the old rustic nature and character that was often associated with this era. Today, hand-scraped hardwood flooring is being revitalized and used in up-scale homes and places of business to restore the old country charm that once faded into oblivion.

As the name implies, hand-scraped flooring involves retexturing the top layer of flooring material by various methods in an attempt to mimic the rustic appearance of the flooring of yesteryear. Depending on the degree of texture required, hand scraping hardwood material is often accomplished by highly skilled craftsmen with specialized tools and years of experience perfecting this procedure. When properly done, hand-scraped hardwood floors add texture, richness and uniqueness not offered in any similar hardwood flooring product.

Rooted in history, these types of floors are available in finished or unfinished surfaces. The majority of the individuals selecting hand-scraped hardwood flooring elect a prefinished floor to reduce costs per square foot in installation and finishing labor charges, allowing for budget guidelines to bend, not break. As expected, hand-scraped flooring is expensive and, depending on the grade and finish selected, can range from $15 to $40 per square foot and beyond for material only. Preparation of the material is labor intensive, adding to the overall cost per square foot dramatically. Recommended professional installation can and often does increase the cost per square foot as well, placing this method of hardwood flooring well out of reach of the average hardwood floor purchaser.

With numerous selections of hand-scraped finishes available, each finish is designed to bring out a different appearance making it a one-of-a-kind work of art. These numerous finish selections include:

• Time worn aged, dark coloring stain application bringing out grain characteristics

• Wire brushed, providing a highlighted "grainy" effect with obvious rough texture

• Hand sculpted, smoother distressed uniform appearance

• French Bleed, staining of edges and side joints with a much darker stain to give a bleeding effect to the wood

• Hand Hewn or Rough Sawn, with visible and noticeable saw marks

Regardless of the selection made, scraped flooring cannot be compared to any other available flooring material based on durability, strength and visual appearance. Limited by only the imagination and creativity, several wood species can be used to create unusual floor patterns, highlighting main focal points of personal libraries and art collections.

The precise process utilized in the creation of scraped floors projects a custom look with deep color and subtle warm highlights. With radiant natural light reflecting off this type of floor, the effect of beauty and depth is radiated in a fashion that fills the room with solitude and serenity encompassing all that enter. Hand-scraped hardwood floors speak of the past, a time of dissent, a time of war and ambiguity towards other races and the bloodshed so that all men could be treated as equals. More than exquisite flooring, hand-scraped hardwood flooring is the restoration of History.

Source: http://ezinearticles.com/?Hard-Scraped-Hardwood-Flooring:-Restoration-of-History&id=6333218

Sunday, 17 May 2015

Introducing ScrapeShield: Discover, Defend & Deter Content Scraping

If you're a publisher, whether an individual blogger or major media outlet, you've undoubtedly experienced content scraping. Search the web for an article you've published or other original content you've created and you find it copied and republished on some other random website. Often the site will be full of ads. And, sometimes, it will even rank higher in search results than your original work.

While you may envision an army of individuals copying and pasting your content on their sites, the truth is content scraping is typically an automated process with bots that grab original content and then republish it without human intervention onto link farm sites. CloudFlare has blocked many of these bots automatically in the past, but we decided it was time to do something to more actively stop them.

Introducing ScrapeShield

ScrapeShield is an app created by the CloudFlare team. It incorporates several existing CloudFlare features like email obfuscation and hotlink protection that serve to protect from content scraping and adds a number of new features as well. Because we believe every publisher of original content should be able to understand and control how their work is used, we're providing ScrapeShield free for every CloudFlare user.

Detect, Defend & Deter

ScrapeShield has different elements to help you detect when your content is scraped, defend your site against content scrapers, and even deter content scrapers from targeting you in the first place. If you enable ScrapeShield, CloudFlare will automatically insert invisible tracking beacons in your content. When automated bots scrape your content, they pull the beacons along with them. CloudFlare detects these beacons when they ping from sites that aren't your own. You can access your ScrapeShield control panel to see where your content is being republished. Not only is this useful in showing scraping, but you can also see users who are reading your content through proxy services like Flipboard or Pulse.

The data from the content beacons is fed back into CloudFlare's protection system. As CloudFlare identifies content scraping bots, we automatically prevent them from accessing your site. Just like Project Honey Pot, the original inspiration for CloudFlare, used traps to detect when spammers were harvesting email addresses, CloudFlare now uses data from ScrapeShield to identify content scrapers and keep them off publishers' sites.

Maze

We didn't want to just stop scrapers from attacking sites on CloudFlare, we also wanted to tie up their resources so they couldn't harm the rest of the web. To do this, we created Maze. Maze routes known content scrapers who are visiting ScrapeShield-protected sites into a virtual labyrinth of gibberish and gobbledygook. We dynamically throttle the bandwidth and speed so instead of the pages loading as fast as possible, the connection is held open to the scrapers and their resources are tied up.

We use excess resources on the CloudFlare network to generate Maze, and it doesn't consume any of our publishers' resources or add any additional load to their sites. What's beautiful about the system is that the only way that content scrapers can be sure they're avoiding Maze is to avoid CloudFlare's IP addresses entirely. For any content scrapers who may be reading this, here's a helpful list of all of our IPs so you can make sure to stay away.

No Pinning

Finally, with the rise of sites like Pinterest, innocent content scraping may become even more prolific. While many sites welcome their images being pinned, we wanted to make it easy to opt out. ScrapeShield includes an option to add the no-pinning meta tag to your site to prevent your images from being pinned to the site. As other similar services include a mechanism to opt out, expect that we'll include an easy way for you to do so right from the ScrapeShield interface.

The health of the web depends on publishers creating original content getting credit for their creations. CloudFlare is committed to building a better web and we're extremely excited about ScrapeShield as a new tool to help publishers do exactly that.

Source: https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/

Wednesday, 6 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-research firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense looks at the bigger picture and blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (e.g. Google, Bing, Yahoo, Ask, etc.), and this list can be populated as needed to fit the needs of your organization.
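
To make the idea behind the session transaction anomaly more concrete, here is a small generic sketch of a baseline check.  It is not how the ASM is implemented: the function name `flag_heavy_sessions`, the use of the median as the baseline, and the factor of 5 are all assumptions made purely for illustration.

```python
from statistics import median


def flag_heavy_sessions(transactions_per_session, factor=5.0):
    """Return session IDs whose transaction count exceeds factor x the median."""
    if not transactions_per_session:
        return set()
    # Baseline derived from the current table of sessions (an assumed choice).
    baseline = median(transactions_per_session.values())
    return {
        session_id
        for session_id, count in transactions_per_session.items()
        if count > factor * baseline
    }
```

With a session table such as `{"a": 10, "b": 12, "c": 9, "d": 400}`, only session `d` stands out against the baseline of the other clients.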

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Tuesday, 28 April 2015

Web Scraping – Effective Way of Improving Market Presence

Web scraping is a technique that is fast making its presence felt on the internet through its sheer effectiveness. It is a technique that uses software to crawl through the internet and gather up all the relevant and important information that one would need for their products.

The information gathered by web scraping can be used for various things such as data integration, web mashups, online price comparison and much more. Web scraping uses sophisticated software that crawls through the internet and gathers up all related information for the entity that you are looking for. The information is gathered in an automated, systematic, and very structured way. This allows for easy understanding of the gathered information. Though this is one of the best ways to extract data, there are quite a few things that one must be aware of before getting into web scraping.

Being aware of the following things puts you in a better position not only to get the best deal, but also to negotiate properly.

•    For data mining, the first thing that one should be very sure of is the kind of data they want. One has to define properly what kind of data they want and also what its purpose would be. For instance, if you wish to get a closer look at your competitors, it would be wise to let the data scraping service providers know who your competitors are. This would allow them to gather better information. Similarly, if you are looking to win new customers, getting contact data from existing players in the respective industry would be helpful.

•    One should also be aware of the structure in which they want the data. A simple data structure has the entity name in the row, and the properties of the entity are kept in the cells of the row. However, one can also opt for a data structure in chart form. Apart from the above, there is just one more thing that one needs to keep in mind while using data mining services: the frequency of data extraction. At times a one-time data extraction would be sufficient, whereas at other times periodic extractions or general reports are required.

If you are aware of all the above points, then you are ready to go ahead and take the help of a website data scraping service. Knowing the above points would allow you to know exactly what to ask of your vendor, and allow the vendor to quote accordingly. One can make the most of data extraction services with the help of either web scraping or web crawling services.

Source: https://3idatascraping.wordpress.com/2014/01/07/web-scraping-effective-way-of-improving-market-presence/

Thursday, 9 April 2015

Some Traps to know and avoid in Web Scraping

In the present day and age, web scraping comes across as a handy tool in the right hands. In essence, web scraping means quickly crawling the web for specific information, using pre-written programs. Scraping efforts are designed to crawl and analyze the data of entire websites, and save the parts that are needed. Many industries have successfully used web scraping to create massive banks of relevant, actionable data that they use on a daily basis to further their business interests and provide better service to customers. This is the age of Big Data, and web scraping is one of the ways in which businesses can tap into this huge data repository and come up with relevant information that aids them in every way.

Web scraping, however, does come with its own share of problems and roadblocks. With every passing day, a growing number of websites are trying to actively minimize the instance of scraping and protect their own data to stay afloat in today’s situation of immense competition. There are several other complications which might arise and several traps that can slow you down during your web scraping pursuits. Knowing about these traps and how to avoid them can be of great help if you want to successfully accomplish your web scraping goals and get the amount of data that you require.

Complications in Web Scraping

Over time, various complications have arisen in the field of web scraping. Many websites have started to get paranoid about data duplication and data security problems and have begun to protect their data in many ways. Some websites do not agree with the moral and ethical implications of web scraping, and do not want their content to be scraped. There are many places where website owners can set traps and roadblocks to slow down or stop web scraping activities. Major search engines also have a system in place to discourage scraping of search engine results. Last but not least, many websites and web services announce a blanket ban on web scraping and say the same in their terms and conditions, potentially leading to legal issues in the event of any scraping.

Here are some of the most common complications that you might face during your web scraping efforts and which you should be particularly aware of –

•    Some locations on the internet might discourage web scraping to prevent data duplication or data theft.

•    Many websites have in place a number of different traps to detect and ban web scraping tools and programs.

•    Certain websites make it clear in their terms and conditions that they consider web scraping an infringement of their privacy and might even consider legal redress.

•    In a number of locations, simple measures are implemented to prevent non-human traffic to websites, making it difficult for web scraping tools to go on collecting data at a fast pace.

To surmount these difficulties, you need a deeper and more insightful understanding of the way web scraping works and also the attitude of website owners towards web scraping efforts. Most major issues can be subverted or quietly avoided if you maintain good working practice during your web scraping efforts and understand the mentality of the people whose sites you are scraping.

Common Problems

With automated scraping, you might face a number of common problems. The behavior of web scraping programs or spiders presents a certain picture to the target website. The website then uses this behavior to distinguish between human users and web scraping spiders. Depending on that information, a website may or may not employ particular web scraping traps to stop your efforts. Some of the commonly employed traps are –

Crawling Pattern Checks – Some websites detect scraping activities by analyzing crawling patterns. Web scraping robots follow a distinct crawling pattern which incorporates repetitive tasks like visiting links and copying content. By carefully analyzing these patterns, websites can determine that they are being caused by a web scraping robot and not a human user, and can take preventive measures.

Honeypots – Some websites have honeypots in their webpages to detect and block web scraping activities. These can be in the form of links that are not visible to human users, being disguised in a certain way. Since your web crawler program does not operate the way a human user does, it can try and scrape information from that link. As a result, the website can detect the scraping effort and block the source IP addresses.

Policies – Some websites make it absolutely clear in their terms and conditions that they are averse to web scraping of their content. This can act as a deterrent and expose you to possible ethical and legal consequences.

Infinite Loops – A website can trick your web scraping program into visiting the same page again and again by generating URLs that look different but lead to the same content.

These traps can prove detrimental to your efforts, and you need to find innovative and effective ways to overcome them. Learning some web crawler tips to avoid traps and using them judiciously is a great way of making sure that your web scraping requirements are met without any hassle.
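As a rough illustration of how a crawler might guard against the infinite loop trap described above, the sketch below reduces each URL to a canonical form and refuses to revisit anything it has already seen. The names `canonicalize`, `Frontier` and `IGNORED_PARAMS`, the list of ignored parameters and the depth limit are all assumptions made for this example, not part of any particular scraping tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url):
    """Reduce a URL to a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so their order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # Lower-case the scheme and host, and discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class Frontier:
    """Remembers visited URLs and caps crawl depth (both limits are assumptions)."""

    def __init__(self, max_depth=5):
        self.seen = set()
        self.max_depth = max_depth

    def should_visit(self, url, depth):
        key = canonicalize(url)
        if depth > self.max_depth or key in self.seen:
            return False
        self.seen.add(key)
        return True
```

A crawler would call `should_visit` before every request; the depth cap is a blunt backstop for traps that generate genuinely new URLs on every page.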

What you can do

The first and foremost rule of thumb about web scraping is that you have to make your efforts as inconspicuous as possible. This way you will not arouse suspicion and negative behavior from your target websites. To this end, you need a well-designed web scraping program with a human touch. Such a program can operate in flexible ways so as to not alert website owners through the usual traffic criteria used to spot scraping tools.

Some of the measures that you can implement to ensure that you steer clear of common web scraping traps are –

•    The first thing that you need to do is to ascertain if a particular website that you are trying to scrape has any particular dislike towards web scraping tools. If you see any indication in their terms and conditions, tread cautiously and stop scraping their website if you receive any notification regarding their lack of approval. Being polite and honest can help you get away with a lot.

•    Try to minimize the load on every single website that you visit for scraping. Putting a high load on a website can alert its owners to your intentions and often causes them to develop a negative attitude. There are many techniques that you can employ to decrease the overall load on a particular website.

•    Start by caching the pages that you have already crawled to ensure that you do not have to load them again.

•    Also store the URLs of crawled pages.

•    Take things slow and do not flood the website with multiple parallel requests that put a strain on their resources.

•    Handle your scraping in gentle phases and take only the content you require.

•    Your scraping spider should be able to diversify its actions, change its crawling pattern and present a polymorphic front to websites, so as not to cause an alarm and put them on the defensive.

•    Arrive at an optimum crawling speed, so as to not tax the resources and bandwidth of the target website. Use auto throttling mechanisms to optimize web traffic and put random breaks in between page requests, with the lowest possible number of concurrent requests that you can work with.

•    Use multiple IP addresses for your scraping efforts, or take advantage of proxy servers and VPN services. This will help to minimize the danger of getting trapped and blacklisted by a website.

•    Be prepared to understand and respect the express wishes and policies of a website regarding web scraping by taking a good look at the target ‘robots.txt’ file. This file contains clear instructions on the exact pages that you are allowed to crawl, and the requisite intervals between page requests. It might also specify that you use a pre-determined user agent identification string that classifies you as a scraping bot. Adhering to these instructions minimizes the chance of getting on the bad side of website owners and risking bans.
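The robots.txt check in the last point can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the `example-scraper` user agent below are invented for illustration; a real crawler would download the file from the target site rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent string for our scraper.
USER_AGENT = "example-scraper"


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our user agent may fetch each URL.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")

# Number of seconds the site asks us to wait between requests, or None if unset.
delay = parser.crawl_delay(USER_AGENT)
```

Here `allowed` is true, `blocked` is false and `delay` is 5, so the crawler would skip the private page and wait five seconds between requests.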

Use an advanced tool for web scraping which can store and check data, URLs and patterns. Whether your web scraping needs are confined to one domain or spread over many, you need to appreciate that many website owners do not take kindly to scraping. The trick here is to ensure that you maintain industry best practices while extracting data from websites. This prevents any incident of misunderstanding, and allows you a clear pathway to most of the data sources that you want to leverage for your requirements.
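Two of the measures above, caching pages you have already crawled and putting random breaks between requests, might look something like the sketch below. `PoliteFetcher` and its parameters are hypothetical names, and the actual download function is passed in by the caller, so nothing here is tied to a specific HTTP library.

```python
import random
import time


class PoliteFetcher:
    """Caches fetched pages and pauses a random interval between real requests."""

    def __init__(self, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
        self.fetch = fetch          # caller-supplied function: url -> page body
        self.min_delay = min_delay  # assumed lower bound for the pause, in seconds
        self.max_delay = max_delay  # assumed upper bound for the pause, in seconds
        self.sleep = sleep
        self.cache = {}

    def get(self, url):
        # Serve repeated requests from the cache instead of hitting the site again.
        if url in self.cache:
            return self.cache[url]
        # Pause before every real request except the very first one.
        if self.cache:
            self.sleep(random.uniform(self.min_delay, self.max_delay))
        body = self.fetch(url)
        self.cache[url] = body
        return body
```

Requests are issued one at a time, which keeps the number of concurrent requests at the minimum; the delay bounds would need tuning for each site.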

Hope this article helps in understanding the different traps and roadblocks that you might face during your web scraping endeavors. This will help you in figuring out smart, sensible ways to work around them and make sure that your experience remains smooth. This way, you can keep receiving the important information that you need with web scraping. Following these basic guidelines can help you avoid getting banned or blacklisted and stay in the good books of website owners. This will allow you to continue with your web scraping activities unencumbered.

Source: https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/

Tuesday, 7 April 2015

Thoughts on scraping SERPs and APIs

Google says that scraping keyword rankings is against their policy from what I've read. Bummer. We compile a lot of reports, and manual finding and entry was a pain. Enter Moz! We still manually check and compare, but it's nice having that tool. I'm confused now though about practices and getting SERPs in an automated way. Here are my questions:

    Is it against policy to get SERPs from an automated method? If that is the case, isn't Moz breaking this policy with its awesome keyword tracker?

    If it's not, and we wanted to grab that kind of data, how would we do it? Right now, Moz's API doesn't offer this data. I thought Raven Tools at one point offered this, but they don't now from what I've read. Are there any APIs out there that we can grab this data from and do what we want with it (let's say build our own dashboard)?

Thanks for any clarification and input!

Source: http://moz.com/community/q/thoughts-on-scraping-serps-and-apis

Friday, 27 March 2015

Web Data Extraction- The most convenient and easy way to extract data from the internet

Web data extraction is an efficient technique that will help you find the pertinent data for your existing business or any personal use. Many times, we find that experts copy and paste information manually from web pages or download the entire website, which is a waste of time and effort.

Now with the new technique of Web data extraction you can crawl through loads and loads of web pages in order to extract particular data and at the same time save this data in the following manner
  •     CSV FILE
  •     XML FILE or
  •     Any other custom format for future use.
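As a small illustration of the first two output formats, the sketch below writes the same records to CSV and to XML using only Python's standard library. The sample records, the element names `items` and `item`, and the helper names `to_csv` and `to_xml` are all made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample of extracted records.
records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "24.50"},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows):
    """Serialize a list of dicts to XML text, one <item> element per record."""
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")
```

Both helpers return text, so the result can be written to a file or handed to whatever custom format step comes next.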

Below are some instances of Web data extraction processes:
  •     Take a government portal, extracting names of citizens for a survey
  •     Search for competitor websites for product pricing and feature information
  •     Utilize web scraping to download images from a stock photography site for website design

How can Web Data Extraction serve you?

You can extract data from almost any kind of website and put it to use in several ways:

  •     Extract data from any kind of website: directories, classified websites, news websites, blogs, articles, job portals, search engines, eCommerce websites, social media websites and any other website whose content is accessible.
  •     Extract emails, contacts, prices/rates, features, contact names, contact details, full text, live updates, ASINs, meta tags, addresses, phone and fax numbers, latitude and longitude, images, links, reviews, ratings, etc.
  •     Support data collection, competitor analysis, research, business intelligence, social media trend analysis, brand monitoring, lead data collection, website and competitor web monitoring, etc.
  •     Deliver data into any database or format: Excel, CSV, Access, text, MySQL, SQL, Oracle, etc.
  •     Provide custom web data extraction services as per client need, with one-time data delivery or continued/scheduled data delivery.

The next is Website Data Scraping:

Website data scraping is the process of extracting data from a website by using a dedicated software program, which should be obtained only from a proven source.

This extracted data can be utilized by any person and for any purposes as per their needs and wants; data extracted can be used in different industries. There are many companies providing best Website data scraping services.

It is one such field which has active developments and also shares a common objective that needs a breakthrough in the following:
  •     Text Processing
  •     Semantic Understanding
  •     Artificial Intelligence
  •     Human Computer Interactions

Many end users, companies and experts need information or data that is accessible only in one format or another. In such cases web data extraction can meet the need by extracting data from any proven source and preserving it at a particular destination.

The source platform contains:
  •     Excel
  •     CSV
  •     MySQL and
  •     Others

Moreover, a web data extractor can also extract information from various websites like Google, Amazon, LinkedIn, eBay and many others.

It can also extract data from eCommerce shopping websites, or other social networking websites, any public websites, classifieds websites, job portal websites and any other search engine websites.

Websitedatascraping.com offers web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.

Tuesday, 24 March 2015

Web Data Extraction Services and Data Collection Form Website Pages

For any business, market research and surveys play a crucial role in strategic decision making. Web scraping and data extraction techniques help you find relevant information and data for your business or personal use. Most of the time professionals manually copy and paste data from web pages or download a whole website, resulting in a waste of time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save this information into a database, CSV file, XML file or any other custom format for future reference.

Examples of web data extraction process include:


• Spider a government portal, extracting names of citizens for a survey

• Crawl competitor websites for product pricing and feature data

• Use web scraping to download images from a stock photography site for website design

Automated Data Collection
Web scraping also allows you to monitor changes in website data over a stipulated period and collect this data automatically on a scheduled basis. Automated data collection helps you discover market trends, determine user behavior and predict how data will change in the near future.

Examples of automated data collection include:


• Monitor price information for select stocks on an hourly basis

• Collect mortgage rates from various financial firms on a daily basis

• Check weather reports on a constant basis as and when required
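One simple way to notice that monitored data has changed between scheduled runs is to keep a fingerprint of each page and compare it with the previous one. The sketch below assumes this hashing approach; `ChangeMonitor` is a hypothetical name, and the scheduling itself (for example an hourly job) is left out.

```python
import hashlib


class ChangeMonitor:
    """Remembers a SHA-256 fingerprint per URL and reports when content changes."""

    def __init__(self):
        self.fingerprints = {}

    def has_changed(self, url, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        previous = self.fingerprints.get(url)
        self.fingerprints[url] = digest
        # The first sighting of a URL is not reported as a change.
        return previous is not None and previous != digest
```

Each scheduled run would fetch the page, pass its text to `has_changed`, and store or report the new data only when the method returns true.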

Using web data extraction services you can mine any data related to your business objective, download them into a spreadsheet so that they can be analyzed and compared with ease.

In this way you get accurate and quicker results saving hundreds of man-hours and money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitors' data, profile data and much more on a consistent basis.

Should you have any queries regarding Web Data extraction services, please feel free to contact us. We would strive to answer each of your queries in detail.

Source:http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Tuesday, 17 March 2015

Safeguarding the Future Through Data Mining

Web scraping can be a powerful tool not only in business and research. In fact, it has the capacity to protect the future by its predicting power. You may find this declaration incredible; but data mining is indeed a tangible way of predicting future events and thus protecting life in the future.

Over thousands of years of existence on earth, humans have gathered enough information and experience to glimpse what is to come. From the cycles of change in the environment, in the universe as a whole and in human behavior, much can be learned and applied.

At least three major things can be determined by careful and diligent data mining. These are: future threats; future trends; and future tactics.

Future threats

According to reports, the US intelligence agencies have been using web extraction as a way of studying the present and past terrorism acts and personages to predict future terrorist events. This has been actively done since the year 2010.

Data is gathered about a known terrorist such as: his activities; his contacts; his routines; the places he frequents; and other related information. These data are analyzed and classified. Any suspicious activities as well as unusual contact are monitored closely. Through these stored data and monitoring processes, any untoward activities can be precluded and preempted. You may say that terrorists can be using data mining too; and that is obviously possible. In this way, web scraping can also be used as a weapon for destruction. There is then a need for the government agencies to be very careful in protecting their data so that the enemies cannot retrieve them.

In the overall picture, you can just imagine how many lives can be saved and how much trauma and damage can be avoided if future terrorist activities are prevented.

Moreover, climate change is another phenomenon that has already been predicted and is beginning to occur nowadays. Scientists have been studying the effects of global warming and environmental degradation through online data too. Many information drives and warnings have been published in scholarly papers and by experts, but many of these have remained unheeded. Now that erratic weather conditions are happening, people can only regret and feel guilty that they are part of the cause of the problem.

However, it is not too late to take action. People can avoid places where abnormal conditions are expected to happen; they can take measures to protect themselves; and they can be informed ahead of time before anything catastrophic happens.

Future trends

In relation to the predictions of possible threats, data extraction can also predict future trends. This is most helpful to businesses, because it helps them produce items and employ strategies that will suit their expected patrons and clients. Since history tends to repeat itself, data gathered in the past and present can bring positive results if studied judiciously and compared intelligently.

Oftentimes, companies that study their own books as well as those of the companies that have gone before them gain knowledge and expertise that puts them ahead of their contemporaries.

Future tactics

Naturally, along with knowing the possible events and trends in the future, strategies and ways to combat threats and cope with trends can also be predicted through web scraping.

Safeguarding the future is no longer a dream or wish. As early as today, experts can create equipment, structures, strategies, and even weapons to prevent any untoward incidents and collateral damage.

Studying the strengths and weaknesses of the past and present plans, procedures, and tools can lead to better technologies and techniques. The future can be a better and safer place if people can learn from the mistakes of the past and go from good to better.

The statement “The best is yet to come” will finally be realized if the data and information collected and analyzed through web scraping are properly managed.

Bright future


Looking at the horizon, one can always expect the sun to shine and bring in a bright day. This same positive expectation for the future is indeed possible. Thanks to data mining, life can be handled more securely and precisely.

It does not mean that humans have become gods. It only proves that a person’s talents and skills, when used properly can make his/her future brighter and more successful. On the other hand, carelessness and lack of sensibilities to other people and the environment can surely bring in future doom.

Everything is laid bare and you are given the chance to handle the present with enough wisdom and capabilities. Although the world is too big to be understood and there is still a huge field of knowledge to be conquered, life can surely go on positively.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/257-safeguarding-the-future-through-data-mining/

Sunday, 15 March 2015

Data Mining Services in various types

Data mining is the act of transforming data into useful information and actionable insight. Often known as Knowledge Discovery in Databases (KDD), data mining is an automated process that uncovers previously unseen information in large quantities of data. After evaluating a range of factors that the human mind cannot easily examine or comprehend, it arrives at actionable insight by means of advanced mathematical algorithms. The resulting data mining reports are then distributed among influencers and stakeholders and used for enterprise-level observations.

The Process of Data Mining


Here’s a lowdown of a few use cases showing how companies are using Data Mining Services in business:

ASSOCIATION: - Catching hold of frequently appearing observations. For instance, if you want to know which products are regularly purchased in pair, and could be offered together in a combo offer to boost sales.

CLASSIFICATION:- Allowing the Data Mining experts at LoginWorks Software to attach observations towards repeated financial patterns of existing groups or categories. For instance, spotting fraudulent transactions or possibly bankrupt companies.

CLUSTERING:- Identifying similarities and common ground between observations and groups. For instance, creating profiles for website users or clients by mapping website usage pattern and customer behavior.

DESCRIPTION:- Detailing out patterns and showcasing them in a visual manner using explanatory analysis.

ESTIMATION:- Revealing features that are difficult to observe directly because of the cost of observation or technical problems.

PREDICTION:- Predicting an estimated future using previous and present observations. For example, predicting sales for the next financial period.
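To make the ASSOCIATION use case above more concrete, here is a minimal sketch that counts how often two products appear in the same purchase. It is plain pair counting rather than a full association-rule algorithm, and the function name `frequent_pairs`, the sample baskets and the `min_count` threshold are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_count=2):
    """Return product pairs that were bought together at least min_count times."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up purchase data.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
```

With this sample data, `frequent_pairs(baskets)` reports only bread and butter, bought together three times, which is the kind of pair that might be offered as a combo.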

What are the Strategic Benefits of Incorporating Data Mining

A Comprehensive suite of Data Mining Services can help your company to:

•    Iron out strategic business problems with the use of number crunching, predictive and inferential analysis.

•    Revitalize your data mining environment by making use of advanced algorithms, artificial neural networks, induction techniques, along with in-data and base-data mining technologies.

•    Automate the detection of business trends, understand human behavior and predict patterns.

•    Do away with the complexities of difficult-to-comprehend statistics, without requiring users to work with complex applications or interfaces. Instead, we deliver compact results in convenient formats such as Excel, CSV, XML, text files and more.

•    Achieve high-end connectivity and communication capabilities.

The Power of LOGINWORKS Data Mining Services.

LOGINWORKS SOFTWARES Data Mining Service is an advanced solution for predictive analytics designed to help companies in their strategic decision making. An ongoing process of discovery and interpretation, data mining unearths new and reliable patterns in your accumulated data, which you can use to answer testing business questions that call for constant prediction and inference. With the ever-increasing complexity of business, as well as the quantity and variety of data, there is a pressing need for methods that are intelligently automated and backed by LOGINWORKS SOFTWARES’ expert support, and data mining fits that need aptly. By and large, predictive data mining services make use of pattern recognition technologies and statistical tools to help accelerate strategic business decisions and lead to more informed conversations with the target audience.

What is offered in our Data Mining Service.

•    First stage of discussion and estimating future direction: If your company would like to gain a competitive edge from our high-calibre Data Mining Services, do get in touch with our sales team at sales@loginworks.com so as to help you in understanding the most advanced benefits and opportunities.

•    Sharing feasibility statistics and studies: If you or your company has a clear view point of how you would want to make use of Data Mining in your flow of business, then do share with us your requirement to ask for a quote.

•    Segmentation and Profitability: Right from assessing the initial assessment to assessing the benefits and completing the data, we’ll share with you a comprehensive report on understanding of data needs.

•    The Final Stage: Data Mining Implementation Service: As soon as the Data Mining requirement is clearly understood, we build customized solutions to collect data in an automated fashion and export structured data into a usable format.

Big data solutions and services


Big data solutions address the ever-increasing volume of data generated day to day and the constant need of enterprises to extract the true business value of such data in a quick turnaround time. Opening the gates to a world of opportunities for new and insightful calculations, Big Data can be generated at a variety of speeds and in a variety of types. This data gives organizations, especially today's burgeoning e-commerce industry, a competitive advantage, where estimated predictions become the bedrock of a constant in-flow of costs and revenue.

LOGINWORKS SOFTWARES BIG DATA ADVANTAGE

Unearth the power of the accumulated data by making significant inroads into the digital revolution of the 21st century. Leverage the advantage by using LOGINWORKS SOFTWARES end-to-end Big Data Solutions and Services. Our passion, backed by years of domain expertise and rich technical prowess, empowers you to outline a Big Data strategy for your business to help uplift your overall IT roadmap, architect and re-imagine your business strategies. With us, you get the following services: Our all-encompassing THINK, DEVELOP AND IMPLEMENT model for Big Data Services helps you pick the best strategies to adopt and use data. Our principal areas of focus for Big Data services are:

•    Big Data Management for the IT Organization

•    Big Data Analytics for the Business Organization

LOGINWORKS SOFTWARES Data Mining Services, also known as Loginworks DataStream, is a perfect amalgamation of unlimited volumes, robust technology and matchless expertise. What sets us apart is our one-of-a-kind personalised approach, which makes use of optimal data warehouse technology. IF YOU ARE READY TO TAKE ADVANTAGE OF DATA MINING AND BOOST YOUR BUSINESS – CONTACT LOGINWORKS SOFTWARES TODAY!

Source: http://www.loginworks.com/data-mining-services-various-type/

Monday, 9 March 2015

4 Types of Mining and Quarry Equipment

You've surveyed the land, spent hours of painstaking research planning the process, and now you are ready to take your mining to the next level. The next phase of the journey will involve the proper equipment. There are several different types of equipment that go into mining. This article will cover some different types of mining equipment to get you started; and if you've already started, it will provide you with valuable knowledge to upgrade your equipment in the future.

Vibrating Feeder


A vibrating feeder is a type of linear direction feeding equipment. Vibrating feeders are widely used in mining and quarry operations and offer several features. Its features include smooth vibrating, dependable operation, and lengthy service life; most importantly, vibrating feeders are suitable for feeding. It consists of a vibrating frame, vibrator, spring, motor vibrating device and motor. The motor causes the vibrating feeder's shafts to rotate, which produces massive power and forces the feeder to vibrate. As the feeder vibrates, materials pass through its funnel and are screened. Common feeder applications include chemicals, minerals, coal, aggregates, and industrial materials.

Permanent and Self Cleaning Magnets


Both permanent and self cleaning magnets can be used on conveyor belts. A permanent magnet is built from a material that is magnetized and creates its own constant magnetic field. Permanent magnetic separators are used on conveyor belts when only limited levels of ferrous metals are expected to pass through. Self cleaning magnets are based on the technology of permanent and electro magnets. The purpose of self cleaning magnetic separators is to automatically remove and separate metallic and non metallic materials.

Forklifts


A forklift is an essential piece of equipment to any mining operation. Industrial forklifts can efficiently move mining supplies and material. Rough terrain forklifts can handle rugged conditions and still manage to perform well. With the standard attachments, a forklift can scoop, grip, pull, lift, scrape, clean, and load almost anything on a mine. Forklifts are extremely versatile and will increase the efficiency of your mining process.

Earthmoving and Construction Equipment

Earthmoving and construction equipment are essential to any mining or quarry operation. This equipment is often used in construction, turf, and excavation work. Examples of earthmoving and construction equipment include: backhoes, bulldozers, and dumpers. A backhoe loader is used for a wide range of jobs including excavating or breaking asphalt. A backhoe loader is a tractor with an arm linking a bucket to the front and a second arm attaching a backhoe in the rear. One benefit of a backhoe loader is that it is fairly small and versatile. A bulldozer is used to push heavy loads like sand from a demolished structure. It is usually much larger than a backhoe loader. A dumper is also designed to carry heavy loads, but these loads are usually in smaller quantities than those of a bulldozer. Backhoes, bulldozers, and dumpers are three types of earthmoving equipment that are critical to the success and efficiency of a mining or construction project.

These are some of the quarry, mining and construction equipment used in Perth and other places, all of which are essential and useful in these processes.

Source: http://ezinearticles.com/?4-Types-of-Mining-and-Quarry-Equipment&id=7383379

Wednesday, 4 March 2015

Why Outsourcing Data Mining Services?

Are huge volumes of raw data waiting to be converted into information that you can use? Your organization's hunt for valuable information ends with valuable data mining, which can help to bring more accuracy and clarity in decision making process.

Nowadays the world is information hungry, and with the Internet offering flexible communication, there is a remarkable flow of data. It is important to make the data available in a readily workable format where it can be of great help to your business. Filtered data is of considerable use to the organization, and efficient data mining services help to increase profits, smooth workflow and reduce overall risks.

Data mining is a process that involves sorting through vast amounts of data and seeking out the pertinent information. In most instances data mining is conducted by professionals, business organizations and financial analysts, although a growing number of fields are finding benefits in using it in their business.

Data mining helps make every decision quick and feasible. The information it yields is used for decision-making in applications relating to direct marketing, e-commerce, customer relationship management, healthcare, scientific tests, telecommunications, financial services and utilities.

Data mining services include:

•    Aggregating data from websites into an Excel database

•    Searching & collecting contact information from websites

•    Using software to extract data from websites

•    Extracting and summarizing stories from news sources

•    Gathering information about competitors' businesses

In this era of globalization, handling important data is becoming a headache for many business verticals. Outsourcing is then a profitable option for your business. Since all projects are customized to suit the exact needs of the customer, huge savings in terms of time, money and infrastructure can be realized.

Advantages of Outsourcing Data Mining Services:

•    Skilled and qualified technical staff who are proficient in English

•    Improved technology scalability

•    Advanced infrastructure resources

•    Quick turnaround time

•    Cost-effective prices

•    Secure Network systems to ensure data safety

•    Increased market coverage

Outsourcing will help you to focus on your core business operations and thus improve overall productivity, so outsourcing data mining has become a wise choice for businesses. Outsourcing these services helps businesses to manage their data effectively, which in turn enables them to achieve higher profits.

This article is courtesy of Flori Lee, an executive at Outsourcing Web Research, which offers a high-quality, time-bound and comprehensive range of data mining services at affordable rates. We specialize in providing data mining services at rates 60% lower.

Source: http://ezinearticles.com/?Why-Outsourcing-Data-Mining-Services?&id=3066061