Citysquares.com Data Scraping: September 2013

Saturday, 28 September 2013

Visual Web Ripper: Using External Input Data Sources

Sometimes it is necessary to use external data sources to provide parameters for the scraping process. For example, you have a database with a bunch of ASINs and you need to scrape all product information for each one of them. As far as Visual Web Ripper is concerned, an input data source can be used to provide a list of input values to a data extraction project. A data extraction project will be run once for each row of input values.

An input data source is normally used in one of these scenarios:

    To provide a list of input values for a web form
    To provide a list of start URLs
    To provide input values for Fixed Value elements
    To provide input values for scripts

Visual Web Ripper supports the following input data sources:

    SQL Server Database
    MySQL Database
    OleDB Database
    CSV File
    Script (A script can be used to provide data from almost any data source)

To see it in action you can download a sample project that uses an input CSV file with Amazon ASIN codes to generate Amazon start URLs and extract some product data. Place both the project file and the input CSV file in the default Visual Web Ripper project folder (My Documents\Visual Web Ripper\Projects).
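Outside of Visual Web Ripper, the same idea can be sketched in a few lines of Python. This is only an illustration of the "one run per input row" principle, not the tool's own mechanism. The column name `asin`, the sample ASIN values and the URL pattern are all assumptions made for the example.

```python
import csv
import io

# Assumed URL pattern for a product page; adjust it to the target site.
URL_TEMPLATE = "http://www.amazon.com/gp/product/{asin}"


def start_urls_from_csv(csv_file, column="asin"):
    """Return one start URL per non-empty value in the given CSV column."""
    reader = csv.DictReader(csv_file)
    urls = []
    for row in reader:
        asin = (row.get(column) or "").strip()
        if asin:  # skip rows where the column is blank
            urls.append(URL_TEMPLATE.format(asin=asin))
    return urls


# Hypothetical input data; in practice this would be something like
# open("asins.csv", newline="") instead of an in-memory string.
sample = io.StringIO("asin\nB000000001\n\nB000000002\n")
for url in start_urls_from_csv(sample):
    print(url)
```

Each generated URL plays the role of one row of input values: the extraction logic would then be run once per URL.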

For further information, see the manual topic that explains how to use an input data source to generate start URLs.

Source: http://extract-web-data.com/visual-web-ripper-using-external-input-data-sources/

Thursday, 26 September 2013

Using External Input Data in Off-the-shelf Web Scrapers

There is a question I have long wanted to shed some light on: “What if I need to scrape several URLs based on data in some external database?”

For example, recently one of our visitors asked a very good question (thanks, Ed):

“I have a large list of amazon.com asin. I would like to scrape 10 or so fields for each asin. Is there any web scraping software available that can read each asin from a database and form the destination url to be scraped like http://www.amazon.com/gp/product/{asin} and scrape the data?”

This question prompted me to investigate the matter. I contacted several web scraper developers, and they kindly provided detailed answers, which I summarize below.

Visual Web Ripper

An input data source can be used to provide a list of input values to a data extraction project. A data extraction project will be run once for each row of input values. You can find additional information here.

Web Content Extractor

You can use the -at"filename" command line option to add new URLs from a TXT or CSV file:

WCExtractor.exe projectfile -at"filename" -s

projectfile – the file name of the project (*.wcepr) to open
filename – the file name of the CSV or TXT file that contains URLs separated by newlines
-s – starts the extraction process
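
The URL file itself can be produced by any script. Here is a minimal Python sketch, assuming only what is stated above (a plain text file with one URL per line). The helper name `write_url_list`, the removal of duplicates and the sample URL are my own additions for illustration.

```python
from pathlib import Path


def write_url_list(urls, path):
    """Write URLs to a text file, one per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    import tempfile

    # Hypothetical target file and URL, used only to demonstrate the call.
    target = Path(tempfile.gettempdir()) / "urls.txt"
    count = write_url_list(["http://www.amazon.com/gp/product/B000000001"], target)
    print(f"wrote {count} URL(s) to {target}")
```

The resulting file is what you would pass as filename in the command line above.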

You can find some options and examples here.

Mozenda

Since Mozenda is cloud-based, the external data needs to be loaded up into the user’s Mozenda account. That data can then be easily used as part of the data extracting process. You can construct URLs, search for strings that match your inputs, or carry through several data fields from an input collection and add data to it as part of your output. The easiest way to get input data from an external source is to use the API to populate data into a Mozenda collection (in the user’s account). You can also input data in the Mozenda web console by importing a .csv file or importing one through our agent building tool.

Once the data is loaded into the cloud, you simply initiate building a Mozenda web agent and refer to that Data list. By using the Load page action and the variable from the inputs, you can construct a URL like http://www.amazon.com/gp/product/%asin%.

Helium Scraper

Here is a video showing how to do this with Helium Scraper:

The video shows how to use the input data as URLs and as search terms. There are many other ways you could use this data, way too many to fit in a video. Also, if you know SQL, you could run a query to get the data directly from an external MS Access database, like this:

SELECT * FROM [MyTable] IN "C:\MyDatabase.mdb"

Note that the database needs to be a “.mdb” file.

WebSundew Data Extractor

WebSundew allows using input data from external data sources: a CSV file, an Excel file or a database (MySQL, MSSQL, etc.). Here you can see how to do this in the case of an external file, but you can do it with a database in a similar way (you just need to write an SQL script that returns the necessary data).

In addition to passing URLs from the external sources, you can pass other input parameters as well (input fields, for example).

Screen Scraper

Screen Scraper is really designed to be interoperable with all sorts of databases. We have composed a separate article where you can find a tutorial and a sample project about scraping Amazon products based on a list of their ASINs.

Source: http://extract-web-data.com/using-external-input-data-in-off-the-shelf-web-scrapers/

Wednesday, 25 September 2013

A simple way to turn a website into JSON

Recently, while surfing the web, I stumbled upon a simple web scraping service named Web Scrape Master. It is a kind of RESTful web service that extracts data from a specified web site and returns it to you in JSON format.

How it works

Though I don’t know what this service may be useful for, I still like its simplicity: all you need to do is to make an HTTP GET request, passing all necessary parameters in the query string:
http://webscrapemaster.com/api/?url={url}&xpath={xpath}&attr={attr}&callback={callback}

    url - the URL of the website you want to scrape
    xpath - the XPath expression that selects the data you need to extract
    attr - the name of the attribute whose value you need (optional)
    callback - the JSON callback function (optional)

For example, for the following request to our testing ground:

http://webscrapemaster.com/api/?url=http://testing-ground.extract-web-data.com/blocks&xpath=//div[@id=case1]/div[1]/span[1]/div

You will get the following response:

[{"text":"<div class='name'>Dell Latitude D610-1.73 Laptop Wireless Computer</div>","attrs":{"class":"name"}}]
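
For illustration, here is a minimal Python sketch of how such a request could be assembled and how the response could be read. It percent-encodes the parameter values, which is ordinary practice for query strings, although whether the service requires it is an assumption on my part (the example request above passes them unencoded). The function names are hypothetical, and the sample response is the one quoted above rather than the result of a live call.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "http://webscrapemaster.com/api/"


def build_request_url(url, xpath, attr=None, callback=None):
    """Build the GET request URL, percent-encoding each parameter value."""
    params = {"url": url, "xpath": xpath}
    if attr:
        params["attr"] = attr
    if callback:
        params["callback"] = callback
    return API_ENDPOINT + "?" + urlencode(params)


def extract_texts(response_body):
    """Return the "text" field of every item in a JSON response."""
    return [item["text"] for item in json.loads(response_body)]


request_url = build_request_url(
    "http://testing-ground.extract-web-data.com/blocks",
    "//div[@id=case1]/div[1]/span[1]/div",
)
print(request_url)

# The response quoted above, used here instead of a live call.
sample_response = (
    '[{"text":"<div class=\'name\'>Dell Latitude D610-1.73 Laptop Wireless '
    'Computer</div>","attrs":{"class":"name"}}]'
)
print(extract_texts(sample_response))
```

Note that when a callback is supplied the response is presumably wrapped in a function call, so it would have to be unwrapped before being parsed as JSON.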
Visual Web Scraper

This service also offers a special visual tool for building such requests. All you need to do is enter the URL of the website and click on the element you need to scrape.

Conclusion

Though I understand that the developer of this service is attempting to create a simple web scraping service, it is still hard to imagine where it can be useful. The task the service performs can be easily accomplished in almost any programming language.

Probably if you already have software receiving JSON from the web, and you want to feed it with data from some website, then you may find this service useful. The other possible application is to hide your IP when you do web scraping. If you have other ideas, it would be great if you shared them with us.

Source: http://extract-web-data.com/a-simple-way-to-turn-a-website-into-json/

Tuesday, 24 September 2013

Selenium IDE and Web Scraping

Selenium is a browser automation framework that includes an IDE, a Remote Control server and bindings for various languages, including Java, .NET, Ruby, Python and others. In this post we touch on the basic structure of the framework and its application to Web Scraping.

What is Selenium IDE

Selenium IDE is an integrated development environment for Selenium scripts. It is implemented as a Firefox plugin, and it allows recording browser interactions so that they can be edited afterwards. This works well for composing and debugging software tests. The Selenium Remote Control is a server specific to a particular environment; it runs custom scripts against the browsers it controls. Selenium deploys on Windows, Linux, and Mac OS X. To see how the various Selenium components are supported by the major browsers, read here.

What Selenium does and how it applies to Web Scraping

Basically, Selenium automates browsers. This ability can certainly be applied to web scraping. Since browsers (and therefore Selenium) support JavaScript, jQuery and other methods of working with dynamic content, why not use this mix to our benefit in web scraping, rather than trying to catch Ajax events with plain code? The second reason for this kind of scrape automation is browser-fashion data access (though today most libraries can emulate this).

Yes, Selenium automates browsers, but how do we control Selenium from a custom script in order to automate a browser for web scraping? There are Selenium libraries (bindings) for PHP and other languages that let scripts call and use Selenium. It is possible to write Selenium clients (using these libraries) in almost any language we prefer, for example Perl, Python, Java or PHP. Those libraries (the API), along with a server written in Java that invokes browsers to perform actions, constitute Selenium RC (Remote Control). Remote Control automatically loads the Selenium Core into the browser to control it. For more details on Selenium components, refer to here.

A tough scraping task for a programmer

“…cURL is good, but it is very basic. I need to handle everything manually; I am creating HTTP requests by hand. This gets difficult – I need to do a lot of work to make sure that the requests that I send are exactly the same as the requests that a browser would send, both for my sake and for the website’s sake. (For my sake because I want to get the right data, and for the website’s sake because I don’t want to cause error messages or other problems on their site because I sent a bad request that messed with their web application). And if there is any important javascript, I need to imitate it with PHP. It would be a great benefit to me to be able to control a browser like Firefox with my code. It would solve all my problems regarding the emulation of a real browser… it seems that Selenium will allow me to do this…” -Ryan S

Yes, that’s what we will consider below.
Scrape with Selenium

In order to create scripts that interact with the Selenium Server (Selenium RC, Selenium Remote WebDriver) or to create a local Selenium WebDriver script, you need language-specific client drivers (also called Formatters; they are included in the selenium-ide-1.10.0.xpi package). The Selenium servers, drivers and bindings are available at the Selenium download page.

The basic recipe for scraping with Selenium:

    Use the Chrome or Firefox browser.
    Get Firebug or Chrome Dev Tools (Ctrl+Shift+I) in action.
    Install the requirements (Remote Control or WebDriver, libraries and so on).
    Selenium IDE: record a ‘test’ run through a site, adding some assertions.
    Export it as a Python (or other language) script.
    Edit the script (loops, data extraction, db input/output).
    Run the script against the Remote Control.

Short intro slides on scraping tough websites with Python & Selenium are here (as Google Docs slides) and here (SlideShare).

Selenium components for Firefox installation guide

To install the Selenium IDE in Firefox, see here, starting at slide 21. The Selenium Core and Remote Control installation instructions are there too.

Extracting dynamic content using jQuery/JavaScript with Selenium

One programmer is doing a similar thing …

1. Launch a Selenium RC (Remote Control) server.
2. Load a page.
3. Inject the jQuery script.
4. Select the content of interest using jQuery/JavaScript.
5. Send it back to the PHP client as JSON.

He finds it particularly easy and convenient to use jQuery for screen scraping, rather than using PHP/XPath.
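
The same steps can be sketched in Python. Note the substitutions: instead of Selenium RC and a PHP client, this sketch assumes a WebDriver-style object that offers `get()` and `execute_script()`, as the Selenium Python bindings do. The jQuery URL, the function names and the polling timeout are assumptions for illustration, not that programmer's actual code.

```python
import json
import time

# Assumed CDN location of jQuery; any reachable copy would do.
JQUERY_URL = "https://code.jquery.com/jquery-1.10.2.min.js"

# JavaScript that reports whether jQuery is available on the page.
JQUERY_LOADED = "return typeof jQuery !== 'undefined';"

# JavaScript that appends a <script> tag loading jQuery into the page.
INJECT_JQUERY = """
var s = document.createElement('script');
s.src = arguments[0];
document.head.appendChild(s);
"""

# JavaScript that collects the text of every element matching a selector
# and returns the list serialized as a JSON string.
SELECT_TEXTS = """
return JSON.stringify(
    jQuery(arguments[0]).map(function () { return jQuery(this).text(); }).get()
);
"""


def wait_for_jquery(driver, timeout=10.0, poll=0.25):
    """Poll the page until jQuery is defined or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script(JQUERY_LOADED):
            return
        time.sleep(poll)
    raise TimeoutError("jQuery did not load in time")


def scrape_with_jquery(driver, page_url, selector):
    """Load a page, inject jQuery if it is missing, and return matched texts."""
    driver.get(page_url)
    if not driver.execute_script(JQUERY_LOADED):
        driver.execute_script(INJECT_JQUERY, JQUERY_URL)
        wait_for_jquery(driver)
    return json.loads(driver.execute_script(SELECT_TEXTS, selector))
```

With the Selenium Python bindings installed, a call might look like `scrape_with_jquery(webdriver.Firefox(), url, "div.name")`, where the selector is whatever matches the content of interest. The script tag loads asynchronously, which is why the sketch waits for jQuery before running the selection.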
Conclusion

Selenium IDE is a popular tool for browser automation, used mostly for software testing, yet Web Scraping techniques for tough dynamic websites can also be implemented with the IDE along with the Selenium Remote Control server. These are the basic steps:

    Record the ‘test’ browser behavior in the IDE and export it as a script in the programming language of your choice.
    The exported script runs against the Remote Control server, which forces the browser to send HTTP requests; the script then catches the Ajax-powered responses to extract content.

Selenium-based Web Scraping is an easy task for small-scale projects, but it consumes a lot of memory, since it has to launch a full browser instance rather than issue plain HTTP requests.

Source: http://extract-web-data.com/selenium-ide-and-web-scraping/

Monday, 23 September 2013

How Data Entry Outsourcing Can Benefit You

The debates on outsourcing go on, and though opponents come down heavy on it, the benefits are too many to be ignored. Data entry is a task that is widely outsourced. Managing data is not a trifling task for big organizations. Proper management of information is crucial for their efficient functioning. Organizations have to manage large volumes of data every day. Outsourcing helps manage such information. This article looks at how data entry outsourcing services can benefit you, and improve your productivity and return on investment.

Core Benefits

Efficient data entry services give your organization improved information systems, better customer satisfaction, readily available information, and records that meet the necessary standards. All this saves time and money, and improves your productivity and efficiency. The data entry outsourcing services provided by an experienced outsourcing company offer many advantages:

Professional expertise: Outsourcing the job allows you to benefit from the expertise of professional operators working with advanced technology to ensure efficient solutions.

Save time and gain a competitive edge: Outsourcing services minimize your administrative workload. It also gives your employees more time to focus on other important tasks. This would definitely help you gain a competitive edge.

Improve productivity and revenue: Outsourcing services are consistent and uninterrupted. Regular monitoring is also possible.

Cut infrastructure costs: Professional data entry services can considerably reduce your operating overhead costs. Outsourcing eliminates the need to invest in the computer systems and other infrastructure, manpower and resources needed to do the job in-house.

Streamline documentation tasks: The well-organized data processing solutions of reliable BPO companies enable you to streamline your routine documentation workflow.

Maintain accurate information systems: Outsourcing enables you to maintain error-free and up-to-date official records. This facilitates easy access to and retrieval of relevant information at any time. You can avoid data backlogs and get your records in easy-to-use electronic formats or as hard copies.

Benefit from disaster recovery: Because all data is managed with proper backups, you are assured of disaster recovery solutions in case of data loss.

Security: All your data is secure as reliable service providers have security measures in place to prevent hacking.

Professional Data Processing Solutions

Many established companies in the US are equipped with advanced technology and experienced professionals with excellent skills in keyboard handling and handwriting recognition. These experts can process both numeric and alphanumeric data with great speed and accuracy. They provide proficient solutions for handwritten materials, texts, books, surveys, medical claims, insurance claims, medical billing forms, legal documents, images, practice forms, product details, scanned images, and more. The benefits you are assured of with these professional data entry services are:

• 99% accuracy rate
• Multi-level quality checking
• Safe and convenient file transfer options
• Stringent data confidentiality and security
• Customized turnaround time
• Competitive pricing, with cost savings up to 40 percent
• Continuous technical support
• Free trials

Select an Established Outsourcing Partner

In summary, professional data entry outsourcing services help streamline your workflow and ensure benefits in terms of business efficiency, costs, time, resources and effort. A web search can lead you to a reliable service provider.

Data Entry Outsourcing - MOS, a leading BPO company based in the US, provides online data entry services for simple or complex, cost-sensitive or urgent projects.

Source: http://ezinearticles.com/?How-Data-Entry-Outsourcing-Can-Benefit-You&id=6361413

Friday, 20 September 2013

Offshore Data Entry Is The Need Of The Day For Any Business

To run a business successfully means embracing a new challenge every day. It means exciting avenues to venture into and daunting hurdles to overcome while keeping ahead of competitors. Each day is a new day that needs to be met with new strategies, plans and goals. However, certain crucial aspects of running a business can become monotonous and require repetitive work on a regular basis, yet the accuracy needs to be impeccable. The data entry requirements of a company fall under this category of tasks that can be quite time-consuming and repetitive but are essential for running the business successfully. Businesses are therefore looking for options to get this task done smoothly without using up important company resources, while maintaining the required standard of accuracy and confidentiality. Offshore data entry is therefore fast becoming the preferred option of every business entity.

Offshore data entry is the process of hiring an external entity to perform the data entry functions for the business in a country other than the ones where the products and services of the business will be sold or used. The offshore data entry services provided by a vendor give the firm access to processed and accurate data that has been well presented to be of maximum use to the firm. The employees of the offshore data entry firm collect data from written or printed records and enter it into the computer system. This data is maintained in a systematic manner to be as informative to the business as possible. The offshore data entry records are then transferred back to the client for regular reference and checking. Some of the major countries providing such offshore data entry services are India, China, Russia, Pakistan, Nepal, Bangladesh, Egypt and Malaysia.

The major criteria for a job to qualify for offshoring are that the task is repeatable, has high information content and is transferable over the internet, that the process is easy to set up, and that the wages paid to the offshore data entry staff are reasonably lower than those in the original country. The major demand for offshore data entry services arises from the strong need to cut costs, and the internet and the facilities it provides have given a direction to this need. Offshore data entry jobs have opened up a world of opportunities for professionals around the world, and the constant advancement of technology and the internet further adds to the advantage.

Widespread access to the internet has enabled many freelancers across the globe to offer their offshore data entry services to small businesses, and this works out to be a winning deal for both parties. Free trade advocates are vocal in their support for the offshore data entry business, as they feel that labor offshoring will benefit economies as a whole. Whatever the reason for a company to employ offshore assistance, the fact remains that in today's world offshore data entry is a booming business and the trend definitely seems to be upward.

Source: http://ezinearticles.com/?Offshore-Data-Entry-Is-The-Need-Of-The-Day-For-Any-Business&id=646558

Thursday, 19 September 2013

Digitize Data With Data Processing Services

Unorganized data might cost you your numero uno position in your domain. If you have well-organized data, it will not only be helpful in decision-making but will also guarantee a smooth flow of your business. If you are stuck with heaps of documents to be converted into electronic format, then outsourcing your files to a company providing Large Volume Data Processing Services is the most accurate and efficient option.

Data processing is the process in which computer programs and other processing systems are used to analyze, summarize and convert the data into an electronic format.

It involves a series of processes:

    Validation - This process checks whether the entries are correct.
    Sorting - In this process, sorting is done either sequentially or in various sets.
    Summarization - This process summarizes the data into main points.
    Aggregation - Different fragments of records are combined in this process.
    Analysis - This process involves the analysis, interpretation and presentation of the collected and organized data.

Data processing companies have comprehensive knowledge of all the above-mentioned steps and will provide a complete package of large volume data processing services, which includes:

    Manual data entry
    Forms based data capture
    Full text data capture
    Digitization
    Document conversion
    Word Processing
    e-Book conversion
    Data extraction from web
    OCR (optical character recognition)

By outsourcing, you can get large volumes of data processed pretty quickly and can focus more on core business activities.

You will have access to many other benefits, such as:

    Heaps of cluttered and unorganized work will be organized, sorted and digitized.
    You can make use of neatly organized data to make informed business decisions.
    Chances of losing data will be slim once it is digitized.
    You can do away with unwanted data and get access to relevant data.
    You can cut down the operating costs and need not incur any expenses in setting up infrastructure.
    You can get the data converted into a form of your choice.

Companies that deal with Large volume data processing services have the experience, expertise, manpower and technology to deliver results as per your expectations. They can handle your bulk of data easily and process it in your desired format within the deadline.

If you want your large volume of data to be digitized with accuracy and at cost-effective rates, choose an outsourcing company which has years of experience in providing large volume data processing services. You just need to spend a few hours browsing the net and short-listing the prospects. Once you have gone through the portfolios of these firms and are satisfied with their information, you can negotiate the rate with them and stipulate the time.

This article about large volume data Processing services has been authored by Sam Efron. He is an experienced technical content writer from data-entry-india.com. With several years of experience and expertise of writing about Data Processing Services, he brings a seasoned maturity and knowledge to his articles.

Source: http://ezinearticles.com/?Digitize-Data-With-Data-Processing-Services&id=7963690

Monday, 16 September 2013

Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to deny web scrapers access to their websites by using tools or methods that block certain IP addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer, using a different IP address each time and extracting as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.
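
As a rough illustration of the rotation idea, here is a minimal Python sketch that hands out a different proxy for each request in round-robin order, using only the standard library. The proxy addresses are hypothetical placeholders from a documentation-only address range, and the class and function names are my own; a real setup would use proxies you own or rent.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation-only range); replace them
# with proxies you control or rent.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


class ProxyRotator:
    """Hand out proxies in round-robin order, one per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def opener(self):
        """Build a urllib opener that routes traffic through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


def fetch(rotator, url, timeout=10):
    """Fetch a URL through the next proxy in the rotation."""
    proxy, opener = rotator.opener()
    with opener.open(url, timeout=timeout) as response:
        return proxy, response.read()
```

Calling `fetch(ProxyRotator(PROXIES), url)` repeatedly with the same rotator would send each request through the next address in the list, so the target website sees the requests arriving from different IP addresses.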

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers, but that option tends to be quite pricey, though arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick, however, is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However, if you do succeed in discovering a pool of working public proxies, there are still inherent dangers in using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through it that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. Several companies offer this and claim to delete all web traffic logs, which allows you to anonymously harvest the web with minimal threat of reprisal. Companies such as http://www.Anonymizer.com offer large scale anonymous proxy solutions, but these often carry a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implement a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats, often before you could even finish configuring your off-the-shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993