If you don’t collect data on your website, you’re already behind when it comes to marketing and analysis. Web data mining combines collecting data from your own website with mining data from third-party websites. These third-party sources can be competitor websites, a search engine or just a collection of websites. Data mining is combined with marketing analysis, so you can identify trends, customer preferences and layout and design improvements that create a better conversion rate for your website.
What is Data Mining?
To understand web data mining, it helps to understand general data mining. There is no specific data set you must collect when you work with data mining techniques; you gather whatever data helps you identify the advantages you can gain in marketing.
Data mining starts at home. This means you should always mine the user data you have easy access to. Access is easy because it’s your website, your web server and your programming. You can gather almost any kind of data when a user hits your site, but you need to collect the right type of data. This can be hard for a first-time website owner. You might even depend solely on search engines to send you traffic, but this is a bad business model. One of your marketing goals should be to build traffic from other sources. You can also increase conversion rates by collecting data about search engine users and their browsing habits on your site.
For instance, let’s say that you own an ecommerce store and a visitor comes to your website from Google. Some of your visitors will bounce from the site, but others will engage. Your first priority with random search engine traffic is to engage these users and cater landing pages to them. Obviously, you can’t make everyone happy, but data mining helps you make the majority of users happy. You can incorporate a high level of mining into your pages, or you can get very granular.
The first piece of data you might want to collect is what link the user clicks. Your first question might be whether or not a link matters, but it does. Google Analytics is a tool you place on your pages, and using this tool along with your own data mining helps determine the best way to set up your pages. For instance, if you have 1000 users who come to your site using the search engine term “widgets” and most of these users click on a “red widgets” link, it indicates that users are generally looking for “red widgets.” You would then create a landing page that has a call-to-action for red widgets prominently on your site.
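The click-counting idea can be sketched in a few lines. This is a minimal illustration, not a real analytics pipeline: the log format (search term, clicked link) and the sample data are assumptions.

```python
from collections import Counter

def top_clicked_links(click_log, search_term):
    """Rank which on-site links users click after arriving via a search term.

    click_log: list of (search_term, clicked_link) pairs -- a simplified
    stand-in for data pulled from your own analytics database.
    """
    counts = Counter(link for term, link in click_log if term == search_term)
    return counts.most_common()

# Hypothetical sample of mined click data:
log = [
    ("widgets", "/red-widgets"),
    ("widgets", "/red-widgets"),
    ("widgets", "/blue-widgets"),
    ("gears", "/gears"),
]
print(top_clicked_links(log, "widgets"))  # → [('/red-widgets', 2), ('/blue-widgets', 1)]
```

If “/red-widgets” dominates the ranking, that’s your signal to build the red-widgets landing page described above.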
Another important piece of information is any related items that users might buy during purchases of your main products. For instance, if you sell children’s toys, you might also sell batteries for those toys. Do your customers typically buy batteries together with your products? If so, then it’s probably a good idea to add batteries as an up-sell on your ecommerce store pages.
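Finding products that sell together is a simple co-occurrence count over orders. The sketch below assumes each order is just a set of product names; real order records would carry more fields.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(orders):
    """Count how often each pair of products appears in the same order.

    orders: list of sets of product names (a simplified stand-in for
    real order records).
    """
    pairs = Counter()
    for order in orders:
        # Sort so each pair is counted under one canonical key.
        for a, b in combinations(sorted(order), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical order history:
orders = [
    {"toy robot", "batteries"},
    {"toy robot", "batteries", "gift wrap"},
    {"doll"},
]
print(co_purchase_counts(orders)[("batteries", "toy robot")])  # → 2
```

A pair with a high count, like toys and batteries here, is a candidate for an up-sell on the product page.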
You can get more technical and granular even if it seems like useless information. Data mining depends on large amounts of data to produce quality results for your marketing analysis. The above two scenarios depend on thousands of data points to return a reliable report that you can then use to change your ecommerce store design and marketing.
While this type of mining is necessary for a successful website, it doesn’t explain web data mining. Web data mining is for external analysis of the web.
What is Web Data Mining?
Web data mining requires far more resources and a deeper understanding of the Internet. When you mine from your own website, you control all of the reports and analysis tools. With web data mining, you don’t have control of the external websites, so you need to “crawl” the Internet.
Probably the biggest and most successful data miner is Google. Google has a crawler the company calls “Googlebot.” Googlebot spiders or crawls the Internet looking at hundreds of factors. This data is mined, stored and used for ranking your website. The Internet has trillions of web pages available, so you can imagine the type of resources needed to store and run algorithms.
There are several different types of web mining, and you can create a crawler that finds a certain type of data. Google is an example of a massive amount of resources and data mining capabilities. But you might want to mine for specific data.
For instance, take domaintools.com as an example. This website mines the web for server and domain data. You type a name into Domain Tools’ website, and the application returns several pieces of data. The site gives you “whois” information, which is the registration information for the domain. It also tells you the IP address of the web server, other websites hosted on the same server and the country in which the domain is hosted, and it even shows you a snapshot of the domain’s home page. Google also collects much of this data, but Domain Tools specializes in domain registration and server analytics.
Google released its “Penguin” algorithm several years ago. This algorithm devalued sites with spam links. Some webmasters capitalized on this algorithm change and created data mining tools for backlinks. Tools such as Majestic SEO and Ahrefs mine the web for backlinks. These tools also give you reports and statistics for these backlinks, so website owners can separate good links from bad ones and clean up their website’s backlink profile.
You can also mine the web for application information. For instance, the web contains billions of websites, but you might want to collect data that identifies a site’s structure, the language used to create it and the “type” of website hosted. This information can be used as competitor analysis for your own site. For instance, suppose you want to identify information on the first 10 search engine results for other ecommerce stores. You can crawl the web server and identify the operating system used, which is returned in the server header information. You can identify back-end coding languages by the site’s structure and sometimes its source code. Website owners can install pre-packaged ecommerce software such as Joomla, WordPress, OpenCart or Magento, and you can usually identify this software from source code. This helps you identify your competitor’s platform, so you can either use the same one or develop your own.
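Platform detection from source code often comes down to looking for tell-tale strings. The sketch below uses a few common markers (WordPress, for example, serves assets from /wp-content/); a real fingerprinting tool would check many more signals, and the marker list here is only illustrative.

```python
def detect_platform(html):
    """Guess the CMS/ecommerce platform from tell-tale strings in page source.

    The signatures below are common markers, but real detection
    needs many more checks (headers, cookies, file paths, etc.).
    """
    signatures = {
        "WordPress": "/wp-content/",
        "Joomla": "/media/jui/",
        "Magento": "/skin/frontend/",
        "OpenCart": "index.php?route=",
    }
    return [name for name, marker in signatures.items() if marker in html]

# Hypothetical snippet of crawled page source:
sample = '<link rel="stylesheet" href="/wp-content/themes/shop/style.css">'
print(detect_platform(sample))  # → ['WordPress']
```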
Mining also lets you grab analytical information on a website’s structure. You can obtain a sitemap from links on the site or links found on the web. With this information, you can analyze a site’s structure. Sitemaps are XML files that identify pages on a website. Search engines use sitemaps to find web pages, and backlinks also help a search engine find pages. Combining sitemap data with backlink data lets you mine a site’s structure.
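Because sitemaps are plain XML, extracting a site’s page list is straightforward. This sketch parses an inlined minimal sitemap rather than fetching one from a live site; the URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract every page URL (<loc> element) from a sitemap XML document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

# A minimal sitemap, inlined here instead of fetched from a live site:
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/red-widgets</loc></url>
</urlset>"""
print(sitemap_urls(sample))  # → ['https://example.com/', 'https://example.com/red-widgets']
```

Feeding these URLs back into a crawler’s queue is how a miner walks a site’s structure page by page.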
You can even grab content from a site. A web server returns a page of HTML code, which also contains the content for that page. Google’s crawlers and algorithms are masters at identifying the content and focus of web pages. This type of mining requires massive storage, so you should expect high storage costs for such a procedure.
How Do People Mine the Web?
Crawlers aren’t easy to create, so unless you know how the Internet works and how to code, you probably need to buy an engine or hire a programmer to create one for you. The type of crawler you create depends on the type of mining you want to do. You probably have limited finances, so collecting and storing everything is usually too expensive. A crawler can log every bit of information it finds or just log the specific information you want to collect.
A crawler is basically a bot. A bot is a program. Every search engine has a bot. These bots are the most popular data mining tools. You can think of a bot as a different kind of browser. Instead of grabbing a web page from a server and displaying the HTML on a user’s screen, the bot finds a page, grabs information and logs this information to a database.
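The grab-and-log step can be sketched with the standard library. To keep the sketch runnable offline, it parses an inlined HTML sample; in a live bot you would fetch the page first (for example with urllib.request.urlopen) and feed the response body into the same parser. All names here are illustrative.

```python
from html.parser import HTMLParser

class LinkLogger(HTMLParser):
    """Collects href values the way a crawler logs a page's outgoing links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkLogger()
    parser.feed(html)
    return parser.links

# A live bot would do something like:
#   html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
# Here we parse an inlined sample so the sketch runs without network access:
sample = '<p><a href="/about">About</a> <a href="/shop">Shop</a></p>'
print(extract_links(sample))  # → ['/about', '/shop']
```

Each discovered link would then be written to the database and queued for its own crawl.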
A bot usually runs based on some kind of trigger. You can run the program manually, but most data mining bots run on a schedule. You can schedule it for certain times of day or base it on some kind of trigger such as finding a new website or link. You can even use your website visitors to trigger the bot. For instance, a user enters information into your ecommerce signup form and gives you a URL as a referral for how they found your site. After storing this URL, you then crawl it to mine data from it.
How you use this information is just as important as how you collect it. A good database design helps maintain data integrity and avoid redundancy. Good database design also affects performance, so unless you want your reports to take hours to render, make sure your database is normalized and indexed.
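A normalized, indexed schema for crawl results might look like the following sketch. The table and column names are hypothetical, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import sqlite3

# In-memory database for illustration; a real miner would use a file or a DB server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each site is stored once; each crawled page references it.
    CREATE TABLE site (
        id     INTEGER PRIMARY KEY,
        domain TEXT NOT NULL UNIQUE
    );
    CREATE TABLE page (
        id      INTEGER PRIMARY KEY,
        site_id INTEGER NOT NULL REFERENCES site(id),
        url     TEXT NOT NULL UNIQUE,
        status  INTEGER
    );
    -- Index the column your reports filter on, so they don't take hours.
    CREATE INDEX idx_page_site ON page(site_id);
""")
conn.execute("INSERT INTO site (domain) VALUES ('example.com')")
conn.execute(
    "INSERT INTO page (site_id, url, status) VALUES (1, 'https://example.com/', 200)"
)
row = conn.execute(
    "SELECT status FROM page WHERE url = ?", ("https://example.com/",)
).fetchone()
print(row[0])  # → 200
```

Keeping the domain in its own table means a renamed or re-crawled site updates one row instead of thousands of duplicated strings.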
A bot’s complexity varies, but you should think of it in the same way you think of your browser. First, the bot does a lookup on the URL using DNS. DNS servers translate friendly domain URLs to IP addresses. The bot program can then use the IP address to “find” the web server and website on the Internet.
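The lookup step is one call in most languages. The sketch below resolves "localhost" so it runs without touching the network; a real bot would resolve the target site’s domain instead.

```python
import socket

def resolve(hostname):
    """Translate a friendly hostname to an IP address, as a bot does before fetching."""
    return socket.gethostbyname(hostname)

# "localhost" resolves locally without a network round-trip; a live crawl
# would pass the target domain, e.g. resolve("example.com").
print(resolve("localhost"))  # → 127.0.0.1
```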
Next, the bot can view and store server headers. Server headers are set by your host, but if you have a dedicated server or VPS, you can set custom server headers. Server headers tell you a few things about the site. First, they tell you the server’s operating system. Second, the server sends a response code, and there are several response codes. For instance, a “200” response means the page was returned without an error. A “404” means the page was not found. A “503” means the service is unavailable (usually for scheduled downtime such as maintenance). These are some of the most common responses, but your bot needs to account for each server response.
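Accounting for each response often starts as a simple dispatch table. The action strings below are illustrative placeholders for whatever your bot actually does.

```python
# A bot must branch on the server's response code; a few common cases.
# The action strings are illustrative, not a prescribed crawl policy.
RESPONSE_ACTIONS = {
    200: "store the page",               # OK
    301: "follow to the new URL",        # moved permanently
    404: "drop the URL from the queue",  # not found
    503: "retry later",                  # service unavailable (e.g. maintenance)
}

def handle_status(code):
    """Return the crawl action for a status code; unknown codes get logged."""
    return RESPONSE_ACTIONS.get(code, "log and skip")

print(handle_status(200))  # → store the page
print(handle_status(418))  # → log and skip
```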
One issue to remember is that site owners sometimes watch for bots and anonymous browsing. If you use too much of a website’s resources, the site owner or the host might block your bot. Be considerate with a bot: acknowledge and honor the robots.txt file. A robots.txt file contains directives for bots. It’s always stored in the site’s root, and it tells bots which directories and files the webmaster doesn’t want crawled. If you don’t honor this file, webmasters and hosts might block your bot by user agent or IP address.
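Python’s standard library ships a robots.txt parser, so honoring the file takes only a few lines. The sketch parses an inlined sample instead of fetching a live robots.txt, and the bot name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules parsed from an inlined sample instead of a live robots.txt fetch:
rules = """User-agent: *
Disallow: /private/
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check each URL before crawling it; "MyMinerBot" is a hypothetical user agent.
print(parser.can_fetch("MyMinerBot", "https://example.com/private/data.html"))  # → False
print(parser.can_fetch("MyMinerBot", "https://example.com/products.html"))      # → True
```

In a live bot, you would load the rules with RobotFileParser’s set_url and read methods and call can_fetch before every request.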
Even if you have permission to crawl the site, you still shouldn’t send too much traffic to a website. Some web hosts don’t have the resources to handle large amounts of data, so your bot can affect regular user traffic. If the site goes down, your bot can cost the webmaster sales and organic search engine traffic.
Data mining is difficult to get started with, but it’s a great analysis tool. You also need to create reports, which can be just as difficult once you have to determine how to calculate your numbers. With some good analysis, you can create better pages, better layouts and better content than your competitors. Make sure you understand data mining before hiring a programmer or creating reports, perhaps with a class at Udemy.com.