The main significance of undertaking this research is to contribute to scholarly understanding by developing a new model and introducing innovative interpretations of it, so as to present a fast and reliable means of data storage and retrieval. The research will benefit not only scholars but business organizations as well, since business is the sector most affected: most businesses handle large volumes of data about their clients and the products and services they offer. Because most of this data is in an unstructured state, many organizations feel that it is not well protected. Much of this data is sensitive and confidential, and it must therefore be protected with appropriate measures. In addition, the data should support numerous distribution needs, since it is shared across several platforms and between partners and entities.
The handling and management of unstructured data is documented as one of the chief unsettled problems in the information technology industry (9). This is because the techniques and equipment that proved successful in handling structured data have proved ineffective when dealing with unstructured data, which is bulky and complex (9). This has consequently created a corresponding need for tools and techniques for managing unstructured data that not all governments can afford. Inefficiency and irrelevance in search is another problem: one of the basic requirements for unstructured data is that it should be searchable. Blumberg & Atre (9) argue that before the web came into use, files and documents were searched using full-text and other search systems. With the rapid growth of internet use, however, these search tools became too inefficient to serve the high demand (9). Several research studies have found that the average length of search phrases used on public websites is only about 1.5 to 2.5 words. Research has also shown that competent use of Boolean operators appears in fewer than ten percent of searches. With such short phrases and so little use of sophisticated search techniques, it is no surprise that the results are poor or irrelevant. In addition, search engines treat every query independently, which means that the same results are returned for a particular search phrase even when the context differs (9).
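The effect of short, operator-free queries described above can be illustrated with a toy full-text search. This sketch is purely illustrative (the documents and query terms are hypothetical, not from the source): matching a two-word phrase with an implicit OR returns many loosely relevant hits, while an explicit Boolean AND narrows the result set.

```python
# Toy illustration of why short queries without Boolean operators
# return irrelevant results. Documents are hypothetical examples.

docs = [
    "apple pie recipe",
    "apple stock price",
    "cherry pie bakery",
    "pie chart tutorial",
]

def search_or(query, docs):
    """Implicit OR: a document matches if ANY query term appears."""
    terms = query.lower().split()
    return [d for d in docs if any(t in d for t in terms)]

def search_and(query, docs):
    """Boolean AND: a document matches only if ALL terms appear."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d for t in terms)]

print(search_or("apple pie", docs))   # all four documents match
print(search_and("apple pie", docs))  # only "apple pie recipe"
```

With the loose OR interpretation, three of the four results are irrelevant to someone looking for a pie recipe, which mirrors the poor relevance the paragraph attributes to typical 1.5 to 2.5 word queries.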
According to Blumberg & Atre (9), the inconsistency in the methodologies, frameworks and taxonomies used to organize and classify data shows that there is no settled approach to interpreting and modeling unstructured data. Instead of there being one reliable and efficient standard, several models are emerging, each with its own values and insights. This divergence poses a major problem for the future, since it is not certain which model will actually be able to handle unstructured data effectively. This duplication of effort is reflected in the varied use of the internet and other data extraction systems in the attempt to achieve reliable standards (9).
Raghavan (16) classifies data using a two-class classification system based on standard queries: filtering and routing. He asserts that classification is usually general in focus, and the process by which such classification is performed is called text classification (15). According to Mena (10), unstructured data is best managed when organized in hierarchical systems commonly referred to as taxonomies. A taxonomy can be described as a hierarchical classification structure that moves down from the broad to the specific. This structure acts like a directory on a PC in the sense that it provides an expedient and intuitive way to navigate and acquire information easily. This means that, instead of formulating a query and then evaluating the results, one can directly access the relevant information by formulating queries on the appropriate classes and subclasses. Another advantage of taxonomies is that they limit queries to specific classes and subclasses (10). A well-organized taxonomy may comprise eight to ten levels with hundreds or even thousands of classes.
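The taxonomy idea above can be sketched as a simple tree of classes, where a query issued against a class searches only that class and its subclasses. This is a minimal illustrative sketch, not any model from the cited works; the class names and documents are invented for the example.

```python
# Minimal sketch of a taxonomy-based document index.
# Class names and documents below are hypothetical examples.

class TaxonomyNode:
    def __init__(self, name):
        self.name = name
        self.children = {}   # subclass name -> TaxonomyNode
        self.documents = []  # documents filed directly under this class

    def add_child(self, name):
        node = TaxonomyNode(name)
        self.children[name] = node
        return node

    def search(self, term):
        """Search this class and its subclasses only, so the query
        is limited to the relevant branch of the hierarchy."""
        hits = [d for d in self.documents if term.lower() in d.lower()]
        for child in self.children.values():
            hits.extend(child.search(term))
        return hits

root = TaxonomyNode("Products")
electronics = root.add_child("Electronics")
phones = electronics.add_child("Phones")
laptops = electronics.add_child("Laptops")
phones.documents.append("Phone X user manual")
laptops.documents.append("Laptop Y warranty terms")

# Querying the Phones subclass never scans the Laptops branch.
print(phones.search("manual"))         # ['Phone X user manual']
print(electronics.search("warranty"))  # ['Laptop Y warranty terms']
```

The point of the sketch is the scoping behavior: issuing the query on a subclass restricts the search space, which is the advantage the paragraph attributes to taxonomies.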
The Confidentiality, Integrity and Availability (CIA) assurance model is a commonly used model that recognizes the confidentiality, integrity and availability of data as the elementary security characteristics of information, particularly unstructured data (11). Confidentiality is the guarantee of data discretion, the purpose of which is to ensure that only authorized and intended processes, people or devices can access the data (9). This is achievable only when cryptography is implemented. Integrity is the assurance that once data is stored it cannot be altered in transit, and that the sender of the data is who they claim to be. To ensure that data is not altered or corrupted in transit, hash algorithms and digital signatures are used (14). Availability is the assurance that data can be accessed by the user in a reliable and timely manner, guaranteeing fast availability of data on request (9). Even while confidentiality and integrity are maintained, attackers may still cause such information to become less accessible or completely unavailable.
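The integrity check described above can be illustrated with a cryptographic hash. This is a minimal sketch using Python's standard `hashlib` module (the message contents are hypothetical): the sender publishes a digest of the data, and the receiver recomputes it; any alteration in transit changes the digest and is detected. A digital signature would additionally bind the digest to the sender's identity, which this sketch does not show.

```python
import hashlib

def digest(data: bytes) -> str:
    # SHA-256 digest used as an integrity check value.
    return hashlib.sha256(data).hexdigest()

original = b"quarterly sales report"   # hypothetical payload
fingerprint = digest(original)         # published alongside the data

# Receiver recomputes the digest over what actually arrived.
received_intact = b"quarterly sales report"
received_tampered = b"quarterly sales report (edited)"

print(digest(received_intact) == fingerprint)    # True  -> integrity holds
print(digest(received_tampered) == fingerprint)  # False -> tampering detected
```

Note that a hash alone detects accidental or malicious alteration but does not authenticate the sender; that is the role of the digital signatures mentioned in the paragraph.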