Wednesday, 25 May 2016

Big Data and its testing framework.



Big Data remains something of a buzzword in the technology industry, yet many people are still confused about how it differs from traditional data warehousing. Through a series of technology innovations, the old separation between the online and offline worlds is slowly eroding. Two big factors are driving this. One is the huge expansion in mobility and the fact that people increasingly use their mobile devices in the real world. The other is the Internet of Things: simple devices have become far more capable and can now be connected to the Internet. Together, these trends are generating enormous amounts of data.




Big Data is much more than just data itself. It is characterized by volume, velocity, variety and veracity, each of which exceeds anything traditional systems have had to handle. Managing Big Data therefore calls for different methods of storing and processing data. With the data volumes involved, there is a high likelihood of bad data embedded in the mix, which increases the challenges of quality assurance testing.
 
Big Data is ultimately about the questions a business needs answered, such as whether a planned investment is viable. It provides answers to such questions before decisions are made. Hence, testing Big Data is not only about analyzing data, but also about posing queries about future business.

Some of the biggest challenges of Big Data are as follows:

-      Seamless integration: Integrating the BI framework with Big Data components is a major challenge, because the search algorithms may not follow valid data warehousing standards. This runs the risk of poor BI integration, and the quality of the overall system cannot be assured because of the operational disparity between the BI and Big Data systems.

-      Data standards and data cleanup: Big Data systems lack precise rules for filtering out or scrubbing bad data. Their data sources are often unconventional ones such as sensors, social media feeds and surveillance cameras.

-      Real-time data processing: Big Data solutions need to process high-density data volumes accurately and make them BI-friendly. Big Data processing frameworks execute jobs across clusters of smaller nodes, and they do so in batches rather than in a true real-time mode.
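The data-cleanup challenge above can be made concrete with a small sketch. This is a hypothetical example: the field names (`sensor_id`, `reading`) and the validity rules are assumptions for illustration, not rules from any particular framework.

```python
# Hypothetical sketch: scrubbing malformed sensor readings before load.
# Field names and validity rules are assumptions for illustration.

def is_valid(record):
    """A record must have a numeric reading and a non-empty sensor id."""
    try:
        float(record["reading"])
    except (KeyError, TypeError, ValueError):
        return False
    return bool(record.get("sensor_id"))

def scrub(records):
    """Split a batch into clean rows and rejects kept for QA review."""
    clean, rejects = [], []
    for rec in records:
        (clean if is_valid(rec) else rejects).append(rec)
    return clean, rejects

batch = [
    {"sensor_id": "s1", "reading": "21.5"},
    {"sensor_id": "", "reading": "19.0"},   # missing id -> reject
    {"sensor_id": "s2", "reading": "n/a"},  # bad value -> reject
]
clean, rejects = scrub(batch)
print(len(clean), len(rejects))  # 1 2
```

Keeping the rejects, rather than silently dropping them, gives the QA team a sample of bad data to analyze for new scrubbing rules.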
Considering the challenges of putting an appropriate quality assurance strategy into practice, the following seven features should be part of an ideal end-to-end Big Data assurance framework.

-      Data validation: The ecosystems for Big Data need to be able to read data of any size, from any source and at any speed. The framework should be able to validate data drawn from an unconstrained range of sources: structured data such as text files, spreadsheets and XML, and unstructured data such as audio, video, images, Web content and GIS data.

-      Initial data processing: The Big Data system should be able to validate initial unstructured data processing and use an appropriate processing script for real-time data analysis and communication.

-      Processing logic validation: The data processing language defined in the query language should be able to clean and organize the huge unstructured data into clusters. The framework should be able to validate the raw data with the processed data in order to establish data integrity and prevent data loss.

-      Data processing speed testing: The Big Data assurance framework should integrate seamlessly with tools such as ETL Validator and QuerySurge. These tools validate data from source files and other databases, compare it with the expected data and rapidly produce analysis reports. The framework should also support high-speed processing of bulk data, since speed is crucial in real-time digital data analysis.

-      Distributed file system validation: Validation is key for Big Data frameworks. This feature helps the framework verify parameters such as the velocity, volume and variety of data at each node.

-      Job profiling validation: Job profiling is required in Big Data frameworks prior to running algorithms that process unstructured data, as errors in such algorithms increase the chances of job failure.

-      Field-to-field map testing: Massive architectural makeovers take place when migrating from legacy to Big Data systems, so solution-specific integration testing, with field-to-field mapping and schema validation as a very important part, must be carried out.
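The field-to-field map testing described above can be sketched as a simple reconciliation between a legacy layout and its Big Data target schema. The field names in the mapping below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical sketch: field-to-field mapping check between a legacy
# record layout and its Big Data target schema. The mapping below is
# an assumption for illustration.

FIELD_MAP = {"cust_name": "customer_name", "dob": "date_of_birth"}

def validate_mapping(source_rows, target_rows):
    """Check row counts match and each mapped field carried its value."""
    if len(source_rows) != len(target_rows):
        return False
    for src, tgt in zip(source_rows, target_rows):
        for src_field, tgt_field in FIELD_MAP.items():
            if src.get(src_field) != tgt.get(tgt_field):
                return False
    return True

src = [{"cust_name": "Ada", "dob": "1990-01-01"}]
tgt = [{"customer_name": "Ada", "date_of_birth": "1990-01-01"}]
print(validate_mapping(src, tgt))  # True
```

In a real migration the mapping table would be derived from the schema-validation artifacts, and the comparison would run over sampled batches rather than full rows in memory.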

Conclusion

To leverage the full potential of Big Data, the key is to work out an effective validation strategy. The success of a business is directly tied to the effectiveness of its testing strategy for validating structured and unstructured data of large volume, variety and velocity. Such solutions will not only help build a framework for data validation over an existing data warehouse, but also empower organizations to adopt new approaches for faster and more meaningful analysis of bulk data.

Thursday, 23 July 2015

Big data - Its dimensions and Testing Overview



What is Big Data?

The literal meaning of Big Data is a volume of data so huge that it requires extraordinary measures for storage and processing. From simple read/write queries to a swipe on a screen, this revolution in technology has exponentially increased the amount of data to be stored, because data is now captured not only via office computers but also through phones, tablets and information sensors. Transactions are completed via apps, cards are swiped, IDs are checked, and surveillance cameras, smartphones and GPS devices record continuously. Pretty much every electronic device captures data in diverse dimensions, logging our past, showing our present and helping predict the future.




Dimensions of Big Data

Big data can be better understood through its dimensions. They are defined as Volume, Variety and Velocity. 

1.      Volume - It refers to the amount of data – MB, GB, TB, PB etc.

2.      Variety - It refers to the type of data, such as text, database, photo, audio, video, social, files etc.

3.      Velocity – It refers to the rate at which data is captured, which may be in batches, periodic or real time.
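The three dimensions above can be illustrated with a toy profiling sketch. The sample records and the elapsed-time figure are assumptions made for the example, not measurements from a real system.

```python
# Illustrative sketch: profiling a small batch along the three Vs.
# The sample records and seconds_elapsed are assumptions.
import json

def profile(records, seconds_elapsed):
    volume = sum(len(json.dumps(r)) for r in records)  # approx. bytes
    variety = {r["type"] for r in records}             # distinct data types
    velocity = len(records) / seconds_elapsed          # records per second
    return volume, variety, velocity

records = [
    {"type": "text", "payload": "hello"},
    {"type": "image", "payload": "img001"},
    {"type": "text", "payload": "world"},
]
volume, variety, velocity = profile(records, seconds_elapsed=2.0)
print(sorted(variety), velocity)  # ['image', 'text'] 1.5
```

At real scale, volume is measured in terabytes or petabytes and velocity in millions of events per second, but the same three measurements apply.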

They are managed with the help of advanced technologies such as distributed storage systems, in-memory databases and high-capacity storage systems. The figure further illustrates the management of data in detail.
 
Apart from the above three dimensions of Big Data, further dimensions are emerging and will soon be actively considered in testing methodologies: Veracity, Validity, Volatility and Variability.


The data in a Big Data system is so huge in volume that it cannot all be tested or analyzed at once. Hence the data is tested in batches, and the test results of each batch are compared with those of the other batches or segments. Complex models and sophisticated algorithms have been developed to accurately validate each dimension of the data: the volume and variety of the data are validated through functional testing, while velocity is validated through non-functional testing. Unlike RDBMS-based systems, Big Data systems are built on file systems. Storage is provided by the Hadoop Distributed File System (HDFS), a shared storage system that stores data across a cluster of machines. The data can then be analyzed using the Map-Reduce API, and the output is loaded into other systems, such as data warehouses, where it is processed further.
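The Map-Reduce model mentioned above can be sketched in-process: a map phase emits key-value pairs, the pairs are grouped by key, and a reduce phase folds each group into a result. This is only an illustration of the programming contract; real jobs are distributed across cluster nodes by a framework such as Hadoop.

```python
# Minimal in-process sketch of the Map-Reduce contract: map emits
# key-value pairs, reduce folds the values grouped under each key.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the values."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

lines = ["big data big", "data test"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 2, 'data': 2, 'test': 1}
```

The word-count job is the classic example because the grouping step is easy to see; any aggregation expressible as "emit pairs, then combine per key" fits the same shape.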
Therefore, the main steps are:

1.      Loading the data into the HDFS system – The major testing activities are performed here; the input data are compared with the source files.

2.      Performing test operations using the Map-Reduce API – Here, the business logic is validated on every node, and this process is repeated against multiple nodes. Unlike SQL, where queries are written to run against the data, Map-Reduce jobs generate a list of values from a list of key-value pairs. The output data format is then validated and generated for the next data processing system as per the requirement. This design is based on the principle that moving computation is cheaper than moving data.
 
3.      Loading the data output into other data processing systems – Here, the tester mainly checks for data integrity between the target data and the HDFS data by ensuring that the transformation rules were correctly applied and that no data was lost or damaged in the process.
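The integrity check in step 3 can be sketched as a count-and-checksum reconciliation between the HDFS output and the target system. The rows below are hypothetical, and a real comparison would stream batches rather than sort everything in memory.

```python
# Hypothetical sketch of step 3: reconciling the target system against
# the HDFS output by comparing row counts and per-row checksums.
import hashlib

def checksum(row):
    """Stable fingerprint of a row, comparable across systems."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

def reconcile(hdfs_rows, target_rows):
    """Return True when no rows were lost or altered in the load."""
    if len(hdfs_rows) != len(target_rows):
        return False  # rows were lost or duplicated
    # Sort fingerprints so row order in the target does not matter.
    return sorted(map(checksum, hdfs_rows)) == sorted(map(checksum, target_rows))

hdfs = [("1", "Ada", "100"), ("2", "Bob", "250")]
target = [("2", "Bob", "250"), ("1", "Ada", "100")]  # order may differ
print(reconcile(hdfs, target))  # True
```

When transformation rules change field values on the way into the target, the same check is applied after running the expected transformation over the HDFS rows, so the fingerprints are computed on comparable data.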