What is Big Data?
Big Data literally refers to volumes of data so large that they require extraordinary measures to store and process. From simple read/write queries to a swipe on a screen, this revolution in technology has exponentially increased the amount of data to be stored, because data is no longer captured only through office computers but also through phones, tablets and sensors. Transactions are completed via apps, cards are swiped, IDs are checked, and surveillance cameras, smartphones and GPS devices record continuously. Almost every electronic device captures such data in diverse dimensions, logging our past, showing our present and helping predict the future.
Dimensions of Big Data
Big Data can be better understood through its dimensions, commonly defined as Volume, Variety and Velocity.
1. Volume – the amount of data: MB, GB, TB, PB and beyond.
2. Variety – the type of data: text, database records, photos, audio, video, social media content, files, etc.
3. Velocity – the rate at which data is captured: in batches, periodically or in real time.
Apart from these three dimensions, further dimensions of Big Data are emerging and will soon be actively considered in testing methodologies: Veracity, Validity, Volatility and Variability.
The data involved is so huge in volume that it cannot be tested or analyzed all at once, so it is tested in batches, and the results of each batch are compared with the results of the other batches/segments. Complex models and sophisticated algorithms have been developed to validate each dimension of the data: volume and variety are validated through functional testing, while velocity is validated through non-functional testing.
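As a rough illustration, a batch-level functional check might compare the record count and a checksum of each source extract with the same figures computed on the loaded copy of that batch. The file names below are hypothetical and the MD5 choice is an assumption; this is a minimal sketch, not a prescribed tool.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.stream.Stream;

public class BatchCheck {

    // Count the records (lines) in one batch extract.
    static long countRecords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }

    // MD5 checksum of the batch, so the loaded copy can be compared byte for byte.
    static String checksum(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical batch extracts produced by the source system.
        Path[] batches = { Path.of("orders_batch_01.csv"), Path.of("orders_batch_02.csv") };
        for (Path batch : batches) {
            System.out.printf("%s records=%d md5=%s%n", batch, countRecords(batch), checksum(batch));
            // The same figures are recomputed on the loaded copy of each batch and the two are compared.
        }
    }
}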
Unlike RDBMS systems, Big Data systems are based on file systems. Data storage is provided by the Hadoop Distributed File System (HDFS), a shared storage layer that spreads data across a cluster of machines. The stored data can then be analyzed using the MapReduce API, and the output is loaded into other systems, such as data warehouses, where it is processed further.
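For orientation, the sketch below uses the Hadoop FileSystem Java API to copy a local extract into HDFS and read back its size. The NameNode address and the paths are placeholders; in practice the same copy is often done with the hdfs dfs -put command.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from core-site.xml on the cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("orders_batch_01.csv");             // hypothetical source extract
            Path remote = new Path("/data/staging/orders_batch_01.csv");

            fs.copyFromLocalFile(local, remote);                      // load the batch into HDFS

            FileStatus status = fs.getFileStatus(remote);
            // The stored size (and later a record count) is compared with the source file.
            System.out.println("Stored " + status.getLen() + " bytes at " + remote);
        }
    }
}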
Therefore, the main steps are:
1. Loading the data into the HDFS system – Here, the major testing activities are performed: the input data loaded into HDFS are compared with the source files to confirm that the right data were extracted and loaded.
2. Performing test operations using the MapReduce API – Here, the business logic is validated on every node, and the process is repeated against multiple nodes. Unlike SQL, where queries are written to run on the data, MapReduce generates a list of values from a list of key-value pairs. The output data format is then validated and generated for the next data-processing system as per the requirement. This design is based on the principle that moving computation is cheaper than moving data. A minimal MapReduce sketch follows this list.
3. Loading the data output into other data-processing systems – Here, the tester mainly checks data integrity between the target data and the HDFS data, ensuring that the transformation rules were correctly applied and that no data were lost or corrupted in the process.
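To make step 2 concrete, here is a minimal MapReduce job in the spirit of the classic word-count example: the mapper emits a key-value pair per record, keyed on a hypothetical first CSV column, and the reducer sums the counts per key. The class names, paths and column layout are assumptions for illustration, not a prescribed test harness.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyCountJob {

    // Map: emit (key column, 1) for every input record.
    public static class KeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes a CSV whose first column identifies the record group (hypothetical layout).
            String[] fields = line.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                outKey.set(fields[0]);
                context.write(outKey, ONE);
            }
        }
    }

    // Reduce: sum the counts for each key, giving per-key record totals.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "per-key record count");
        job.setJarByClass(KeyCountJob.class);
        job.setMapperClass(KeyMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/staging"));      // HDFS input from step 1
        FileOutputFormat.setOutputPath(job, new Path("/data/validation")); // totals used in step 3
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The per-key totals written to the output path can then be compared with counts computed directly on the source extracts and, after loading, with counts in the downstream warehouse, which is what the integrity check in step 3 amounts to.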
