8 V’s of Big Data

Volume, Velocity, Variety, Veracity, Value, Visualization, Viscosity, Virality

Algorithm

It is a simple term that is absolutely essential when speaking of Big Data. An algorithm is a mathematical formula or a set of instructions that we provide to the computer, describing how to process the given data in order to obtain the needed information.
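
For illustration, here is a minimal sketch of an algorithm written as a Python function (the sample numbers are made up):

```python
def average(values):
    """A simple algorithm: sum the values, then divide by their count."""
    total = 0
    for v in values:
        total += v
    return total / len(values)

print(average([4, 8, 15, 16, 23, 42]))  # 18.0
```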

Artificial Intelligence

Artificial Intelligence is intelligence exhibited by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision making, or prediction.

Business Intelligence

Business Intelligence is the process of analyzing raw data in search of valuable information, for the purpose of better understanding and improving the business. Using BI can help to make fast and accurate business decisions.

Comparative Analytics

I’ll be going a little deeper into analysis in this article, as big data’s holy grail is in analytics. Comparative analysis, as the name suggests, is about comparing multiple processes, data sets, or other objects using statistical techniques such as pattern analysis, filtering, and decision-tree analytics. I know it’s getting a little technical, but I can’t completely avoid the jargon. Comparative analysis can be used in healthcare to compare large volumes of medical records, documents, images, etc. for more effective and hopefully accurate medical diagnoses.
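
As a rough sketch of the idea, here is a comparison of two made-up data sets with a standard statistical test (a t-test from SciPy); the clinic values are invented for illustration:

```python
# Comparing two data sets with a standard statistical test.
from scipy import stats

clinic_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2]   # e.g., average patient stay (days)
clinic_b = [6.0, 5.9, 6.3, 5.8, 6.1, 6.2]

t_stat, p_value = stats.ttest_ind(clinic_a, clinic_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests a real difference
```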

Cluster Analysis

It is an exploratory analysis that tries to identify structures within the data. Cluster analysis is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, or respondents. Cluster analysis is used to identify groups of cases when the grouping is not known in advance. Because it is exploratory, it does not make any distinction between dependent and independent variables. The different cluster analysis methods that SPSS offers can handle binary, nominal, ordinal, and scale (interval or ratio) data.
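
The entry mentions SPSS, but the same idea can be sketched in Python; here is a minimal, assumed example using scikit-learn’s KMeans on toy data:

```python
# KMeans finds homogeneous groups without any dependent/independent distinction.
from sklearn.cluster import KMeans
import numpy as np

# Toy observations: two obvious groups in 2-D space.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., [0 0 0 1 1 1] -- cluster membership for each case
```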

Data Cleansing

Data cleansing is the process of correcting or removing incorrect data or records from a database. This step is extremely important. When collecting data from sensors, websites, or web scraping, some incorrect data is bound to appear. Without cleansing, the user would risk drawing the wrong conclusions after analyzing the data.
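
A minimal cleansing sketch with pandas; the column names and the -999 error code are assumptions for illustration:

```python
import pandas as pd

readings = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 2],
    "temperature": [21.5, 21.5, -999.0, 22.1, None],  # -999 is a sensor error code
})

cleaned = (readings
           .drop_duplicates()                        # remove repeated records
           .replace(-999.0, float("nan"))            # flag known error codes
           .dropna(subset=["temperature"]))          # drop records with no reading
print(cleaned)
```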

Data Lake

A data lake is a repository that stores a huge amount of raw data in its original format. While a hierarchical data warehouse stores information in files and folders, a data lake uses a flat architecture to store data. Each item in the repository has a unique identifier and is marked with a set of metadata tags. When a business question arises, the repository can be searched for the relevant information, and a smaller, separate set of data can then be analyzed to help solve the specific problem.
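
As a toy sketch of the flat-architecture idea (unique identifiers plus metadata tags), using nothing beyond Python’s standard library:

```python
import uuid

lake = {}

def store(raw_bytes, tags):
    item_id = str(uuid.uuid4())          # unique identifier for each item
    lake[item_id] = {"data": raw_bytes, "tags": set(tags)}
    return item_id

store(b"<clickstream log>", ["web", "2024", "raw"])
store(b"<sensor dump>", ["iot", "2024", "raw"])

# "Search the repository" for items relevant to a business question:
web_items = [i for i, item in lake.items() if "web" in item["tags"]]
print(web_items)
```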

Data Mining

It is an analytical process designed to study large data resources in search of regular patterns and systematic interrelationships between variables, and then to evaluate the results by applying the detected patterns to new subsets of data. The final goal of data mining is usually to predict customer behavior, sales volume, the likelihood of customer loss, etc.
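
A rough sketch of that final goal: mining an invented customer table for a churn pattern with a scikit-learn decision tree, then applying the detected pattern to new data:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented features: [monthly_spend, support_tickets]
X = [[20, 5], [25, 4], [22, 6], [80, 0], [90, 1], [85, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = customer left, 0 = customer stayed

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Apply the detected pattern to a new case: a low-spend, high-ticket customer.
print(model.predict([[21, 5]]))  # likely [1] -- predicted to leave
```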

Data Scientist

A data scientist is a person who can take structured and unstructured data points and use their formidable skills in statistics, maths, and programming to organize them. They apply all their analytical powers, such as contextual understanding, industry knowledge, and an understanding of existing assumptions, to uncover hidden solutions for business development.

Data Visualization

Data visualization is a good solution when a quick look at a large amount of information is required. Using graphs, charts, diagrams, etc. allows the user to find interesting patterns or trends in a dataset. It also helps when it comes to validating data: the human eye can notice unexpected values when they are presented in a graphical way.
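
For example, a minimal matplotlib sketch in which a histogram makes a few planted unexpected values visible at a glance (the data is randomly generated):

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.random.normal(loc=100, scale=15, size=10_000)
values[:5] = 500  # a few unexpected values, easy to spot on the plot

plt.hist(values, bins=50)
plt.xlabel("measurement")
plt.ylabel("count")
plt.show()
```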

Data Warehouse

It is a system that stores data in order to analyze and process it in the future. The source of data can vary, depending on its purpose. Data can be uploaded from the company’s CRM systems as well as imported from external files or databases.

Fuzzy Logic

Fuzzy logic is an approach to logic in which, instead of judging whether a statement is true or false (values 0 or 1), we measure the degree to which the statement is true (values from 0 to 1). This approach is commonly used in Artificial Intelligence.
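
A minimal sketch of a fuzzy membership function in Python (the temperature thresholds are arbitrary):

```python
# Instead of "hot: yes/no", return the degree (0 to 1) to which it is "hot".
def hot(temperature_c):
    if temperature_c <= 20:
        return 0.0                      # definitely not hot
    if temperature_c >= 35:
        return 1.0                      # definitely hot
    return (temperature_c - 20) / 15    # partially hot in between

for t in (15, 25, 30, 40):
    print(t, "->", hot(t))  # 0.0, 0.33..., 0.66..., 1.0
```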

Internet of Things

The Internet of Things, IoT for short, is the concept of connecting devices, such as house lighting, heating, or even fridges, to a common network. It allows large amounts of data to be stored and later used in real-time analytics. The term is also connected with the smart home, the concept of controlling a house with a phone, etc.

Machine Learning

Machine Learning is the ability of computers to learn new skills without being explicitly programmed for them. In practice, this means algorithms that learn from the data they process and use what they have learned to make decisions. Machine learning is used to exploit the opportunities hidden in big data.
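
A minimal "learn from data" sketch with scikit-learn, where the model is fitted to invented example points rather than programmed with explicit rules:

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # e.g., advertising spend
y = [2.1, 3.9, 6.2, 7.8]   # e.g., observed sales

model = LinearRegression().fit(X, y)       # the model learns from the data
print(model.predict([[5]]))                # a decision based on what was learned, ~[9.85]
```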

Metadata

Metadata is data that describes other data. Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are very basic document metadata. In addition to document files, metadata is used for images, videos, spreadsheets, and web pages.
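
A quick sketch of reading basic file metadata with Python’s standard library (the file name is hypothetical):

```python
from pathlib import Path
from datetime import datetime

path = Path("report.csv")  # hypothetical file
if path.exists():
    info = path.stat()     # metadata, not the file's contents
    print("size in bytes:", info.st_size)
    print("last modified:", datetime.fromtimestamp(info.st_mtime))
```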

Neural Network

Neural networks are a series of algorithms that recognize relationships in datasets through a process that resembles the way the human brain works. An important feature of such a system is that it can adapt to produce the best possible result without the output criteria having to be redesigned. Neural networks are very useful in financial areas; for instance, they can be used to forecast stock market prices.
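
As a toy sketch of the core mechanism, a single artificial "neuron" in NumPy that combines weighted inputs and passes the sum through an activation function (the weights here are made up, not learned):

```python
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias      # weighted sum of the inputs
    return 1 / (1 + np.exp(-z))             # sigmoid activation, output in (0, 1)

x = np.array([0.5, 0.8])        # e.g., two normalized market indicators
w = np.array([1.2, -0.7])       # weights a real network would learn in training
print(neuron(x, w, bias=0.1))   # the neuron's output signal, ~0.53
```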

Queries

Queries are the questions used to communicate with a database. Iterating over a big dataset record by record would usually be very time-consuming. In such a case, a query can be created, i.e., the database can be asked to return all records where a given condition is satisfied.
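
A minimal sketch using Python’s built-in sqlite3 module: instead of iterating over every record, the database is asked for the ones that satisfy a condition (the table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 900.0), (3, 1200.0)])

# The query: "give me every order above 500"
rows = conn.execute("SELECT * FROM orders WHERE amount > 500").fetchall()
print(rows)  # [(2, 900.0), (3, 1200.0)]
```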

Spark (Apache Spark)

Apache Spark is a fast, in-memory data processing engine for efficiently executing streaming, machine learning, or SQL workloads that require fast iterative access to datasets. Spark is generally a lot faster than MapReduce, which we discussed earlier.
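
A minimal PySpark sketch of the in-memory, SQL-style workflow; it assumes pyspark is installed and that a file events.csv with an event_type column exists:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # keep the data in memory for fast iterative access

# Count events per type; the column name "event_type" is an assumption.
df.groupBy("event_type").count().show()
spark.stop()
```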

SQL

SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL, or SQLite. Each of these systems has its own SQL dialect, which differs slightly from the others.
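
One small example of such a difference is parameter placeholders: SQLite’s driver uses "?", while PostgreSQL drivers such as psycopg2 typically use "%s". A runnable SQLite sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Ada", 36))

# A PostgreSQL driver (psycopg2) would phrase the same insert as:
#   cur.execute("INSERT INTO users VALUES (%s, %s)", ("Ada", 36))
print(conn.execute("SELECT name FROM users WHERE age > ?", (30,)).fetchall())
```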

TensorFlow & Keras

By default, Python does not ship with implementations of machine learning algorithms or data structures. A developer needs to implement them themselves or use ready-made libraries such as TensorFlow or Keras. TensorFlow is an open-source library for symbolic math calculations as well as machine learning. It has implementations for many languages, like Python, JavaScript, etc. Code written with it is fairly low-level, which can give reasonably high performance. Keras is also an open-source library used with Python; however, its code is more high-level, which makes the library itself friendlier for machine learning beginners than TensorFlow.
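
A minimal Keras sketch, assuming TensorFlow is installed: a one-neuron network that learns y = 2x from a handful of invented points:

```python
import numpy as np
from tensorflow import keras

x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0]])

model = keras.Sequential([keras.layers.Dense(1)])  # single linear neuron
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(x, y, epochs=300, verbose=0)             # the high-level training loop

print(model.predict(np.array([[5.0]])))            # should be close to 10
```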