Data science has been a buzzword in the community for over a decade and keeps evolving alongside emerging technologies such as artificial intelligence, Big Data, and ubiquitous computing. With nationalism at the forefront, many countries now view their data as a national resource and are working to leverage its potential to their advantage. Data science is not merely about analyzing data: it also involves processing the collected raw data, extracting information from it, and presenting that information to the right user in a timely manner.
Unstructured data can take the form of messages, photographs, online networking activity, and content from other sources. The huge volume of collected data, also known as Big Data, passes through several processes: it is first collected, then stored, filtered, classified, validated, and analyzed, and finally processed for visualization.
Broadly speaking, the main processes involved in any data science project are as follows:
- Data Collection
- Storing and extracting data
- Cleaning and refining data
- Performing statistical analysis
- Visualization with the help of graphs and charts
- Preparing predictive models for insights
The flow of data through a data science process can be represented as shown in the diagram.
In the lifecycle of any data science project, data passes through each stage and is transformed into the required information.
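The stages listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `raw_records` data, the `sales` field, and the helper names `clean`, `summarize`, and `text_chart` are all hypothetical, and real projects would normally use dedicated tools for each stage.

```python
import statistics

# Hypothetical raw records standing in for collected data.
# None, stray spaces, and "n/a" imitate the defects found in real raw data.
raw_records = [
    {"day": 1, "sales": "120"},
    {"day": 2, "sales": None},
    {"day": 3, "sales": "135"},
    {"day": 4, "sales": " 150 "},
    {"day": 5, "sales": "n/a"},
    {"day": 6, "sales": "165"},
]


def clean(records):
    """Cleaning and refining: drop records whose value is missing or not numeric."""
    cleaned = []
    for record in records:
        value = record["sales"]
        if value is None:
            continue
        try:
            cleaned.append({"day": record["day"], "sales": float(value.strip())})
        except ValueError:
            continue  # skip entries such as "n/a"
    return cleaned


def summarize(records):
    """Statistical analysis: basic descriptive statistics of the cleaned values."""
    values = [r["sales"] for r in records]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def text_chart(records, unit=10):
    """Visualization: a crude text bar chart, one '#' per `unit` of sales."""
    return [f"day {r['day']}: " + "#" * int(r["sales"] // unit) for r in records]


cleaned = clean(raw_records)
summary = summarize(cleaned)
for line in text_chart(cleaned):
    print(line)
print(summary)
```

Each function corresponds to one stage, so the raw records become progressively more useful information as they pass from one step to the next.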
A data scientist is a professional who practices data science using its tools and techniques. Tools are the equipment or software that data scientists use; techniques are the procedures they follow to perform a task. Data scientists combine tools and techniques to achieve the best possible outcomes.
Tools can be further classified by purpose as follows:
- Data collection tools (e.g., Semantria, Trackur)
- Data storage tools (e.g., Apache Hadoop, Apache Cassandra, MongoDB)
- Data extraction tools (e.g., Octoparse, Content Grabber)
- Data cleaning and refining tools (e.g., DataCleaner, OpenRefine)
- Data analysis tools (e.g., R, Apache Spark, Python)
- Data visualization tools (e.g., Python, Tableau, Orange, Google Fusion Tables)
Some important techniques practiced in data science include:
- Probability and Statistics
- Distribution
- Regression Analysis
- Descriptive Analysis
- Inferential Statistics
- Non-Parametric Statistics
- Hypothesis Testing
- Linear Regression
- Logistic Regression
- Neural Networks
- K-means Clustering
- Decision Trees
- Machine learning
- Artificial Intelligence etc.
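Of the techniques listed above, linear regression is simple enough to sketch without any library. The example below is a minimal ordinary least squares fit of a straight line; the `months` and `revenue` values and the function names `fit_line` and `predict` are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Use the fitted line to estimate y for a new x."""
    return slope * x + intercept


# Hypothetical data: revenue observed over four months.
months = [1, 2, 3, 4]
revenue = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(months, revenue)
print(slope, intercept, predict(5, slope, intercept))
```

In practice a data scientist would usually reach for one of the analysis tools mentioned earlier (such as R or a Python library) rather than writing the fit by hand, but the arithmetic is the same.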
The Data Scientist Venn Diagram, created by Stephan Kolassa, shows that data science is a meeting point of many disciplines, such as communication, mathematics and statistics, computer science and programming, and business or domain knowledge. A perfect data scientist is one who is strong in domain expertise, programming (also called hacking), statistics, and communication. Needless to say, it is one of the most sought-after jobs in today’s competitive world.
Some common data science deliverables are as follows.
- Prediction (predict a value based on inputs)
- Classification (e.g., spam or not spam)
- Recommendations (e.g., Amazon and Netflix recommendations)
- Feature extraction based on statistical learning
- Human action recognition in unknown video and smart video surveillance
- Anomaly detection (e.g., fraud detection)
- Recognition (image, text, audio, video, facial, …)
- Actionable insights (via dashboards, reports, visualizations, …)
- Intelligent decision making
- Scoring and ranking (e.g., FICO score)
- Segmentation (e.g., demographic-based marketing)
- Optimization (e.g., risk management)
- Forecasts (e.g., sales and revenue)
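As an illustration of one deliverable from the list above, anomaly detection, the sketch below flags values that lie far from the mean using a plain z-score rule. The transaction amounts, the function name `find_anomalies`, and the threshold of 2.0 are assumptions made for this example; production fraud detection relies on far richer models.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / spread > threshold
    ]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 250, 22]
print(find_anomalies(amounts))
```

A z-score rule is sensitive to the very outliers it is trying to find, because they inflate the standard deviation, so the threshold usually has to be tuned to the data at hand.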
Data science is a field that arose primarily out of industrial necessity, that is, in the context of real-world applications rather than as a research domain. Through meaningful information extraction and the discovery of actionable insights, the vast amounts of data that any industry gathers can be used to make critical business decisions and drive significant business change. This in turn helps achieve organizational goals, whether financial, operational, or strategic.
Now, let us look at some of the principal areas of application and research where data science is currently used and is at the forefront of innovation.
- Business Analytics – Collecting data on the performance of a business provides insight into how the business functions, supports decision-making, and helps build predictive models to forecast future performance. Although the two fields can be considered independent of each other, data science is in widespread use in business analytics.
- Prediction – The large amounts of data collected and analyzed can be used to identify patterns, which can in turn be used to build predictive models. This is the basis of the field of machine learning, where knowledge is discovered using induction algorithms and other algorithms that are said to “learn”. Machine learning techniques are used to build predictive models in numerous fields.
- Security – Data obtained from user logs is used to detect fraud. Patterns detected in user activity can be used to isolate cases of fraud and malicious insiders. Banks and other financial transaction services use data mining and machine learning algorithms to prevent fraud. Generally, banks use traditional public key infrastructure (PKI) for security, achieving secure communication through a trusted third party. As the number of network nodes increases, various problems appear in the centralized PKI system. On top of cryptographic problems, attacks on PKI systems have focused on the single point of failure of the certificate authority (CA). With the help of blockchain, a distributed PKI based on smart contracts is being explored for better customer privacy.
- Computer Vision – Data from image and video analysis is used to implement computer vision, the science of making computers “see”: using image data and learning algorithms, computers analyze images and make decisions accordingly. This can be used in robotics, autonomous vehicles, and human-computer interaction applications.
- Natural Language Processing – Modern NLP techniques use large amounts of textual data from corpora of documents to statistically model linguistic data and use these models to achieve tasks like machine translation, parsing, natural language generation and sentiment analysis.
- Bioinformatics – Bioinformatics is a fast-growing area where computers and data are used to understand biological data, such as genetics and genomics. It is used to better understand the basis of diseases, desirable genetic properties, and other biological properties. To discover a novel biomarker for a disease, biologists no longer rely on traditional laboratories alone; rather, they rely on huge and continuously growing genomic data made available by various research groups. Five types of data that are massive in size are used heavily in bioinformatics research:
- gene expression data
- DNA, RNA, and protein sequence data
- protein-protein interaction (PPI) data
- pathway data
- gene ontology (GO).
Machine learning tools are used to analyze both small-scale and large-scale data using techniques such as sampling, feature selection, and distributed computation. Informatics approaches that integrate many kinds of data with genomic data in disease research will help to better understand the genetic bases of drug response and disease.
- Science and Research – Well-known large-scale scientific experiments generate data from millions of sensors, and that data must be analyzed to draw meaningful conclusions. The data from modern telescopes and the climate data stored by the NASA Center for Climate Simulation are other examples of data science being used where the volume of data is so large that it falls within the new field of Big Data.
- Revenue Management – Revenue management has mainly focused on setting room prices and optimizing room inventory, relying on analytical data to recognize trends and make pricing and inventory management decisions. Its scope has expanded as the practice has become much more complex while offering new approaches to enhance revenue. Nowadays, revenue management strategy goes beyond pricing and inventory management: revenue managers should look for new ways to optimize revenue growth, analyze performance, and forecast demand.
- Government – Governing the economy through digitalization can lead to an improved climate for foreign investment, boost economic growth, and ultimately propel the country to the next chapter of its emerging-markets story. In the public sector, big data typically refers to the use of non-traditional data sources and data innovations to make government solutions more responsive and effective. Tools such as Big Data can be effective in collecting information about financial misappropriation and in enabling smarter surveillance. Data held by banks can help trace currency flow at an individual or a regional level. The promise of data science for better citizen engagement can only be realized if the feedback is effectively aligned with government incentives, mechanisms, and processes to take informed action.
In the healthcare sector, data science applications are used in:
- Medical image analysis
- Genetics & Genomics
- Drug development
- Virtual assistance for patients and customer support
Other areas where data science is of immense use include internet search, targeted advertising, website recommendations, advanced image recognition, speech recognition, airline route planning, gaming, and augmented reality.
Today the problem of recording, processing, and storing information exists in every field. Big Data has become important to an organization's everyday well-being, and satellite data is used to forecast weather, traffic jams, and natural disasters. Data science and Big Data influence business sales and marketing. Big Data can be used to predict fluctuations affecting a given company, and even farming is being transformed into Smart Farming, in which machines equipped with smart sensors and devices produce large amounts of data that provide unprecedented decision-making capabilities. This attracts more consumers who are focused on innovations in goods production and services. We can now use driverless smart transport, smart houses, the Internet of Things, and cloud computing, and feel more comfortable as data science develops. Data science is growing but still faces challenges: data challenges (for example volume, variety, velocity, veracity, volatility, quality, discovery, and dogmatism), process challenges, and management challenges (privacy, security, governance, and ethical aspects). Big Data analytics requires advanced techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing.
- Purpose of the article – To improve my professional experience
- Intended Audience – Scholars and IT professionals
- References / Sources of the information referred – Dhar, V. Jeff Leek.
- MOURI Tech services link – https://www.mouritech.com/services/data-science-machine-learning-artificial-intelligence-automation
Contact for further details:
Rajendra Bulusu
Application Development- ERP SAP
MOURI Tech