Data Challenges in the AI-ML Journey – Unstructured data

Posted on: March 31st, 2022

Posted by: Sudhakaran Jampala

The adoption of artificial intelligence and machine learning (AI/ML) has accelerated because cloud services offer cost-effective, near-limitless data storage and computing power.¹ According to Forrester, 53% of global data and analytics decision-makers report that they are at some stage of their AI/ML journey, whether implementation or post-implementation.² Gartner estimates that by 2025, 50% of cloud data centers will deploy advanced robots with AI/ML capabilities, resulting in 30% greater operating efficiency.³ Yet fear and uncertainty about AI/ML projects persist across companies, as an estimated 70% to 85% of data science projects fail.

The need for a data strategy to counter data challenges

To counter data complexity, particularly with unstructured data, many organizations launch a single project to organize all of their data, which usually means placing the data into a large data lake. This rarely works because the effort is not part of a Well-Architected ML lifecycle.

Exhibit 1: Well-Architected ML Lifecycle

Source: Amazon Web Services

Therefore, before running data analytics, it is imperative to analyze and investigate the datasets and summarize their main characteristics, typically using data visualization methods. This process is called Exploratory Data Analysis (EDA), and it makes it easier to discover patterns in the data, identify anomalies, test hypotheses, and check underlying assumptions.
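
As a rough illustration of what a first EDA pass can look like without any plotting, the sketch below computes summary statistics for one field, flags anomalies with Tukey's interquartile-range rule, and measures the relationship between two fields with a Pearson correlation. The sample values and the field names (`order_values`, `ad_spend`) are invented for the example, and it uses only the Python standard library rather than any particular analytics tool.

```python
import math
import statistics

# Hypothetical sample data: order values with one suspicious entry.
order_values = [120.0, 135.5, 128.0, 131.2, 940.0, 125.4, 129.9, 133.1]
ad_spend = [10.0, 12.5, 11.0, 11.8, 12.0, 10.6, 11.4, 12.1]


def summarize(values):
    """Return basic univariate summary statistics for a numeric field."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long fields."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


summary = summarize(order_values)
anomalies = iqr_outliers(order_values)
relationship = pearson(ad_spend, order_values)
```

Here the large gap between `summary["mean"]` and `summary["median"]` already hints at the anomaly that `iqr_outliers` then isolates, which is the kind of signal EDA is meant to surface before any modeling starts.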

Exhibit 2: Statistical functions and techniques possible with EDA

Function | Brief Description
Clustering and dimension reduction | Help to develop graphical displays of high-dimensional data containing many variables.
Univariate visualization | Visualizes each field in the raw dataset individually and offers summary statistics.
Bivariate visualizations | Together with summary statistics, help assess the relationship between each variable in the dataset and the target variable.
Multivariate visualizations | Map and explain the interactions between various data fields.
K-means clustering | A clustering method in unsupervised learning, commonly used in market segmentation, pattern recognition, and image compression.
Predictive models | Models such as linear regression use statistics and data to predict outcomes.

Source: IBM
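
To make the K-means entry in Exhibit 2 more concrete, here is a minimal sketch of the standard Lloyd's iteration: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes. The function name `kmeans`, the `seed` parameter, and the sample points are assumptions made for illustration. A production project would more likely rely on an established library implementation.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of the members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

With data this well separated, the first three points end up in one cluster and the last three in the other. On real data the result depends on the starting centroids, so it is common to run the algorithm several times with different seeds.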

The data for analysis is rarely available in a readily structured or usable form; it might contain errors and omissions, and it may lack contextual metadata. To bring the data into a usable format, data scientists use data wrangling, which covers data cleansing, data validation, and structuring of the raw data.

Exhibit 3: Key Data Wrangling Activities

Activity | Brief Description
Discovering | Exploring the raw data to understand its structure, contents, and quality before deciding how to treat it
Structuring | Restructuring the unstructured data by reshaping or merging it for easier analysis
Cleaning | Making corrections and removing inaccurate data to boost data quality
Enriching | Adding further data to augment the existing data
Validating | Verifying the data's consistency, quality, and security
Publishing | Pushing the treated data down the data pipeline for analytical use

Source: Techcanvass
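
The structuring, cleaning, and validating activities in Exhibit 3 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the raw rows, the field names, and the validation rules (non-negative amount, two-letter country code) are all hypothetical, and real pipelines usually log rejected rows instead of silently dropping them.

```python
# Hypothetical raw export with inconsistent keys, stray whitespace,
# an unparseable amount, a duplicate row, and an invalid negative amount.
raw_rows = [
    {"Customer ID": " 101 ", "Country": "in", "Amount": "1,250.50"},
    {"Customer ID": "102", "Country": "US", "Amount": "n/a"},
    {"Customer ID": " 101 ", "Country": "in", "Amount": "1,250.50"},
    {"Customer ID": "103", "Country": "de", "Amount": "-40"},
]


def structure(row):
    """Structuring: normalize field names to snake_case keys."""
    return {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}


def clean(row):
    """Cleaning: trim whitespace, fix casing, parse numbers; None if unusable."""
    try:
        customer_id = int(row["customer_id"].strip())
        amount = float(row["amount"].replace(",", ""))
    except ValueError:
        return None
    return {
        "customer_id": customer_id,
        "country": row["country"].strip().upper(),
        "amount": amount,
    }


def validate(row):
    """Validating: apply simple consistency rules (assumed for this example)."""
    return row["amount"] >= 0 and len(row["country"]) == 2


def wrangle(rows):
    """Run structuring, cleaning, validation, and de-duplication in order."""
    seen = set()
    result = []
    for raw in rows:
        row = clean(structure(raw))
        if row is None or not validate(row):
            continue
        key = (row["customer_id"], row["country"], row["amount"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        result.append(row)
    return result


usable_rows = wrangle(raw_rows)
```

Of the four raw rows, only one survives: the second has an unparseable amount, the third duplicates the first, and the fourth fails validation.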

EDA leads to feature engineering and feature selection. Feature engineering takes raw data from the selected datasets and transforms it into “features” that better represent the underlying problem to be solved. “Features” are fixed-size arrays of numbers that AI/ML algorithms can understand. Feature engineering includes data cleansing, and it can account for the largest share of the time spent on an AI/ML project.
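
As a small illustration of turning raw records into fixed-size numeric arrays, the sketch below applies two common transformations: min-max scaling for a numeric field and one-hot encoding for a categorical field. The record layout and the helper names are assumptions for the example, not a prescribed method.

```python
def min_max_scale(value, low, high):
    """Scale a number into the range [0, 1] given the observed bounds."""
    if high == low:
        return 0.0  # constant column: no spread to scale
    return (value - low) / (high - low)


def one_hot(value, categories):
    """Encode a category as a vector with a single 1.0 entry."""
    return [1.0 if value == category else 0.0 for category in categories]


def build_features(records):
    """Turn each record into a fixed-length list of floats."""
    amounts = [record["amount"] for record in records]
    low, high = min(amounts), max(amounts)
    categories = sorted({record["country"] for record in records})
    return [
        [min_max_scale(record["amount"], low, high)]
        + one_hot(record["country"], categories)
        for record in records
    ]


# Hypothetical cleaned records.
records = [
    {"amount": 100.0, "country": "IN"},
    {"amount": 300.0, "country": "US"},
    {"amount": 200.0, "country": "IN"},
]
features = build_features(records)
```

Every row in `features` has the same length (one scaled amount plus one slot per country), which is what lets a learning algorithm treat the rows as comparable inputs.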

The optimum finish

After feature engineering and selection, the next step is training. Training and optimizing an ML model is a largely iterative process. Training is the most intensive step of the entire life cycle, and keeping track of the results of each experiment quickly becomes complex as iterations accumulate. Data scientists can face operational frustrations at this stage when they lack the means to record the precise configuration of each run. Tracking tools simplify the task of recording the data, the selected features, and the model parameters alongside the performance metrics. Experiments can then be compared side by side, making the differences in performance clear.
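
The bookkeeping described above can be sketched as a tiny in-memory experiment log. This is not the interface of any specific tracking tool. The `Run` and `ExperimentLog` classes, the run names, and the metric values are all hypothetical and only show which pieces of information are worth recording for each run.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One training experiment: what went in and what came out."""
    name: str
    dataset: str
    features: list
    params: dict
    metrics: dict


class ExperimentLog:
    """Keeps runs by name so they can be ranked and compared."""

    def __init__(self):
        self.runs = {}

    def record(self, run):
        self.runs[run.name] = run

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        chooser = max if higher_is_better else min
        return chooser(self.runs.values(), key=lambda run: run.metrics[metric])

    def compare(self, name_a, name_b):
        """Show which parameters changed and how shared metrics moved."""
        a, b = self.runs[name_a], self.runs[name_b]
        changed_params = {
            key: (a.params.get(key), b.params.get(key))
            for key in sorted(set(a.params) | set(b.params))
            if a.params.get(key) != b.params.get(key)
        }
        metric_deltas = {
            key: b.metrics[key] - a.metrics[key]
            for key in sorted(set(a.metrics) & set(b.metrics))
        }
        return {"params": changed_params, "metrics": metric_deltas}


log = ExperimentLog()
log.record(Run("baseline", "orders_v1", ["amount", "country"],
               {"learning_rate": 0.1, "depth": 3}, {"accuracy": 0.88}))
log.record(Run("deeper", "orders_v1", ["amount", "country"],
               {"learning_rate": 0.1, "depth": 5}, {"accuracy": 0.91}))
diff = log.compare("baseline", "deeper")
```

In this made-up example, `diff` shows that only `depth` changed between the two runs and that accuracy rose by roughly three points, which is the side-by-side view the paragraph above describes.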

Significant versions of a model need to be captured for possible later use; this challenge is called reproducibility. The objective is to save enough information about the environment in which the model was developed so that the model can be reproduced from scratch with similar results. Without reproducibility, the handover of the model into production (or to DevOps) will be riddled with inefficiencies.
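
One way to picture "enough information about the environment" is a small manifest saved next to each significant model version. The sketch below is an assumption about what such a manifest might hold (interpreter version, platform, a hash of the training data, the parameters, and the random seed). A real project would typically also pin library versions and the code revision.

```python
import hashlib
import json
import platform
import random


def build_manifest(training_data, params, seed):
    """Collect the details needed to rerun training under the same conditions."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # Hash of the exact bytes used for training, to detect data drift.
        "data_sha256": hashlib.sha256(training_data).hexdigest(),
        "params": params,
        "seed": seed,
    }


def seeded_shuffle(items, seed):
    """Shuffle with a dedicated generator so the order is repeatable."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical training data and parameters.
training_data = b"customer_id,amount\n101,1250.5\n"
manifest = build_manifest(training_data, {"depth": 5}, seed=42)
manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
```

Storing `manifest_json` alongside the model lets whoever receives the handover check that they are training on the same data, with the same parameters and seed, before comparing results.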


¹ https://d1.awsstatic.com/psc-digital/2021/gc-400/mining-insights-fsi-/AWS_Mining_Intelligent_Insights_with_Machine_Learning_Financial_Services_eBook.pdf

² https://www.mobiquity.com/insights/embarking-on-the-ai-ml-journey

³ https://www.gartner.com/en/newsroom/press-releases/2021-11-01-gartner-predicts-half-of-cloud-data-centers-will-deploy-robots-with-ai-capabilties-by-2025

https://www.servercomputeworks.com/datasheets/AI-Journey-whitepaper.pdf

https://www.ibm.com/downloads/cas/EBJQ6K7M

https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/wellarchitected-machine-learning-lens.pdf

https://www.ibm.com/in-en/cloud/learn/exploratory-data-analysis

https://businessanalyst.techcanvass.com/what-is-data-wrangling-and-exploratory-analysis/

https://itlligenze.com/uploads/5/137039/files/oreilly-ml-ops.pdf
