
Data Challenges in the AI-ML Journey

Posted on : March 31st 2022

Author : Sudhakaran Jampala

The adoption of artificial intelligence and machine learning (AI/ML) has accelerated because cloud services offer cost-effective and near-limitless data storage and computing power.¹ According to Forrester, 53% of global data and analytics decision-makers report being at some stage of their AI/ML journey, either implementing it or already past implementation.² Gartner estimates that by 2025, 50% of cloud data centers will deploy advanced robots with AI/ML capabilities, resulting in 30% greater operating efficiency.³ Yet there is fear and uncertainty across companies regarding AI/ML projects, as an estimated 70% to 85% of data science projects fail.

The need for a data strategy to counter data challenges

A common data strategy many organizations adopt to counter data complexities, particularly unstructured data, is to launch a single project to get all of their data organized, usually by placing it in a large data lake. This rarely works, because it is not part of a Well-Architected ML lifecycle.

Exhibit 1: Well-Architected ML Lifecycle


Source: Amazon Web Services

It is therefore imperative to analyze and investigate datasets and summarize their main characteristics before running data analytics, typically with the help of data visualization methods. This process is called Exploratory Data Analysis (EDA), and it makes it easier to discover patterns in the data, identify anomalies, test hypotheses, and check underlying assumptions. Exhibit 2 lists common EDA techniques, and an illustrative sketch follows it.

Exhibit 2: Statistical functions and techniques possible with EDA

Function | Brief Description
Clustering and dimension reduction | Helps develop graphical displays of high-dimensional data containing many variables.
Univariate visualization | Visualizes each field in the raw dataset on its own and provides summary statistics.
Bivariate visualization | Uses paired plots and summary statistics to assess the relationship between each variable in the dataset and the target variable.
Multivariate visualization | Maps interactions between multiple data fields to show how they relate to one another.
K-means clustering | An unsupervised clustering method commonly used in market segmentation, pattern recognition, and image compression.
Predictive models | Models such as linear regression that use statistics and data to predict outcomes.

Source: IBM
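
To make these techniques concrete, here is a minimal Python sketch of a few of the EDA steps from Exhibit 2, applied to a hypothetical customers.csv file. The file name, the "churned" target column, and the cluster count are assumptions for illustration, not details from any cited source.

```python
# Minimal EDA sketch (illustrative only): univariate and bivariate summaries
# plus k-means clustering on a hypothetical customers.csv dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")          # hypothetical dataset

# Univariate view: summary statistics and a histogram for each field
print(df.describe(include="all"))
df.hist(figsize=(10, 8))

# Bivariate view: relationship between each numeric variable and an assumed target column
sns.pairplot(df, hue="churned")

# K-means clustering on the numeric columns, a common EDA segmentation step
numeric = df.select_dtypes("number").dropna()
scaled = StandardScaler().fit_transform(numeric)
df.loc[numeric.index, "cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(scaled)

plt.show()
```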

Data is rarely available in a readily structured or usable form for analysis; it may contain errors and omissions and lack contextual metadata. To bring raw data into a usable shape, data scientists turn to data wrangling: cleansing, validating, and structuring the data. Exhibit 3 summarizes the key activities, and an illustrative sketch follows it.

Exhibit 3: Key Data Wrangling Activities

Activity | Brief Description
Discovering | Exploring the raw data to understand what it contains and how it can be used
Structuring | Restructuring unstructured data by reshaping or merging it for easier analysis
Cleaning | Correcting errors and removing inaccurate data to boost data quality
Enriching | Adding supplementary data to augment the existing data
Validating | Verifying the data’s consistency, quality, and security
Publishing | Pushing the treated data down the data pipeline for analytical use

Source: Techcanvass
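
As a rough illustration of the activities in Exhibit 3, the sketch below structures, cleans, enriches, validates, and publishes a hypothetical raw orders dataset with pandas. The file names, column names, and validation rules are assumptions, not from the source.

```python
# Data wrangling sketch (illustrative): structuring, cleaning, enriching,
# and validating a hypothetical raw orders dataset before analysis.
import pandas as pd

raw = pd.read_csv("orders_raw.csv")                       # hypothetical input

# Structuring: normalize column names and parse dates
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")

# Cleaning: drop duplicates and remove rows with impossible values
clean = raw.drop_duplicates()
clean = clean[clean["amount"] > 0].copy()

# Enriching: add a field derived from existing data (here, the order month)
clean["order_month"] = clean["order_date"].dt.to_period("M")

# Validating: simple consistency checks before publishing downstream
assert clean["order_id"].is_unique, "order_id must be unique"
assert clean["order_date"].notna().all(), "all orders need a valid date"

# Publishing: write the treated data for analytical use
clean.to_csv("orders_clean.csv", index=False)
```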

EDA leads into feature engineering and feature selection. Feature engineering takes raw data from the selected datasets and transforms it into “features” that better represent the underlying problem to be solved: fixed-length numeric arrays that AI/ML algorithms can work with. Feature engineering includes data cleansing, and in terms of time spent it can be the largest part of an AI/ML project.
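
A minimal feature engineering sketch follows, assuming a cleaned pandas DataFrame with mixed numeric and categorical columns (all names are hypothetical): it turns raw fields into the fixed-length numeric vectors an ML algorithm expects.

```python
# Feature engineering sketch (illustrative): transforming raw columns into
# fixed-length numeric feature vectors with scikit-learn.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("customers_clean.csv")                   # hypothetical cleaned data

numeric_cols = ["age", "tenure_months", "monthly_spend"]  # assumed columns
categorical_cols = ["plan_type", "region"]                # assumed columns

# Scale numeric fields and one-hot encode categorical ones into numbers
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

features = preprocess.fit_transform(df)   # every row becomes a fixed-length numeric vector
print(features.shape)
```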

The optimum finish

After feature engineering and selection, the next step is training. Training and optimizing an ML model is an inherently iterative process, and it is the most intensive step of the entire lifecycle; keeping track of the results of each experiment quickly becomes complex as iterations accumulate. Data scientists can face operational frustration at this stage if there is no way to record the precise configuration of each run. Tracking tools make it easier to remember the data, the features selected, and the model parameters together with the resulting performance metrics, so experiments can be compared side by side and differences in performance are plainly visible.
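
Dedicated tracking tools exist for this, but even a minimal sketch like the one below (plain Python, with hypothetical parameter and file names) shows the idea: persist the configuration, features, and metrics of every run so experiments can be compared side by side.

```python
# Minimal experiment tracking sketch (illustrative): append each run's
# configuration and metrics to a JSON-lines log for side-by-side comparison.
import json
import time

def log_experiment(params, features, metrics, path="experiments.jsonl"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,        # model hyperparameters used for this run
        "features": features,    # feature columns selected for this run
        "metrics": metrics,      # resulting performance metrics
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage with hypothetical values
log_experiment(
    params={"model": "logistic_regression", "C": 1.0},
    features=["age", "tenure_months", "plan_type"],
    metrics={"accuracy": 0.87, "f1": 0.81},
)
```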

Significant versions of a model need to be captured for possible later use; this challenge is called reproducibility. The objective is to save enough information about the environment in which the model was developed that the model can be rebuilt from scratch with similar results. Without reproducibility, the handover of the model into production (or to DevOps) will be riddled with inefficiencies.
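
A rough sketch of what “enough information to reproduce” can mean in practice, assuming a scikit-learn model saved with pickle (every name here is illustrative): alongside the model artifact, capture the library versions, the random seed, and a fingerprint of the training data.

```python
# Reproducibility sketch (illustrative): save the trained model together with
# the environment details needed to rebuild it from scratch.
import hashlib
import json
import pickle
import sys

import sklearn

def save_with_environment(model, train_csv_path, seed, model_path="model.pkl"):
    # Persist the model artifact itself
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Fingerprint the training data so the exact inputs can be verified later
    with open(train_csv_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    # Record the environment alongside the model
    manifest = {
        "python_version": sys.version,
        "sklearn_version": sklearn.__version__,
        "random_seed": seed,
        "training_data_sha256": data_hash,
    }
    with open(model_path + ".manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
```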


¹ https://d1.awsstatic.com/psc-digital/2021/gc-400/mining-insights-fsi-/AWS_Mining_Intelligent_Insights_with_Machine_Learning_Financial_Services_eBook.pdf

² https://www.mobiquity.com/insights/embarking-on-the-ai-ml-journey

³ https://www.gartner.com/en/newsroom/press-releases/2021-11-01-gartner-predicts-half-of-cloud-data-centers-will-deploy-robots-with-ai-capabilties-by-2025

https://www.servercomputeworks.com/datasheets/AI-Journey-whitepaper.pdf

https://www.ibm.com/downloads/cas/EBJQ6K7M

https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/wellarchitected-machine-learning-lens.pdf

https://www.ibm.com/in-en/cloud/learn/exploratory-data-analysis

https://businessanalyst.techcanvass.com/what-is-data-wrangling-and-exploratory-analysis/

https://itlligenze.com/uploads/5/137039/files/oreilly-ml-ops.pdf
