Timon Harz
December 12, 2024
Decoding Unstructured Text with AI: How Artificial Intelligence Extracts Valuable Signals
In this article, we delve into how advanced analytics can transform unstructured text data into clear, actionable insights. Learn how technologies like AI and sentiment analysis can streamline your data processing, enhance decision-making, and uncover valuable business opportunities.

In today’s fast-moving business world, data is essential for driving innovation and progress. By utilizing advanced analytics, artificial intelligence, and emerging technologies, organizations can improve decision-making, streamline operations, identify new opportunities, and turn challenges into successes. A strategic approach to data involves understanding and analyzing various data types—structured, unstructured, and semi-structured—to extract valuable insights. This article explores a case study focused on processing unstructured text data from multiple sources to generate clear, actionable insights.
Additionally, it highlights the challenges faced in manual data processing, such as bias, privacy issues, and inefficiencies, while also emphasizing the advantages of leveraging technology to minimize processing time and enhance decision-making.
The case study provided an opportunity to develop an advanced AI solution that extracts meaningful signals from unstructured text, uncovers actionable insights, and generates summaries that highlight key themes, topics, and sentiments. This is further enhanced with an insight navigation tree. Our innovative methodology follows a six-step process to deliver concise, actionable insights incrementally, while preserving their core meaning.

We describe each component in turn below.
Data Collection and Preprocessing
To gather and ingest unstructured text datasets in batches, we examined a range of methods for transforming raw data into a structured format suitable for analysis. This was a critical step to ensure the data was clean, coherent, and ready for deeper processing. Our batch processing framework included the use of Small Language Models (SLMs) like Phi 3.5 to tackle challenges such as ambiguous sentences, emoji interpretation, and slang conversion.
Ambiguous Sentences
Objective: We used SLMs to clarify ambiguous text, improving coherence and quality through rewriting. This ensured the text was well-organized and easier to analyze. This approach proved particularly helpful for reviews and feedback, where the original content often included irrelevant or unclear details.
Example:
Before: "Hi! The product I bought from https://example.com last week is amazing! But the delivery was late."
After: "Product purchased last week was great, but delivery was delayed."
Emoji Conversion
Objective: Feedback and review texts frequently included emojis that expressed complex emotions but were not always consistently understood by analysis tools. By converting emojis into textual descriptions, we improved sentiment analysis accuracy and captured the true sentiment more effectively. This ensured clearer context and more reliable sentiment interpretation than traditional text analysis methods.
Example:
Before: “The product is 😍”
After: “The product is amazing.”
Handling Slang
Objective: Reviews frequently included slang, abbreviations, or incomplete sentences that could make the text difficult to understand. By incorporating context into the rewriting prompts, we were able to convert slang into standard language and complete fragmented thoughts, enhancing the overall quality and coherence of the text. Clarifying slang and unfinished sentences ensured the message was clear and easy to interpret.
Example:
Before: "The product is lit. But… delivery was late."
After: "The product is excellent, but the delivery was delayed."
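The three normalization passes above can share a single rewriting prompt sent to the SLM. The sketch below shows one plausible way to assemble such a prompt; the instruction wording and the `build_rewrite_prompt` helper are illustrative assumptions, not the exact prompt used in the case study.

```python
# Sketch of a text-normalization prompt for an SLM such as Phi 3.5.
# The instruction wording here is an illustrative assumption, not the
# exact prompt used in the case study.

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's text so that it is clear and well-formed:\n"
    "1. Remove URLs, greetings, and irrelevant details.\n"
    "2. Replace emojis with words describing the emotion they convey.\n"
    "3. Convert slang and abbreviations into standard language.\n"
    "4. Complete fragmented sentences while preserving the original meaning.\n"
    "Return only the rewritten text."
)

def build_rewrite_prompt(raw_text: str) -> str:
    """Assemble the normalization prompt sent to the SLM."""
    return f"{REWRITE_INSTRUCTIONS}\n\nText:\n{raw_text}\n\nRewritten text:"

prompt = build_rewrite_prompt("The product is lit. But... delivery was late.")
```

In practice the same prompt handles all three cases (ambiguity, emojis, slang) in one model call, which keeps the batch pipeline simple.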
Clustering
To manage the vast amounts of unstructured data, we turned to unsupervised machine learning, particularly clustering, to organize the chaotic text into cohesive, manageable groups. This approach allowed us to identify common themes and topics, enhancing the clarity and effectiveness of our analysis.
In document summarization, clustering helped group similar sections together for more efficient extraction of key information. For topic modeling, it uncovered complex hierarchical structures within large text datasets. When analyzing customer feedback, clustering improved efficiency by organizing reviews into categories that highlighted recurring themes and sentiments. Additionally, clustering played a role in spam filtering, social network analysis, content-based profiling, and data reduction, making it a crucial tool for insight extraction and better decision-making.
Although we were confident in the power of clustering, we carefully evaluated the most suitable method for our objectives, comparing options including k-means, DBSCAN, and hierarchical agglomerative clustering.
After a thorough review, we selected agglomerative clustering, which was ideal for exploring the nested structure of our data. This method offered several advantages over alternatives like k-means and DBSCAN:
No need to predefine the number of clusters: Unlike k-means, agglomerative clustering's flexibility allowed us to avoid setting the number of clusters beforehand.
Adaptability to different cluster shapes and sizes: Agglomerative clustering managed clusters of various shapes and sizes more effectively than k-means, which assumes clusters are spherical and similar in size.
Better handling of noise and outliers: While DBSCAN is also effective at dealing with noise, agglomerative clustering’s robustness in this area made it our preferred choice.
Scalability: Although agglomerative clustering is computationally more intensive than DBSCAN, we optimized it with single and complete linkage to handle our large datasets.
Simpler implementation: DBSCAN required careful tuning of parameters like eps and minPts, which proved challenging. Agglomerative clustering, on the other hand, didn’t require such density parameters, making it easier to implement.
Handling varying densities: While DBSCAN struggled with clusters of different densities, agglomerative clustering formed clusters based on similarity rather than density, which allowed us to group text by topics or words.
Deterministic nature: Unlike DBSCAN, agglomerative clustering was deterministic, providing consistent results for a given dataset.
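To make the bottom-up merging idea concrete, here is a minimal pure-Python sketch of single-linkage agglomerative clustering with a distance threshold in place of a predefined cluster count. It is a toy illustration of the algorithm, not the optimized library implementation used in the case study (in practice one would run something like sklearn.cluster.AgglomerativeClustering over sentence embeddings).

```python
# Minimal single-linkage agglomerative clustering in pure Python.
# Illustrative sketch only: real pipelines would use an optimized
# library implementation over high-dimensional text embeddings.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage_distance(c1, c2, points):
    # Single linkage: distance between the two closest members.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative_cluster(points, threshold):
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:  # Stop merging: no predefined cluster count needed.
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups of toy 2-D "embeddings", far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = agglomerative_cluster(points, threshold=1.0)
```

Because merging stops when the nearest clusters are farther apart than the threshold, the number of groups emerges from the data itself, which is the flexibility the first bullet above refers to.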
MapReduce Summarization
After clustering the unstructured text into coherent groups, we employed the MapReduce framework to process these clusters in parallel, generating summaries that were both meaningful and representative of the content.
Map Step: The text was divided into smaller chunks or sub-documents and processed in parallel, speeding up the summarization process. Each chunk was summarized independently using a language model.
Reduce Step: The individual summaries from the map step were then combined into a single, cohesive summary that encapsulated the key points from the entire document.
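The two steps can be sketched as follows. Note that `summarize` here is a trivial placeholder standing in for a language-model call (it simply keeps the first sentence), so the control flow is runnable without a model; the chunk size and helper names are illustrative, not the production values.

```python
# Sketch of MapReduce-style summarization. `summarize` is a placeholder
# for a language-model call; here it keeps only the first sentence of
# its input, purely so the map/reduce control flow is runnable.
from concurrent.futures import ThreadPoolExecutor

def summarize(text: str) -> str:
    # Placeholder: a real implementation would prompt an LLM here.
    return text.split(". ")[0].rstrip(".") + "."

def chunk(document: str, size: int) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def map_reduce_summarize(document: str, chunk_size: int = 50) -> str:
    chunks = chunk(document, chunk_size)
    # Map: summarize each chunk independently, in parallel.
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(summarize, chunks))
    # Reduce: combine the partial summaries into one final summary.
    return summarize(" ".join(partial_summaries))

result = map_reduce_summarize(
    "First point. More detail. Second point. More detail.", chunk_size=4
)
```

Because each chunk is summarized independently in the map step, the work parallelizes cleanly, which is the scalability advantage discussed below.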
We found the MapReduce summarization technique highly effective for summarizing large collections of unstructured text by first summarizing each chunk (map) and then combining them (reduce), particularly when sub-documents didn’t require significant context from other sections. This method was both scalable and efficient for handling large volumes of text. While we also explored alternative techniques like the Refine Method and Stuff Chain, each presented unique strengths and challenges.
Refine Method: The Refine method streamlined summarization by gradually updating the summary with each new document, making it ideal for continuous data streams.
Stuff Chain Method: The Stuff Chain method processed large documents by splitting them into chunks, summarizing each individually, and then merging them. While it was efficient for large files, it was constrained by the language model’s context window.
Here’s a comparison of these methods against MapReduce:
MapReduce vs. Refine Method
Scalability: MapReduce excelled in scalability, processing data in parallel, whereas the Refine method worked more sequentially.
Complexity: The Refine method was simpler to implement but struggled with large datasets, while MapReduce was better suited for handling vast amounts of data.
Functionality: MapReduce was more robust for large-scale summarization, while the Refine method was ideal for continuous data streams.
MapReduce vs. Stuff Chain
Scalability: MapReduce was more scalable, capable of handling larger volumes of text, while Stuff Chain was limited by the language model’s context window.
Efficiency: MapReduce’s parallel processing made it more efficient for large datasets, whereas Stuff Chain had difficulty with very large corpora.
Flexibility: MapReduce was more adaptable to various data structures and sizes, while Stuff Chain, although straightforward, was less flexible in dealing with complex data.
Attributes Identification
After sourcing and preprocessing the unstructured text data, we utilized agglomerative clustering and MapReduce summarization techniques to organize the data into meaningful clusters. The next critical step involved identifying key attributes within these clusters, specifically extracting prominent topics to provide a high-level overview of each group’s subject matter. This process enabled efficient labeling and deeper insight into the data.
To achieve this, we experimented with several topic modeling techniques, including the use of the BERTopic library. BERTopic leverages transformer embeddings to capture the semantic meaning of the text, ensuring the topics are coherent and contextually relevant. Here's a brief overview of the approach we used:
Embedding the text: We used the SentenceTransformer model to generate embeddings, which capture the semantic meaning of the text.
Topic Modeling with BERTopic: Using the generated embeddings, we applied the BERTopic model to identify topics within the text.
Analyzing the Topics: We extracted detailed information about the identified topics and visualized them for a clearer understanding.
This process enabled us to effectively identify key attributes in the data, generating coherent and contextually relevant topics for each cluster.
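As a greatly simplified stand-in for this pipeline (BERTopic itself requires its own library and a transformer model), the sketch below scores each term by its frequency within a cluster, discounted by how many clusters it appears in — similar in spirit to BERTopic's class-based TF-IDF — and reports the top terms as that cluster's topic label. All names and the toy data are illustrative.

```python
# Simplified stand-in for BERTopic's c-TF-IDF step: score each term by
# its frequency within a cluster, discounted by how many clusters it
# appears in, then report the top terms as that cluster's "topic".
import math
from collections import Counter

def top_terms_per_cluster(clusters, n_terms=3):
    term_counts = [Counter(" ".join(docs).lower().split()) for docs in clusters]
    n_clusters = len(clusters)

    def icf(term):
        # Inverse "cluster frequency": terms rare across clusters score higher.
        df = sum(1 for counts in term_counts if term in counts)
        return math.log(1 + n_clusters / df)

    topics = []
    for counts in term_counts:
        scored = {t: c * icf(t) for t, c in counts.items()}
        topics.append(sorted(scored, key=scored.get, reverse=True)[:n_terms])
    return topics

clusters = [
    ["battery life is great", "battery drains fast"],
    ["shipping was slow", "shipping box arrived damaged"],
]
topics = top_terms_per_cluster(clusters, n_terms=1)
```

A real implementation would also strip stop words and use the transformer embeddings for clustering, but the weighting idea — frequent within a cluster, rare across clusters — is the same.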
While BERTopic proved to be a powerful tool for topic modeling, we also explored several other techniques, including Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), and Top2Vec. Additionally, integrating Large Language Models (LLMs) and Small Language Models (SLMs) for prompting further enhanced our results. These models demonstrated exceptional capability in understanding context, nuances, and the relationships between topics, offering a deeper and more comprehensive analysis of the data. Both LLMs and SLMs were valuable for topic modeling: LLMs provided a richer, more context-aware understanding, while SLMs were more efficient and faster for specific applications.
Generating Prime Synopsis
To extract maximum value from unstructured text data, we created a "golden summary" or prime synopsis—essentially a summary of summaries. This provided a concise yet comprehensive overview of all clusters, ensuring that no vital information was missed.
To achieve this, we used advanced techniques like the Chain of Density (CoD) and Chain of Verification (CoVe) prompt engineering methods. The CoD technique improved the quality of AI-generated summaries by gradually increasing the information density without significantly expanding the summary's length. By systematically adding key details while maintaining readability, CoD enhanced the relevance and clarity of the summaries. Meanwhile, the CoVe technique bolstered the reliability of responses from Large Language Models (LLMs) by verifying their outputs. By breaking down complex queries into simpler verification tasks, CoVe improved the overall quality of generated content and fostered trust in AI-generated responses across various applications.
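The CoD loop can be sketched as below. Each round asks the model to fold a few missing entities into the summary without lengthening it. The prompt wording is an illustrative assumption, and `call_llm` is a placeholder (not a real API) that simply echoes the previous summary so the loop structure is runnable.

```python
# Sketch of a Chain of Density (CoD) prompting loop. `call_llm` is a
# placeholder, NOT a real API: it echoes the previous summary back so
# the iteration structure can run without a model.
COD_STEP = (
    "Identify 1-3 informative entities from the article that are missing "
    "from the previous summary, then rewrite the summary to include them "
    "without increasing its length.\n\n"
    "Article:\n{article}\n\n"
    "Previous summary:\n{summary}\n\nDenser summary:"
)

def call_llm(prompt: str) -> str:
    # Placeholder: extract and return the previous summary unchanged.
    return prompt.rsplit("Previous summary:\n", 1)[1].split("\n\nDenser summary:")[0]

def chain_of_density(article: str, initial_summary: str, rounds: int = 3) -> str:
    summary = initial_summary
    for _ in range(rounds):
        summary = call_llm(COD_STEP.format(article=article, summary=summary))
    return summary

final = chain_of_density("full article text here", "a first draft summary")
```

With a real model in place of the placeholder, each iteration raises the information density of the summary while the length constraint keeps it readable.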
Sentiment Tagging
Alongside summarization, we performed sentiment tagging to identify the emotions (positive, negative, neutral) associated with each text entry. This enabled us to not only determine the topics being discussed but also understand the emotions people had towards them. For example, sentiment analysis highlighted which topics received positive feedback and which were met with dissatisfaction.
To ensure accurate sentiment tagging, we employed several methods to classify texts as positive, negative, or neutral. One of the main challenges in sentiment analysis was dealing with paradoxical sentiments and emojis, which were difficult to categorize into traditional emotional labels.
To tackle this, we used the Chain-of-Thought (CoT) reasoning technique. This approach improved LLMs' ability to handle conflicting sentiments by guiding the model through a step-by-step reasoning process, allowing it to weigh the various emotional cues before making a final determination. For example, when analyzing feedback with mixed emotions about a service, CoT helped clarify whether the overall sentiment was more positive or negative based on the context and arguments provided.
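A CoT sentiment prompt for such mixed-sentiment text might look like the sketch below. The instruction wording is an illustrative assumption, not the exact prompt used in the case study.

```python
# Sketch of a Chain-of-Thought sentiment prompt for mixed-sentiment text.
# The instruction wording is an illustrative assumption, not the exact
# prompt used in the case study.
COT_SENTIMENT_PROMPT = (
    "Classify the overall sentiment of the review as positive, negative, "
    "or neutral. Reason step by step before answering:\n"
    "1. List each positive cue (words, emojis) in the review.\n"
    "2. List each negative cue.\n"
    "3. Weigh which cues dominate the reviewer's overall judgment.\n"
    "4. On the final line, output exactly one label.\n\n"
    "Review:\n{review}\n"
)

def build_cot_prompt(review: str) -> str:
    return COT_SENTIMENT_PROMPT.format(review=review)

prompt = build_cot_prompt("The product is lit, but delivery was late.")
```

Forcing the model to enumerate positive and negative cues separately before committing to a label is what helps it resolve paradoxical reviews like the example above.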
Custom Evaluation Metrics
Once the text summarization and sentiment analysis processes were developed, it was essential to evaluate their effectiveness to ensure that the summaries and insights were both accurate and actionable without losing their core meaning. To achieve this, we followed a structured approach for evaluation, as outlined below.
Creating the Prime Synopsis and Scorecard
To thoroughly assess the performance of the summarization process, we utilized traditional metrics such as ROUGE and BLEU scores, while also creating two distinct and valuable perspectives:
Prime Synopsis: This comprehensive summary was carefully crafted from the individual summaries of text clusters, synthesizing the entire dataset into a unified and thorough overview. It served as a reference point for evaluating the quality and completeness of the generated summaries.
Scorecard: The scorecard identified the key topics from the text and linked each one to its corresponding sentiment (positive, negative, or neutral). This organized view of sentiment across various topics allowed for a clearer understanding of the emotional tone within the data. By visualizing the sentiment distribution, we were able to quickly pinpoint areas of concern and satisfaction, facilitating more informed and strategic decision-making.

To assess the consistency and completeness of the generated summaries and topics, we relied on the prime synopsis and scorecard. We ensured that the golden summary aligned with key topics and sentiments identified in the scorecard, using LLMs to conduct consistency checks. Topic coverage was another key metric, evaluating how thoroughly the golden summary represented the main topics. These metrics helped us fine-tune our text summarization tool, significantly improving the quality of its outputs. These evaluation methods proved crucial for business stakeholders, as they ensured that the insights drawn from text datasets were reliable and comprehensive, empowering informed decision-making.
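The topic-coverage metric can be sketched as a simple ratio: the fraction of scorecard topics that the golden summary actually mentions. The substring match below is a deliberately crude stand-in for the LLM-based consistency checks described above; the function and variable names are illustrative.

```python
# Sketch of a topic-coverage check: what fraction of scorecard topics
# appear in the golden summary. A plain substring match stands in for
# the LLM-based consistency checks used in the case study.
def topic_coverage(golden_summary: str, scorecard_topics: list[str]) -> float:
    summary = golden_summary.lower()
    covered = [t for t in scorecard_topics if t.lower() in summary]
    return len(covered) / len(scorecard_topics)

summary = "Customers praise the battery but complain about slow shipping."
topics = ["battery", "shipping", "customer service"]
coverage = topic_coverage(summary, topics)  # 2 of 3 topics covered
```

A low coverage score flags summaries that silently dropped a scorecard topic, which is exactly the consistency failure mode described above.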
For visualization, we implemented an intuitive and interactive framework that organized the insights, making it easier to explore and navigate through the key findings and text. We refer to this as the Distilled Insight Navigation Tree (DINT) in this article.
Use cases and applicability
The combination of text summarization and sentiment analysis offers significant effectiveness across a wide range of use cases, particularly those that involve analyzing large volumes of text data. Here, we present a tested use case where this solution provided valuable insights:
Customer Product Reviews
Use case: We applied our summarization and sentiment analysis tool to a dataset of e-commerce product reviews to assess its capabilities.
Outcome: The tool generated summaries and scorecards that delivered quick, actionable insights into product performance. Key themes such as product quality, shipping experiences, and customer service were clearly highlighted, enabling businesses to identify customer priorities and address urgent issues. For instance, when analyzing tablet reviews, the tool efficiently synthesized feedback to extract crucial insights about product features and user experiences.

Similarly, this approach can be applied to gain insights from customer support, surveys, feedback texts, and other similar use cases.
Learnings and Challenges
During our development of an effective text summarization tool, we encountered several challenges and gained valuable insights. Dealing with noisy data, such as typos, slang, and irrelevant content, required advanced preprocessing techniques to ensure the algorithms focused on the core meaning. Balancing the quality and completeness of summaries was another significant challenge, which we overcame using the golden summary approach and scorecards to systematically assess and refine the summaries. We also faced consistency issues with LLMs, where relevant topics sometimes did not appear in the final summaries, or sentiments were misclassified. To enhance consistency and reliability, we implemented custom evaluation metrics, such as topic coverage and consistency checks, and iteratively refined the prompts and evaluation techniques to improve accuracy and relevance.
Conclusion
Automated text summarization and sentiment analysis offer powerful solutions for businesses dealing with unstructured text data, enabling them to quickly convert complex information into clear, actionable insights. These AI-driven techniques improve efficiency by reducing the time required to process large datasets, allowing businesses to act on insights more swiftly. By identifying key themes and sentiments, organizations can make more informed decisions, address customer pain points, and refine internal processes. While challenges such as noisy data and model inconsistencies exist, careful preprocessing, prompt refinement, and ongoing evaluation can lead to high-quality summaries. Ultimately, these solutions empower leaders to make smarter, data-driven decisions with greater clarity and agility, helping businesses stay ahead in managing large volumes of text data.