Human-AI Tools for Contextualizing Differences: Bridging Data-Driven Insights with Real-World Interpretability

1 Introduction

Understanding how user experiences vary by region is key to delivering personalized, context‑aware insights. For example, a coffee shop ideal for working in one city may serve primarily as a social hub elsewhere. Existing recommendation and retrieval systems typically treat user behavior as uniform across regions, overlooking how local customs, climate and social norms shape activities, emotions and expectations.

We address this gap by building a human–AI system that:

Analyzes millions of user reviews (Yelp Open Dataset: 6.9 M reviews, 150 K businesses across 11 metro areas) using BM25 scoring to control for document length and term frequency.
Surfaces location‑specific patterns—e.g. comparing “sunbathing” at beaches in Florida vs. Pennsylvania.
Visualizes results with interactive bar and line charts, plus adaptive relevance thresholds, so both technical and non‑technical users can grasp regional differences at a glance.

Our contributions:

System: A pipeline that combines BM25, dynamic thresholds and accessible visualizations to highlight geographic, cultural and social conceptual shifts.
Technical: A scalable backend (Python, PostgreSQL) that precomputes BM25 scores and thresholds for rapid querying.
Conceptual: A framework showing how place‑based habits emerge from local context—and how to integrate that into IR and recommendation models.

2 Related Work

2.1 Context‑Aware Recommendations

Environmental factors (location, time) influence preferences and mood [Suhaim 2021].
Cultural and socio‑economic dynamics shape e‑commerce recommendations but are rarely modeled in real time [Yohe 2023].
Graph‑based methods for location‑based social networks improve on popularity‑based and collaborative filtering (CF) baselines by integrating user, place and context [Khazaei 2019].

2.2 Bias in AI

Large language models reflect and amplify biases present in training data—gender, race, cultural stereotypes—which hinder fair contextual reasoning [Mehrabi 2021; Gallegos 2024].
Cultural nuance remains difficult to encode; purely LLM‑based approaches require extensive fine‑tuning and still miss localized meanings [Li 2024].

2.3 Information Retrieval & BM25

BM25 outperforms TF‑IDF on variable‐length documents by applying non‑linear term weighting and document‑length normalization [Lv 2024].
Widely used in search engines, digital libraries and retrieval‑augmented generation, BM25 offers a robust basis for our region‑comparison use case.
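
For reference, the commonly cited Okapi form of BM25 is reproduced below. The IDF variant shown is an assumption, since the text does not state which one the system uses.

$$
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
\frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)
$$

Here $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents and $n_t$ is the number of documents containing $t$. The fraction saturates as $f(t, d)$ grows (the non‑linear term weighting), and the $|d| / \mathrm{avgdl}$ factor provides the length normalization.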

3 Design Space

3.1 Design Problem

Users without technical expertise struggle to interpret raw frequency or similarity scores when exploring unfamiliar locales. LLMs can assist but often inject bias and lack transparency. We need a system that:

Adapts to geographic, cultural and social diversity.
Balances accuracy, speed and interpretability.
Presents findings in an intuitive, accessible format.

3.2 Design Goals

Inclusivity: Avoid AI‑driven inference biases by using purely statistical retrieval (BM25).
Interpretability: Translate scores into categories—“not relevant,” “marginal,” “relevant”—using location‑specific thresholds.
Usability: Deliver results in seconds via clear visualizations rather than cryptic numbers or complex narratives.

4 System Description

4.1 User Interface & Workflow

Select source/target locations (state or city).
Choose an activity (single‑word query).
View side‑by‑side bar charts (top categories) and line charts (score distributions with “no engagement,” “significant shift,” and “relevance” thresholds).

[Figure: User Workflow]

4.2 Technical Pipeline

4.2.1 Data Processing & Storage

Raw JSON → Parquet → PostgreSQL (12 tables).
Text cleaning: regex, lemmatization (NLTK), spell‑correction (SymSpell), stopword removal.
Parallel batch processing in Python for scalability.
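
The cleaning step above can be sketched roughly as follows. This is a minimal, standard‑library illustration rather than the system's actual code: the function names `clean_review` and `clean_batch`, the token pattern and the tiny stopword set are all assumptions, and lemmatization (NLTK) and spell‑correction (SymSpell) are deliberately left out to keep the sketch self‑contained.

```python
import re

# Tiny illustrative stopword set (assumption); a real pipeline would use a
# full list such as the one shipped with NLTK.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "at", "in", "of", "to", "we", "it"}

# Keep runs of ASCII letters only; digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+")


def clean_review(text: str) -> list[str]:
    """Lowercase a review, tokenize it with a regex, and drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def clean_batch(reviews: list[str]) -> list[list[str]]:
    """Clean one batch of reviews; batches are independent of each other."""
    return [clean_review(r) for r in reviews]
```

Because each batch is processed independently, a function like `clean_batch` could be mapped over batches with a process pool to obtain the parallelism described above.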

4.2.2 Contextual Difference Computation

TF‑IDF vs. BM25: BM25’s $k_1$ hyperparameter controls term‑frequency saturation and $b$ controls document‑length normalization, yielding stable, meaningful rankings across locations.
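Conceptual shift: $\Delta = |\mathrm{BM25}_{\text{source}} - \mathrm{BM25}_{\text{target}}|$, the absolute difference between an activity's BM25 scores in the source and target locations.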
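
A minimal sketch of this computation for a single‑word query is given below. Several details are assumptions, not taken from the system: the helper names `bm25_scores` and `conceptual_shift`, the defaults $k_1 = 1.5$ and $b = 0.75$, the IDF variant, scoring both locations against one shared corpus, and aggregating each location by its mean review score.

```python
import math


def bm25_scores(term, docs, k1=1.5, b=0.75):
    """BM25 score of a single-term query for each tokenized document."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n_docs
    n_t = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
    scores = []
    for d in docs:
        tf = d.count(term)
        if tf == 0:
            scores.append(0.0)
            continue
        # Length normalization: longer-than-average documents are penalized.
        norm = 1.0 - b + b * len(d) / avgdl
        scores.append(idf * tf * (k1 + 1.0) / (tf + k1 * norm))
    return scores


def conceptual_shift(term, source_docs, target_docs, k1=1.5, b=0.75):
    """Absolute difference of mean BM25 scores between two locations.

    Assumption: both locations are scored against one combined corpus so
    that IDF and average document length are shared.
    """
    scores = bm25_scores(term, source_docs + target_docs, k1, b)
    src, tgt = scores[:len(source_docs)], scores[len(source_docs):]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return abs(mean(src) - mean(tgt))


# Toy tokenized reviews (invented for illustration only).
florida = [["sunbathing", "beach", "sunbathing"], ["beach", "sunbathing"]]
pennsylvania = [["beach", "cold"], ["lake", "walk"]]
delta = conceptual_shift("sunbathing", florida, pennsylvania)
```

With these toy reviews, `delta` is positive because the term appears only in the first location; comparing a location with itself yields zero.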

4.2.3 Threshold‑Based Interpretation

No Engagement: 5th percentile of scores
Significant Shift: 15 % of the score range (max – min)
Relevance: 60 % of the score range (max – min)
Together, these thresholds convert raw BM25 scores into one of four labels: “not relevant,” “marginal,” “relevant,” or “not found.”
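
One way these thresholds might be computed and applied is sketched below. The text leaves several points open, so the following are assumptions: the range‑based thresholds are taken as offsets above the minimum score, the 5th percentile uses the nearest‑rank method, a missing score (`None`) maps to “not found,” and the names `compute_thresholds` and `categorize` are hypothetical.

```python
import math


def compute_thresholds(scores):
    """Derive the three location-specific thresholds from a score list."""
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)
    lo, hi = ordered[0], ordered[-1]
    span = hi - lo
    # Nearest-rank 5th percentile (assumed percentile method).
    rank = max(1, math.ceil(0.05 * len(ordered)))
    return {
        "no_engagement": ordered[rank - 1],
        # Assumption: range-based thresholds are offsets above the minimum.
        "significant_shift": lo + 0.15 * span,
        "relevance": lo + 0.60 * span,
    }


def categorize(score, thresholds):
    """Map a raw BM25 score (or None if the term is absent) to a label."""
    if score is None:
        return "not found"
    if score >= thresholds["relevance"]:
        return "relevant"
    if (score >= thresholds["significant_shift"]
            and score > thresholds["no_engagement"]):
        return "marginal"
    return "not relevant"


# Toy scores for one location (invented for illustration only).
thresholds = compute_thresholds([0.0, 0.5, 1.0, 2.0, 4.0, 10.0])
labels = [categorize(s, thresholds) for s in (10.0, 2.0, 0.5, None)]
```

Since the thresholds depend only on a location's score distribution, they can be precomputed alongside the BM25 scores and looked up at query time.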

4.2.4 Implementation Stack

Backend: Python, Numba/NumPy for BM25, PostgreSQL
Frontend: Next.js + MUI, TypeScript, Recharts, TanStack React Query, Prisma ORM
Visualization: Accessible color palette (WCAG 2 compliant) for bar/line charts

5 Discussion

5.1 Key Insights

Adaptive thresholds ensure meaningful comparisons across diverse contexts.
Graphical summaries (bar/line charts) democratize access for non‑technical users.
The pipeline balances speed, accuracy and interpretability.

5.2 Challenges

BM25 matches terms lexically and therefore misses semantic nuance such as polysemy and idioms.
Yelp data may be unevenly distributed, requiring multi‑source integration to mitigate sampling bias.

5.3 Limitations & Future Work

Integrate BERT‑based re‑ranking for semantic refinement.
Expand beyond Yelp (Google Reviews, social media) for richer context.
Support real‑time indexing and updates.
Conduct user studies to validate interpretability and decision support.

Abstract