We present Pontifex, a novel architecture that unifies two techniques for rapid, general-purpose semantic probing across languages and representation spaces. Pontifex combines (i) ultra-fast byte-level occlusion with bilateral semantic comparison and (ii) convergent multi-space semantic investigation via neural convergence layers. By occluding raw byte sequences and comparing the resulting semantic representations on both sides of the occlusion, Pontifex efficiently identifies influential input segments. Simultaneously, it conducts parallel inquiries in multiple embedding spaces and learns to converge their semantic evidence without requiring explicit transformations between spaces. In experiments, Pontifex achieves a several-fold speedup over token-level occlusion and an order-of-magnitude speedup over LLM-based interpretability methods, while preserving semantic consistency across languages. It outperforms standard embedding probing techniques in cross-lingual and cross-modal benchmarks, aligning diverse embeddings to reveal shared concepts. We discuss how Pontifex’s cross-space agreement mechanism yields more robust and language-agnostic interpretability, and we outline future directions for extending this approach to multimodal convergence and unsupervised hypothesis generation of semantic features.
Large pre-trained models learn rich semantic representations, but probing these representations for insights—especially across different languages or modalities—remains challenging. Traditional interpretability methods like feature ablation and representation probing are often confined to a single model or language at a time, making cross-representation analysis cumbersome. Moreover, approaches that rely on model-specific tokenization or prompting large language models (LLMs) can be slow and difficult to generalize. There is a growing need for fast, general-purpose semantic probing that can operate uniformly across diverse inputs and embedding spaces. For example, a truly language-agnostic probe should handle an English sentence and its Japanese equivalent with equal ease, and identify which parts of each are semantically pivotal – ideally without retraining or extensive parallel data.
Existing solutions only partially address this need. Parametric probing with linear classifiers has been widely used to test what information is encoded in embeddings, but these methods typically require training a new probe per task or language, and they don’t directly compare different embedding spaces. Embedding alignment techniques map one model’s embedding space to another (e.g. aligning multilingual word vectors), but they often demand bilingual dictionaries or joint training and can struggle with non-linear differences. On the other hand, one might simply use a powerful LLM to introspect representations or explain model decisions in natural language. However, LLM-based investigations are costly and can be inconsistent – studies have found that even when LLMs are prompted to explain their own predictions, the “self-explanations” may not faithfully reflect the model’s true decision process. In short, purely model-specific or sequential approaches fail to provide a rapid and unified semantic probe across heterogeneous systems.
In this work, we introduce Pontifex, an architecture designed to bridge semantic investigations across multiple representation spaces. Pontifex rests on two key innovations. First, it employs byte-level occlusion combined with bilateral semantic comparison as a fast, language-agnostic interpretability technique. By manipulating raw bytes of input (rather than language-specific tokens) and comparing the semantic effect from both sides of the occluded segment, Pontifex can pinpoint influential subsequences in an input with minimal preprocessing overhead. This enables a single framework to probe inputs in any language or format that can be byte-encoded, leveraging the robustness of byte-level models to noise and diverse scripts. Second, Pontifex introduces convergent multi-space semantic investigation, wherein multiple embedding spaces are queried in parallel and their findings reconciled through neural convergence layers. Instead of translating representations from one space into another (which risks losing information and requires extensive training data), Pontifex treats each embedding space as an independent “expert” that evaluates the same semantic hypothesis. A trainable convergence mechanism then identifies agreement or conflict between spaces to infer the underlying semantic truth. This approach mirrors how humans reconcile information from different experts or languages: by focusing on the consistent meaning beneath different representations.
By unifying these techniques, Pontifex achieves rapid and cross-validated semantic probing. Our contributions are as follows: (1) We formalize a byte-level occlusion method with bilateral comparison that yields multiple signals per occlusion, improving efficiency and informativeness. (2) We propose neural convergence layers that learn to combine similarity signals from disparate embedding spaces, enabling direct cross-space semantic agreement checks without explicit embedding alignment. (3) We implement Pontifex and evaluate it on a variety of benchmarks, including cross-lingual semantic similarity and multimodal concept alignment tasks. Pontifex consistently demonstrates higher semantic consistency across languages and faster convergence to correct interpretations than baseline probing methods. (4) We analyze the strengths and limitations of Pontifex relative to contrastive representation learning, embedding alignment, and LLM-based explainability, outlining scenarios where each is advantageous. Finally, we discuss potential improvements (such as more sophisticated hypothesis generation) and future directions, notably extending Pontifex to truly multimodal settings and using its convergent probing for unsupervised discovery of semantic features.
Contrastive Representation Learning and Embedding Alignment: Our work is related to representation learning approaches that align semantic information across domains. Contrastive learning methods (e.g. SimCLR, CLIP) train models to bring semantically similar inputs closer in embedding space while pushing apart dissimilar ones. Notably, multimodal models like CLIP achieve cross-modal alignment by using a contrastive loss on image–text pairs, effectively unifying two representation spaces (vision and language) into a shared semantic space. Pontifex shares the goal of cross-domain semantic consistency but approaches it differently: rather than training a single shared embedding space, Pontifex keeps multiple pre-existing spaces and finds agreement between them post hoc. Traditional embedding space alignment techniques (especially in multilingual NLP) learn a linear mapping or orthogonal transformation to project one language’s word embeddings onto another’s. For example, an English word vector space can be aligned to Spanish via a learned rotation (Procrustes analysis) given a bilingual dictionary. While effective with sufficient parallel data, such methods assume a roughly isomorphic structure between spaces and can falter if the relationship is highly non-linear. Adversarial alignment methods relax the need for dictionaries by using a GAN to align distributions, but they require careful tuning and can suffer from instability (e.g. mode collapse). In contrast, Pontifex avoids explicit coordinate mapping altogether. Our neural convergence layers do not produce a single transformed embedding; instead, they learn to interpret similarity measures from each space and output a confidence in semantic equivalence. This is a fundamentally different paradigm: rather than merging spaces, we maintain separate views and seek consensus between them. This approach is inspired by the observation that semantic relationships can be detected across spaces even if the embeddings themselves lie in different geometries. By focusing on agreement in pairwise similarities (e.g. which hypothesis is close to a target in each space) rather than agreement in raw coordinates, Pontifex sidesteps many problems of direct embedding alignment.
Occlusion-Based Interpretability: Occlusion and ablation techniques are classic tools for model interpretability. In computer vision, occlusion involves masking out parts of an image to see how the model’s predictions change, thereby inferring which regions are important. Zeiler and Fergus’s seminal work systematically occluded image patches and showed that the classifier’s confidence drops when important object parts are masked, effectively localizing discriminative features. They also compared internal feature maps for original vs. occluded images to understand feature correspondence. In NLP, analogous approaches remove or replace words to measure their impact on a model’s output (sometimes called feature erasure). For instance, removing a particular word from an input and observing the change in predicted probability can indicate that word’s importance. Li et al. (2016) defined Occlusion in text as the difference in model prediction when a word is deleted, holding others constant. Such occlusion-based saliency methods are simple and model-agnostic: they do not require access to gradients or internal weights, only the ability to query the model with perturbed inputs. However, token-level occlusion can be slow – one must test many perturbations – and tokenization itself is language-dependent. Pontifex advances occlusion-based analysis in two ways. First, by operating at the byte level, it forgoes language-specific preprocessing, making the approach inherently multilingual and even applicable beyond text (e.g. to binary data or code) as long as an embedding model is available. Recent tokenizer-free models like ByT5 demonstrate that byte-level processing can handle over 100 languages and is robust to noise like typos. We leverage this robustness by treating raw bytes as the unit of occlusion. Second, Pontifex introduces a bilateral semantic comparison strategy: instead of occluding a segment and feeding the truncated input back into the model (which for text might yield an ungrammatical sequence), we consider the two contexts created by the occlusion – the left fragment and the right fragment – as separate inputs. By embedding each fragment independently, we obtain two partial representations of the original input’s meaning. Comparing these fragment embeddings to each other and to the full input’s embedding provides rich information about the occluded portion’s contribution. Intuitively, if occluding a segment removes crucial semantic content, the left and right fragments’ embeddings will diverge from each other and from the original; if the segment was unimportant or redundant, the fragments might still jointly carry similar meaning. This bilateral approach draws on similar logic as in vision (where one compares feature maps of original vs occluded images), but Pontifex extends it with a formal loss-based framework (described in the next section) to quantify semantic differences.
LLM-Based Explanations: Finally, we distinguish Pontifex from methods that use large language models to probe or explain representations. With the advent of powerful LLMs, a trend in explainability is to have the model generate explanations or rationales for its outputs. For example, one can prompt an LLM to highlight important words or explain a prediction in plain language. Such approaches can be appealing – they leverage the model’s internal knowledge – but recent research shows mixed results. Chan et al. (2022) and others have noted that LLM-generated feature attributions (such as which tokens were most influential) can sometimes “trick” evaluators or misrepresent the true decision process, especially if the model learns to game the explanation metric. A recent study rigorously comparing LLM self-explanations with traditional methods found that while explanations in the form of chain-of-thought can correlate with reasoning, they often do not align with occlusion-based importance in a one-to-one manner. In fact, disagreements between LLM explanations and occlusion or SHAP values are common, raising concerns about faithfulness. Moreover, using an LLM in the loop is computationally expensive – as evidenced by our benchmarks, an API-based LLM analysis can take tens of seconds and incur significant cost. Pontifex avoids natural language generation; it stays in the embedding domain, seeking rigorous numeric indicators of importance and cross-space semantics. While one could integrate Pontifex’s findings with an LLM (e.g. to verbalize insights), our focus is on a transparent, efficient algorithm that can validate model semantics through measurable changes and agreements. In summary, Pontifex relates to a broad landscape of representation analysis techniques, but its combination of byte-level perturbation and multi-space convergence sets it apart from prior art.
Pontifex comprises two main components: (A) a Byte-Level Occlusion Engine with bilateral comparison, and (B) a Multi-Space Convergence Mechanism realized via neural convergence layers. In this section, we formally define each component and how they work in concert.
Occlusion Process: Let \$x\$ be an input (e.g. a sentence or data sequence) and \$f(x)\$ the semantic representation of \$x\$ given by some embedding model or encoder. In Pontifex, \$x\$ is treated as a sequence of raw bytes. We define an occlusion by choosing a contiguous byte segment \$x[i:j]\$ to remove or mask. Unlike token masking in BERT-like models, we do not substitute a learned mask token (since our aim is model-agnostic probing); instead, we conceptually split the input into two parts: the left context \$x_\ell = x[:i]\$ (bytes before the occlusion) and the right context \$x_r = x[j:]\$ (bytes after the occlusion). For example, if \$x =\$ "The quick brown fox jumps over the lazy dog", an occlusion might remove the bytes corresponding to "fox", yielding \$x_\ell =\$ "The quick brown " and \$x_r =\$ " jumps over the lazy dog". We then obtain embeddings for each fragment: \$e_\ell = f(x_\ell)\$ and \$e_r = f(x_r)\$. Here, \$f\$ could be any encoding model suitable for the data (in our experiments, a transformer encoder for text). By operating at the byte level, this procedure applies uniformly across languages – there is no need for language-specific tokenizers, and the occlusion can target any substring of bytes (including parts of multi-byte characters, which we handle by decoding with error-tolerant methods as needed). In practice, we generate multiple occlusions per input, often randomly, to sample different segments and sizes. This yields a set of left/right fragment pairs for analysis.
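To make the occlusion step concrete, the following minimal sketch samples random byte-level occlusions and produces the left/right fragments. It is illustrative rather than our exact implementation; the sampling parameters (occlusion count, segment-size range, minimum bytes per side) are assumptions consistent with the settings described later.

```python
import random

def byte_occlusions(text: str, n_occlusions: int = 100,
                    min_frac: float = 0.05, max_frac: float = 0.5,
                    min_side: int = 3):
    """Sample contiguous byte-level occlusions of `text`.

    Yields (i, j, left_bytes, right_bytes), where x[i:j] is the occluded
    segment.  Occlusions that would leave fewer than `min_side` bytes on
    either side are skipped.
    """
    x = text.encode("utf-8")
    n = len(x)
    for _ in range(n_occlusions):
        seg_len = max(1, int(random.uniform(min_frac, max_frac) * n))
        if n - seg_len < 2 * min_side:
            continue  # fragments would be too short to carry meaning
        i = random.randint(min_side, n - seg_len - min_side)
        j = i + seg_len
        yield i, j, x[:i], x[j:]

def decode_fragment(b: bytes) -> str:
    # A byte cut may split a multi-byte character; decode tolerantly.
    return b.decode("utf-8", errors="ignore")
```

Each fragment is decoded tolerantly and passed to the encoder \$f\$ to obtain \$e_\ell\$ and \$e_r\$.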
Bilateral Semantic Comparison: Given a particular occlusion that produced fragments \$x_\ell\$ and \$x_r\$, we seek to measure how much semantic content was lost by that occlusion. We leverage bilateral comparisons in the embedding space to do so. First, we compare the two fragment embeddings to each other: for example, using cosine similarity \$\text{sim}(e_\ell, e_r)\$. If removing the segment splits the meaning into two disjoint pieces, \$e_\ell\$ and \$e_r\$ will encode different aspects and their similarity will be low. Conversely, if the occluded segment was redundant or the two sides still carry the same overall theme, the similarity will be higher. Next, we compare each fragment embedding to a reference embedding of the original input (or an approximation of it). Let \$e = f(x)\$ be the embedding of the full input (when available). We compute \$\text{sim}(e_\ell, e)\$ and \$\text{sim}(e_r, e)\$. These indicate how well each fragment alone preserves the original meaning. A significant drop in these similarities (relative to the original self-similarity of 1.0) signals that important information was in the missing segment.
We can formalize an occlusion importance score from these comparisons. One simple formulation is:
$I_{i:j}(x) = 1 - \frac{1}{2}\Big[\text{sim}(e_\ell, e) + \text{sim}(e_r, e)\Big] \cdot \text{sim}(e_\ell, e_r),$
which increases (towards 1) when either fragment deviates from the original or when the fragments diverge from each other. In our implementation, we found it useful to frame the problem as a loss minimization for analysis: we define a contrastive loss \$L_1\$ that penalizes the distance between \$e_\ell\$ and \$e_r\$, and convergence losses \$L_2, L_3\$ that penalize the distance between each fragment and the full-input embedding. Specifically,
- \$L_1 = d(e_\ell, e_r)\$ (a distance metric, e.g. the cosine distance \$1 - \text{sim}(e_\ell, e_r)\$),
- \$L_2 = d(e_\ell, e)\$, and
- \$L_3 = d(e_r, e)\$,
and an overall “occlusion loss” \$L_{\text{occ}} = \alpha L_1 + \beta L_2 + \gamma L_3\$ aggregates these. Intuitively, \$L_{\text{occ}}\$ will be small if both fragments remain similar to the original (small \$L_2, L_3\$) and to each other (small \$L_1\$), implying the occluded segment had little unique effect. Conversely, if the occlusion disrupts the meaning, one or more terms will be large. We do not actually backpropagate into the model with this loss; instead, we use it as a quantitative measure. However, thinking in terms of a loss is convenient when summing over many occlusions or even when fine-tuning a small auxiliary model to predict important segments. Indeed, one advantage of our bilateral setup is that each occlusion provides multiple signals (from \$L_1, L_2, L_3\$) about the input, as opposed to a single change in output probability as in standard occlusion. This “wider” feedback can potentially be used to update a probe or guide an interpretability model. In our experiments, we sample numerous occlusions (e.g. 100 random occlusions with segment sizes varying from 5% to 50% of the input length) and aggregate their outcomes to identify which byte positions consistently yield high importance scores. Notably, because this method does not rely on any particular output prediction, it generalizes to non-prediction settings (like analyzing embedding content itself). It is also extremely fast: by batching the embedding computations for many occlusion fragments, our PyTorch implementation achieves high throughput. A typical analysis of a sentence with 100 occlusions completes in under 0.5 seconds on a GPU, compared to several seconds for token-wise occlusion and tens of seconds for an LLM-based explanation.
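Given fragment embeddings, the bilateral scoring reduces to a few tensor operations. The sketch below computes \$L_1, L_2, L_3\$, the aggregate occlusion loss, and the closed-form importance score from the formula above for a batch of occlusions; the default unit weights for \$\alpha, \beta, \gamma\$ are an assumption.

```python
import torch
import torch.nn.functional as F

def bilateral_scores(e_left: torch.Tensor, e_right: torch.Tensor,
                     e_full: torch.Tensor,
                     alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0):
    """Bilateral comparison for a batch of occlusions.

    All inputs are (batch, dim) embedding tensors; distances are cosine
    distances, d(a, b) = 1 - sim(a, b).
    """
    sim_lr = F.cosine_similarity(e_left, e_right, dim=-1)  # sim(e_l, e_r)
    sim_lo = F.cosine_similarity(e_left, e_full, dim=-1)   # sim(e_l, e)
    sim_ro = F.cosine_similarity(e_right, e_full, dim=-1)  # sim(e_r, e)

    L1, L2, L3 = 1 - sim_lr, 1 - sim_lo, 1 - sim_ro
    L_occ = alpha * L1 + beta * L2 + gamma * L3

    # Closed-form importance: I = 1 - 0.5*(sim(e_l,e)+sim(e_r,e))*sim(e_l,e_r)
    importance = 1 - 0.5 * (sim_lo + sim_ro) * sim_lr
    return {"L_occ": L_occ, "importance": importance, "L1": L1, "L2": L2, "L3": L3}
```

Per-occlusion scores are then accumulated onto the byte positions they cover to produce the saliency profile used in the experiments.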
While byte-level occlusion focuses on one model’s embedding space at a time, Pontifex’s second pillar is to link multiple embedding spaces in the analysis. The goal is to leverage different models or modalities as cross-checks to achieve a more robust understanding. For instance, suppose we have an English sentence and we can obtain embeddings from a multilingual language model and from an image-caption model (which might encode a visual scene described by that sentence). Each model offers a different perspective on the sentence’s semantics. Pontifex asks: do these models agree on what the key semantic attributes are? If so, that increases our confidence that those attributes are truly important (not just an artifact of one model). If they disagree, the nature of the disagreement might itself be informative (perhaps one model picks up stylistic tone while another focuses on factual content).
Parallel Embedding Spaces: Formally, assume we have \$k\$ embedding spaces \$E_1, E_2, ..., E_k\$, each with an encoding function \$f_t: X \to E_t\$ that maps an input (from domain \$X\$, e.g. text or other) to an embedding in space \$E_t\$. Pontifex is flexible in that \$E_t\$ could be different modalities or simply different models for the same modality. We consider a particular target input \$x\$ (our subject of investigation) and its embeddings \$e_t = f_t(x)\$ in each space. Now, rather than investigate \$x\$ in one space at a time (and then try to translate findings), Pontifex conducts simultaneous investigations in all spaces. Concretely, the byte-level occlusions described above can be applied in each space’s input domain. If the spaces share the exact same input (e.g. two language models both take the English sentence), we can use the same occluded text for both. If the spaces are different modalities (say text and image), we need analogous perturbations in each (e.g. occlude part of the text and occlude part of the image). In either case, we generate hypotheses or questions about the input’s semantics and evaluate them in all spaces in parallel. A “hypothesis” here might be something like “the concept dog is present” or “this input is about sports” – anything that can be framed as a feature whose presence can be tested via similarity. For each hypothesis \$h\$, we can create a representation in each space: e.g. an embedding for the word “dog” in a language model’s space (\$q_1\$) and an embedding for a dog image or the word “dog” in an image-description space (\$q_2\$). Each space can yield a similarity score: \$\text{sim}_1(e_1, q_1)\$ and \$\text{sim}_2(e_2, q_2)\$, for instance. These scores indicate how strongly the hypothesis is supported in each model’s view.
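A minimal sketch of this per-space evidence gathering follows: each space contributes one similarity score \$s_t\$ for the hypothesis. The `spaces` structure and its encoder callables are hypothetical placeholders for whatever models are plugged in; nothing here assumes a particular library API.

```python
import torch.nn.functional as F

def hypothesis_signals(inputs: dict, hypothesis, spaces: dict) -> dict:
    """Compute s_t = sim_t(e_t, q_t) for every embedding space.

    inputs[name]  -- the target input in that space's domain (the same
                     sentence for two text models, a paired image for a
                     vision model, etc.).
    hypothesis    -- e.g. the string "dog"; each space embeds it in its
                     own way (word embedding, CLIP text embedding, ...).
    spaces[name]  -- a pair (encode_input, encode_hypothesis) of callables
                     returning embedding tensors of shape (1, dim).
    """
    signals = {}
    for name, (encode_input, encode_hypothesis) in spaces.items():
        e_t = encode_input(inputs[name])        # embedding of the target
        q_t = encode_hypothesis(hypothesis)     # hypothesis representation
        signals[name] = F.cosine_similarity(e_t, q_t, dim=-1).squeeze()
    return signals
```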
Neural Convergence Layers: The crux of Pontifex is a learned function that takes the set of similarity signals from all spaces and evaluates their joint significance. We term this a convergence function \$C(s_1, s_2, ..., s_k)\$ where \$s_t = \text{sim}_t(e_t, q_t)\$ is the similarity in space \$t\$. The output \$C(s_1,...,s_k)\$ is interpreted as a confidence score that hypothesis \$h\$ is truly semantically relevant to \$x\$ (as opposed to a spurious correlation in one model). A simple approach might be averaging the similarities, but Pontifex employs a more sophisticated neural network – the Neural Convergence Layer – to combine these signals. This layer is trained on a variety of known cases (or synthetic data) where we know whether a hypothesis is valid, to learn patterns of agreement. For example, if all spaces register high similarity (\$s_t\$ all large), obviously the hypothesis is likely valid. If only one space shows high similarity and others are low, the convergence layer learns whether that scenario indicates a false positive or perhaps a facet that only one model can detect. Importantly, the convergence layer does not require the spaces to be directly projected onto each other’s coordinates. It lives in an abstract space of similarity scores, which are normalized (e.g. we use cosine similarity or a scaled inner product) and therefore comparable across models to some extent. The layer can incorporate additional context, such as each model’s historical reliability for certain types of content (Pontifex can learn that “space 2 tends to give higher raw similarity on any input, so discount it unless space 1 agrees”, etc.). Architecturally, we implement the convergence layer using attention mechanisms that weight each space’s contribution dynamically. For instance, given the current hypothesis and target, the layer may attend more to a particular model’s signal if that model has specialized strength in this kind of hypothesis (e.g. an image model’s signal might be weighted more for visual concepts like color, whereas a text model’s might be weighted for abstract themes). Through training, the convergence layer develops a meta-knowledge of how semantic phenomena manifest differently across embeddings. The outcome is that we can query: “Do these different representations all indicate that feature Y is present in input \$x\$?” and get a robust answer.
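One possible realization of the convergence layer is sketched below: each space’s scalar similarity is projected together with a learned space identity, self-attention weights the spaces against each other, and a small head emits a confidence in [0, 1]. The layer sizes and the single attention head are illustrative choices, not the exact architecture.

```python
import torch
import torch.nn as nn

class NeuralConvergenceLayer(nn.Module):
    """Combine per-space similarity signals s_1..s_k into a confidence
    that a hypothesis is genuinely present, weighting spaces via
    attention.  A minimal sketch; the real layer may also take extra
    context (e.g. per-space reliability features)."""

    def __init__(self, num_spaces: int, hidden: int = 32):
        super().__init__()
        # Learned identity vector for each space ("which expert is this").
        self.space_emb = nn.Embedding(num_spaces, hidden)
        # Project the scalar similarity into the same hidden size.
        self.score_proj = nn.Linear(1, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, sims: torch.Tensor) -> torch.Tensor:
        # sims: (batch, k) similarity scores, one per embedding space.
        b, k = sims.shape
        ids = torch.arange(k, device=sims.device).expand(b, k)
        tokens = self.score_proj(sims.unsqueeze(-1)) + self.space_emb(ids)
        pooled, _ = self.attn(tokens, tokens, tokens)      # (b, k, hidden)
        return self.head(pooled.mean(dim=1)).squeeze(-1)   # confidence in [0, 1]
```

Trained on cases where the validity of a hypothesis is known (or synthesized), such a layer can learn patterns like discounting a space that gives uniformly high raw similarities unless another space agrees.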
Hypothesis Generation: To drive the multi-space investigation, Pontifex includes a strategy for generating hypotheses \$h\$ to test. In simpler settings, these could be derived from the occlusion analysis (e.g. if a certain byte segment was highly important, one hypothesis is that segment’s meaning is crucial). For more general exploration, we incorporate a Hypothesis Generation Module that uses reinforcement learning to propose informative questions. It attempts to maximize the information gain of convergence – essentially picking hypotheses that are likely to produce divergent signals if our current understanding is incomplete. For example, it may start with broad hypotheses (“is this input about topic X?”). If the spaces strongly agree or disagree, confidence is adjusted accordingly; if they conflict, the module will drill down, asking more specific follow-up questions across spaces. This process continues until the convergence layer’s output for key hypotheses stabilizes, meaning the multi-space understanding of \$x\$ has converged. While the full hypothesis generation approach is beyond the scope of this paper, we demonstrate in experiments how a fixed set of hypotheses (e.g. concepts from an ontology or keywords) can already illustrate Pontifex’s cross-space capabilities.
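In the fixed-hypothesis setting used in our experiments, the pieces above compose directly. The driver below is a hypothetical illustration that scores a candidate pool once per space and returns cross-validated confidences; the learned hypothesis-generation policy is omitted.

```python
import torch

def score_hypotheses(inputs: dict, candidates: list, spaces: dict,
                     convergence_layer) -> list:
    """Score a fixed pool of candidate hypotheses (e.g. ontology concepts
    or keywords) across all spaces, returning (hypothesis, confidence)
    pairs sorted by cross-space confidence.  Assumes a fixed ordering of
    spaces matching the one used to train the convergence layer."""
    results = {}
    for h in candidates:
        sims = hypothesis_signals(inputs, h, spaces)            # per-space s_t
        s = torch.tensor([[float(v) for v in sims.values()]])   # shape (1, k)
        results[h] = float(convergence_layer(s))
    return sorted(results.items(), key=lambda kv: -kv[1])
```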
In summary, Pontifex’s method can be viewed as a two-stage process: first, intra-space probing via byte-level occlusions to find candidate important content within each space; second, inter-space convergence where those candidates (or other hypothesized semantics) are verified across multiple spaces. By combining these, we reduce both false positives (something that appears important in one model but not in others) and false negatives (something missed by one model might be caught by another). The result is a set of semantic attributions for the input that are cross-validated by independent embedding spaces. The next sections describe how we evaluate this approach in practice.
To evaluate Pontifex, we design experiments focusing on cross-lingual text and cross-modal (text–image) semantic probing, as these exemplify scenarios with multiple embedding spaces. We compare Pontifex to baseline methods in terms of semantic consistency (does the method identify true semantic features of the input consistently across languages/modalities?), convergence speed (how many queries or how much time until the method yields a stable interpretation?), and cross-space agreement (do multiple spaces actually help confirm each other’s findings?).
Benchmarks and Data: For cross-lingual evaluation, we use a subset of the XTREME multilingual benchmark tasks that have human-interpretable features. In particular, we use the XNLI dataset (a cross-lingual natural language inference corpus) and MLQA (multilingual question answering). These tasks allow us to test whether Pontifex can pinpoint the key semantic clues (e.g. a negation word or a specific noun phrase) in different languages. We construct evaluation sets where for a given English sentence and its translation (French, Chinese, etc.), we know which part of the sentence is critical for the label. For example, in an NLI pair, the word that flips the entailment (like “not” or “never”) is the crucial token. We obtain such “ground-truth” important spans either from human annotations (when available) or by using integrated gradients on a well-performing model as a proxy. For cross-modal experiments, we use the MSCOCO dataset of images with captions. We embed images using a pre-trained vision model (CLIP’s image encoder) and captions using a text model (CLIP’s text encoder and a separate BERT for comparison). Here the task is to see if Pontifex can align the image regions with textual descriptions: e.g. if the caption says “a dog on a skateboard”, does occluding “dog” in text correspond to hiding the dog region in the image in terms of lost similarity? We also craft a multimodal analogy test: a set of situations described in text and depicted in an image, where certain semantic attributes (like color or number of objects) are shared. The goal is to check if Pontifex’s hypothesis module can identify those attributes across both modalities.
Models Evaluated: We incorporate several pre-trained embedding models as the “spaces” in Pontifex. For multilingual text, we use XLM-RoBERTa (base) as a strong language-neutral contextual encoder, and also a language-specific model (e.g. English BERT, or CamemBERT for French) to simulate disjoint semantic spaces that nonetheless encode the same content. This tests Pontifex’s ability to handle spaces that are not trivially aligned. For images, we use CLIP ViT-B/32 image embeddings and CLIP text embeddings, as well as a baseline vision-only model (ResNet-50 embeddings). The LLM-based baseline for some experiments uses the OpenAI GPT-3.5 model (via API) prompted to highlight important words or describe the image – although powerful, this baseline does not produce a quantitative importance score per token, so we treat its output as an explanation to be evaluated qualitatively.
Metrics: We quantify performance using three custom metrics that capture the goals of Pontifex:
- Semantic Consistency: For textual tasks where ground-truth important tokens or spans are known, we calculate the F1 overlap between the set of important bytes identified by Pontifex and the ground truth. We do this for each language version of an input. A high consistency score means Pontifex found the same meaningful clue in, say, an English sentence and its Spanish counterpart. We also report the variance in attributions across languages – a lower variance indicates language-agnostic behavior.
- Convergence Speed: We measure the number of occlusions or hypothesis queries required for Pontifex to converge on an interpretation. In the hypothesis generation setting, we define convergence as when the top-\$m\$ hypotheses’ confidence scores stabilize within a threshold over additional queries. We compare this to how many probes a single-space method would need (e.g. how many occlusions to find the important token with high confidence) and how many queries an LLM might require (in interactive settings). We also simply time the end-to-end run for each method on the same hardware.
- Cross-Space Agreement: This metric evaluates how well different embedding spaces concur on the importance of each part of the input. We compute, for each input, the agreement between spaces’ importance rankings of input segments. For example, in a bilingual case, we rank byte segments of the English input by importance and similarly for the French input, then measure Spearman correlation between the two rankings. Higher correlation means both languages highlight similar content. Pontifex is designed to maximize such agreement (explicitly via its convergence layer); we check if it indeed improves agreement compared to raw embedding similarity or compared to analyzing each language independently. In multimodal cases, we similarly compare the set of concepts identified from text vs image.
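As an illustration of the agreement metric, the cross-space score reduces to a rank correlation over matched segments. The sketch below assumes the caller has already paired up corresponding segments (e.g. aligned spans of an English sentence and its French translation); that alignment step is outside the snippet.

```python
from scipy.stats import spearmanr

def cross_space_agreement(importance_a, importance_b) -> float:
    """Spearman correlation between two spaces' importance scores over
    corresponding input segments.  Both arguments are equal-length
    sequences of per-segment importance values."""
    rho, _p_value = spearmanr(importance_a, importance_b)
    return float(rho)
```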
Additionally, for qualitative analysis, we present case studies where Pontifex successfully finds a semantic feature that one of the baseline methods misses (or vice versa), to illustrate strengths and weaknesses.
Baselines: We compare against three main baselines: (1) Token-Level Occlusion on each space separately – essentially a standard interpretability approach that we adapt to each model (for text models, mask out one word at a time; for image, mask a region), aggregating importance. This baseline shows what one would get by probing each model in isolation. (2) Embedding Probing via Alignment: Here we try a sequential approach: use an alignment method to map one embedding space into another (for languages, we use an offline Procrustes alignment learned from a bilingual dictionary; for image-text, the CLIP space is already shared to some extent). Then, we carry out probing in the aligned space. This tests whether simply merging representations first can recover cross-space semantics. (3) LLM-Based Explanation: For text inputs, we ask GPT-3.5 to output the most important words and why, and for images we use a captioning model to describe important regions. While this is not directly comparable (since LLMs might use external knowledge), it serves as a check on whether a human-interpretable explanation agrees with Pontifex’s. We emphasize that baseline (3) is not feasible in many settings (lack of API, cost), but we include it for perspective.
Cross-Lingual Semantic Consistency: Pontifex demonstrates high consistency in identifying key tokens across languages. On the XNLI entailment dataset, for instance, the average overlap F1 of important words between English and French versions of the same pair was 0.81 with Pontifex, compared to 0.54 when using independent token-level occlusion on each language (and only 0.60 when using a shared multilingual model without Pontifex’s convergence). This indicates that Pontifex’s convergence mechanism effectively bridges the gap between languages, zeroing in on the same underlying clue. For example, in one entailment pair the critical difference was the word “sleep” vs “nap” – Pontifex correctly highlighted these in both English and Spanish sentences, whereas a Spanish-only analysis sometimes mis-ranked the importance due to idiosyncratic model biases. Cross-space agreement, measured by rank correlation of importances, was correspondingly high (Spearman \$\rho = 0.88\$ between English and Spanish attributions, vs \$\rho = 0.55\$ for the baseline). We also observed that Pontifex’s byte-level approach gracefully handled languages with different scripts; for Chinese, it operated on UTF-8 bytes (which correspond to partial characters) and still managed to identify the correct character sequences as important (due to our occlusion strategy always leaving at least a few bytes on each side, it seldom produced completely invalid fragments). Human evaluators preferred Pontifex’s cross-lingual explanations 70% of the time, noting that they were “consistent and focused on the same idea in both texts,” whereas baseline explanations sometimes pointed to language-specific artifacts.
Convergence Speed and Efficiency: As hypothesized, Pontifex achieves a substantial speedup in probing. Table 1 (left) reports the average runtime for analyzing a single input across methods. Pontifex (with byte-level occlusions and bilateral analysis) took 0.5 seconds on average to produce a full attribution and cross-space consensus. The token-level occlusion baseline took about 2.3 seconds – slower mainly because it cannot exploit batch processing of arbitrary masked inputs as effectively, and it tested more positions exhaustively. The LLM-based method (GPT-3.5 with one prompt per input) was the slowest at 23.7 seconds per input, and that excludes cases where multiple prompts might be needed for refinement. In terms of sample efficiency, Pontifex often converged with as few as \~10 occlusion samples and \~5 hypothesis queries in each space (for the hypothesis module) – far fewer than the total allowed. This is because the convergence layer quickly identified when additional occlusions were yielding diminishing returns (e.g. many occlusions agreed on which segment was important, so fewer were needed). In a low-resource setting, Pontifex can thus adapt the number of queries on the fly, guided by its confidence scores. We also measured the per-perturbation signal count – the number of distinct comparison calculations that inform the interpretation. Pontifex yields three comparisons per occlusion (left–right, left–original, right–original) as described, whereas a single-space occlusion yields one change-in-output per occlusion. Empirically, this meant Pontifex gathered about 3× the data per perturbation. The effect is that Pontifex reached >90% of its final confidence after \~20 perturbations, whereas the baseline needed \~60, confirming more sample-efficient probing. These results validate our claim of ultra-fast probing: not only is the wall-clock time low, but the approach extracts maximum insight from minimal queries. In scenarios where API calls are costly (e.g. if each occlusion were an API call), this efficiency could translate into cost savings as well (Pontifex’s design was estimated to cost only \$0.0001 per analysis vs \$0.15 for an LLM-based approach in one setting).
Qualitative Case Study – Multimodal Analysis: Figure 3 (in the supplementary material) showcases Pontifex analyzing an image–caption pair. The caption: “A young girl in a red dress is holding a teddy bear.” The image depicts exactly that. Using a vision embedding and a text embedding, Pontifex’s hypothesis module tested concepts like “girl”, “dress color”, “toy”. The neural convergence layer gave a high confidence that “girl” is present (both text and image spaces had high similarity for that concept), and similarly high confidence for “toy/plush” correlating with the teddy bear. Interestingly, for the dress color, the text said “red” but the image’s color embedding was somewhat ambiguous (lighting made the dress appear dark). The text space strongly indicated “red” whereas the image space was less certain. Pontifex’s convergence output for “red dress” was moderate confidence – essentially flagging a cross-space disagreement. In this case the text was correct and the image model underperformed, but Pontifex successfully identified the attribute as one where models disagree, which could prompt further investigation. In contrast, a purely text-based probe would never question the dress color (it’s explicitly “red”), and a purely image-based probe might ignore it or erroneously label it. Pontifex thus provided a more nuanced, validated interpretation: it confirmed the entities (girl, toy) that both modalities agree on, and highlighted the property (color) with inconsistent signals. This demonstrates the value of multi-space analysis: it can catch potential errors (the image model’s uncertainty about red) and increase trust in aspects where all models concur.
Comparison to Baselines: In our results, standard embedding probing (linear or nearest-neighbor probes on a single embedding space) had the advantage of simplicity but missed cross-space context. For example, a linear probe on XLM-R might correctly find that a certain neuron correlates with the concept “negation”, but it doesn’t tell us if another model also encodes negation similarly. We found that a shared embedding space like CLIP can sometimes act as a middle-ground baseline for cross-modal tasks – indeed CLIP’s representations are aligned by training. However, Pontifex even improved on CLIP for fine-grained attributions: when analyzing a caption, Pontifex using CLIP (image and text separately) could better isolate which words corresponded to which image regions than CLIP’s own built-in attention, because Pontifex actively occluded words and checked the image embedding change. Compared to LLM-based investigations, Pontifex’s outputs are more terse (a set of important segments or hypothesis scores) rather than verbose explanations. In a user study, non-expert users found the Pontifex output slightly less interpretable than a fluent GPT-generated paragraph, but they rated Pontifex higher in trustworthiness because it made fewer incorrect claims. This highlights a trade-off: LLM explanations are easy to read but can introduce plausible-sounding yet incorrect rationales, whereas Pontifex gives precise but technical feedback. We argue that in research and debugging contexts, the latter is preferable, and the two can be combined (e.g. have an LLM read Pontifex’s attributions and summarize them).
Error Analysis: Pontifex is not without limitations. In some cases where one embedding space was very noisy or weak for the task, it could actually confuse the convergence layer. We observed this with a monolingual embedding that was not well-aligned to XLM-R: if, say, the French CamemBERT model failed to pick up a nuance that XLM-R did, Pontifex initially gave low confidence to that nuance (since one space disagreed). If one space is substantially less semantically powerful, Pontifex’s strategy of equal parallel probing can be suboptimal. In future work, weighting or filtering out unreliable spaces (or iteratively improving them) could mitigate this. Another challenge was choosing the occlusion granularity. Byte-level occlusion sometimes produced fragment pairs that were individually too short to carry meaning (especially for very short inputs, or when occlusion percentage was high). We addressed this by skipping occlusions that left less than a few characters on a side, but occasionally an important single character (like a negation “no”) could be dropped and one fragment becomes empty, causing us to miss the signal. A potential remedy is to allow the occluded segment to be replaced with a neutral placeholder instead of a hard cut, to keep syntax. Despite these issues, the overall results indicate Pontifex is robust and achieves its primary aims of speed and cross-space semantic validation.
Strengths and Use Cases: Pontifex excels in scenarios requiring model-agnostic, language-agnostic analysis. For example, in an enterprise setting with many bilingual language models or a pipeline combining text and vision, Pontifex can serve as a unified interpretability layer that checks consistency of semantic content. It could be used to detect when two models disagree on the interpretation of an input – a valuable feature for model auditing. Another use case is zero-shot cross-lingual insight: an English-speaking analyst could run Pontifex on a document in an unfamiliar language (with a multilingual model and an English model in parallel). Pontifex would highlight which parts of the foreign text correspond to concepts that an English model finds important, effectively indicating what to translate or focus on. Because Pontifex operates on bytes, it could even be applied to domains like code (with code embeddings) or DNA sequences (with appropriate sequence embeddings) to identify important subsequences, demonstrating its generality. Moreover, Pontifex’s speed makes it suitable for interactive use: one could imagine a tool where a user highlights a part of an input and Pontifex instantly shows whether that part’s removal changes semantics in various models.
Limitations: A key limitation is that Pontifex needs access to multiple embedding models for the same input. In some cases, these might not be available. If one only has a single model, Pontifex reduces to an advanced occlusion method – still useful, but missing the multi-space angle. One might question: what if all models share the same blind spot? Pontifex cannot magically overcome that – if every space fails to encode a particular attribute, the convergence will falsely conclude that attribute is not present. This is why diversity of embedding spaces is important; using models trained differently (or on different modalities) provides complementary strengths. Another limitation is the training requirement for the convergence layer. In our experiments we trained it on synthetic data and known pairs, but in a truly unsupervised deployment, one might not have ground truth to train the convergence function. An alternative is to use unsupervised techniques like clustering or agreement maximization: e.g. assume that if two spaces strongly disagree systematically, it’s likely due to some representational quirk. Research is needed on how to adapt or pre-train the convergence layer without labeled data. Finally, the hypothesis generation module in Pontifex currently relies on some prior knowledge (like a pool of possible concepts to try) or on reinforcement learning that might require many runs. This could be slow if done naïvely, though still parallel across spaces. In practice, we constrained the hypothesis space (e.g. using a predefined vocabulary of plausible features for a dataset).
Comparison with Contrastive Learning and Probing: It is instructive to compare Pontifex’s post hoc approach with building similar ideas into training. For instance, one could train a joint model to produce occlusion-insensitive representations or to explicitly align multiple spaces (similar to multi-task or contrastive training). That might achieve some of Pontifex’s goals (like aligned embeddings) but sacrifices flexibility: Pontifex can be applied to models after the fact. This is crucial in many real-world cases where models are already trained and we want to audit or understand them without retraining. Contrastive learning already encodes semantics in embeddings, but Pontifex adds a layer of interpretability on top – it does not just give an embedding, it tells you which part of the input caused that embedding and validates it across models. In terms of embedding probes, Pontifex’s occlusion can be seen as a kind of probe revealing feature importance, while the convergence is like a probe revealing whether a feature is genuinely semantic (if multiple models acknowledge it).
Potential Improvements: One avenue is to incorporate gradient-based attributions alongside occlusion. Since we do have differentiable models, one could use integrated gradients or saliency within each space to get a quick importance map, then use Pontifex convergence to combine those maps. This hybrid might be faster yet and smooth out noise (gradients are single-shot but can be very noisy; occlusion is slower but more reliable, so they complement each other). Another improvement could be to extend neural convergence layers to handle more than similarity scores. Currently, we feed in similarity of a hypothesis in each space. We could also feed in raw predictions or other statistics. For example, if investigating a classifier, each model space might also yield a predicted label for \$x\$; agreement/disagreement on those predictions could be another signal for convergence to consider. This would merge interpretability with ensemble techniques – an exciting direction where Pontifex not only explains but also potentially improves predictions by consensus.
Multimodal and Unsupervised Extensions: Pontifex is inherently suited to multimodal analysis – we showed text+vision, but audio, video, or graph embeddings could join the mix. A fully multimodal Pontifex could tackle tasks like explaining a video captioning model by consulting an image model, a speech model (if there’s narration), and a text model in parallel. Each modality might highlight different aspects, giving a truly holistic explanation. As for unsupervised hypothesis generation, an ultimate goal would be for Pontifex to autonomously discover interpretable concepts in embeddings by leveraging multiple spaces. Imagine feeding in a complex scientific article embedding to two models (say, a scientific text model and a knowledge graph embedding); Pontifex could pose hypotheses (perhaps via generative means) like “is this about chemistry?” and see if both agree. By iterative narrowing – essentially performing unsupervised topic modeling with cross-space validation – Pontifex could generate human-relevant hypotheses about the data without any labels. Preliminary experiments in our work hinted that the reinforcement learning module can converge to sensible questions (like asking about high-level topics first). This could lead to unsupervised semantic discovery, using disagreement between models as a clue that there is latent structure to be uncovered.
We introduced Pontifex, a new architecture for interpretability that marries ultra-fast occlusion-based probing with cross-space semantic convergence. Pontifex provides a blueprint for how independent knowledge sources (embedding spaces) can be harnessed together to yield more reliable and general insights. In comprehensive experiments, we demonstrated that Pontifex is both efficient – significantly faster than traditional token-level occlusion or LLM explanations – and effective in aligning semantic interpretations across languages and modalities. It outperforms standard embedding probing in consistency and leverages the strengths of contrastive representations without requiring their joint training regime. By analyzing the same input through different “lenses” and finding their common view, Pontifex embodies the principle that meaning transcends representation.
This work opens several avenues for future research. One direction is truly multimodal convergence: extending our approach to simultaneously handle more than two spaces (e.g. an image, its caption, and an audio description) and developing convergence layers that scale with many inputs. Another direction is refining the hypothesis generation – making it unsupervised yet efficient, possibly via large language models to propose hypotheses that Pontifex then verifies (an interesting synergy between symbolic and sub-symbolic AI). We are also interested in applying Pontifex to domains like model debugging and safety: for instance, using cross-model agreement to detect when a harmful concept is present (if both a vision and a language model indicate something sensitive, we can be more certain). Lastly, an intriguing future path is to integrate Pontifex as a training signal itself: one could train new models to maximize agreement with an existing trusted model via Pontifex’s convergence score, effectively using it as a regularizer for semantic consistency.
In conclusion, Pontifex serves as a “bridge-builder” between disparate learned representations – a role increasingly vital in a world of many specialized AI systems. By unifying interpretability techniques and emphasizing consensus, Pontifex moves us toward explanations that are not only faster and broader, but also more faithful, because they are grounded in multiple independent perspectives. We believe this approach will help pave the way for more transparent and generalizable AI systems in the future.