ARI Formalization

ARI solves a structured prediction problem over a bipartite graph between source and target field paths. The objective is to infer a globally consistent mapping via maximum a posteriori (MAP) inference in a constrained graphical model.

Problem setup

Let the sets of source and target field paths be:

$$S = \{s_i\} \qquad T = \{t_j\}$$

A mapping is a binary relation $R$:

$$R \subseteq S \times T$$

Define indicator variables:

$$x_{s,t} \in \{0,1\}, \quad (s,t) \in S \times T$$

where $x_{s,t} = 1$ means that source field $s$ is mapped to target field $t$.

Then, equivalently, we can define $R$ as:

$$R = \{(s,t) \mid x_{s,t} = 1\}$$
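The correspondence between the indicator variables and the relation can be sketched in a few lines of Python. The field paths below are hypothetical and chosen only for illustration; they do not come from ARI itself.

```python
from itertools import product

# Hypothetical field paths, chosen only for illustration.
S = ["order.id", "order.customer.name"]
T = ["orderId", "customerName", "total"]

# One indicator x[(s, t)] in {0, 1} for every pair in S x T.
x = {(s, t): 0 for s, t in product(S, T)}
x[("order.id", "orderId")] = 1
x[("order.customer.name", "customerName")] = 1

# The relation R is exactly the set of pairs whose indicator equals 1.
R = {pair for pair, value in x.items() if value == 1}
```

Storing the indicators as a dictionary keyed by pair keeps the sketch close to the notation; a dense 0/1 matrix of shape |S| by |T| would carry the same information.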

MAP objective

ARI defines a Gibbs distribution over mappings:

$$P(R) \propto \exp\left(-E(R)\right)$$

and seeks the MAP solution:

$$R^* = \arg\min_R E(R)$$
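The step from maximizing the posterior to minimizing the energy can be written out explicitly. Here $Z$ denotes the normalizing constant of the Gibbs distribution, a symbol the text above leaves implicit:

$$P(R) = \frac{1}{Z}\exp\left(-E(R)\right), \qquad Z = \sum_{R'} \exp\left(-E(R')\right)$$

$$\arg\max_R P(R) = \arg\max_R \big[-E(R) - \log Z\big] = \arg\min_R E(R)$$

Because $\log$ is monotone and $Z$ does not depend on $R$, the most probable mapping is the one with the lowest energy, and $Z$ never has to be computed for MAP inference.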

Energy decomposition

The energy decomposes into unary and pairwise terms:

$$E(x) = \sum_{(s,t)\in S\times T} \psi_u(s,t)\,x_{s,t} + \sum_{(s,t)\neq(s',t')} \psi_p\big((s,t),(s',t')\big)\,x_{s,t}x_{s',t'}$$

where:

  • $\psi_u$ encodes local compatibility
  • $\psi_p$ encodes structural consistency

This corresponds to a pairwise Markov random field over candidate matches.
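As a rough illustration of the decomposition, the sketch below evaluates the energy of a small assignment. The potentials are made-up numbers, and treating a missing pairwise entry as zero is an assumption of this sketch rather than something the formalization specifies.

```python
from itertools import product

def energy(x, psi_u, psi_p):
    """Unary terms plus pairwise terms over ordered pairs of distinct active candidates."""
    active = [pair for pair, value in x.items() if value == 1]
    unary = sum(psi_u[pair] for pair in active)
    pairwise = sum(
        psi_p.get((a, b), 0.0)  # assumption: unspecified pairwise potentials are 0
        for a, b in product(active, repeat=2)
        if a != b
    )
    return unary + pairwise

# Toy potentials (illustrative values only); lower energy means a better mapping.
psi_u = {("a", "A"): -2.0, ("a", "B"): -0.5, ("b", "A"): -0.25, ("b", "B"): -1.5}
psi_p = {
    (("a", "A"), ("b", "B")): -0.5,
    (("b", "B"), ("a", "A")): -0.5,
}

x = {("a", "A"): 1, ("a", "B"): 0, ("b", "A"): 0, ("b", "B"): 1}
E = energy(x, psi_u, psi_p)
```

Note that the pairwise sum runs over ordered pairs, matching the condition $(s,t)\neq(s',t')$, so a symmetric potential is counted once in each direction.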

Feature representation

Each candidate pair is mapped to a feature vector:

$$\phi : S \times T \to \mathbb{R}^d$$

The feature vector of pair $(s,t)$ is written in bold as $\mathbf{x}_{s,t}$, to distinguish it from the scalar indicator $x_{s,t}$:

$$\mathbf{x}_{s,t} = \phi(s,t)$$

Typical features include:

  • Lexical similarity
  • Structural context
  • Ontology compatibility
  • Embedding similarity: $\phi_{\text{emb}}(s,t) = \langle f(s), g(t) \rangle$
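A feature map of this kind might look like the following sketch. The concrete features are assumptions made for illustration: lexical similarity is approximated with `difflib.SequenceMatcher` on the last path segment, structural context with a nesting-depth gap, and the embeddings $f(s)$ and $g(t)$ are passed in as precomputed vectors. Ontology compatibility is omitted.

```python
from difflib import SequenceMatcher

def lexical_similarity(s, t):
    """Character-level similarity of the last path segments, in [0, 1]."""
    leaf_s = s.split(".")[-1].lower()
    leaf_t = t.split(".")[-1].lower()
    return SequenceMatcher(None, leaf_s, leaf_t).ratio()

def depth_gap(s, t):
    """Crude structural-context feature: difference in nesting depth."""
    return abs(s.count(".") - t.count("."))

def embedding_similarity(f_s, g_t):
    """Inner product <f(s), g(t)> of two precomputed embedding vectors."""
    return sum(a * b for a, b in zip(f_s, g_t))

def phi(s, t, f_s, g_t):
    """Feature vector phi(s, t) with d = 3 in this sketch."""
    return [
        lexical_similarity(s, t),
        float(depth_gap(s, t)),
        embedding_similarity(f_s, g_t),
    ]

# Hypothetical paths and two-dimensional embeddings.
features = phi("order.customer.name", "customer.name", [1.0, 0.0], [0.5, 0.5])
```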

Unary scoring

Unary potentials (scores) are parameterized as:

$$\psi_u(s,t) = -f_\theta(\mathbf{x}_{s,t})$$

Examples:

  • Linear: $f_\theta = w^\top \mathbf{x}$
  • Tree/MLP models (GBDT, neural scoring)

Candidate pruning retains:

$$C_k(s) = \operatorname{Top\text{-}k}_{f_\theta(\mathbf{x}_{s,t})} \{t \in C(s)\}$$
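The linear scorer and the top-k pruning step can be sketched together. The weights, feature values, and names such as `top_k_candidates` are hypothetical; a GBDT or neural scorer would replace `linear_score` without changing the pruning logic.

```python
import heapq

def linear_score(w, x_vec):
    """f_theta(x) = w^T x for a linear scorer."""
    return sum(wi * xi for wi, xi in zip(w, x_vec))

def top_k_candidates(s, candidates, features, w, k):
    """C_k(s): the k targets in C(s) with the highest score, best first."""
    return heapq.nlargest(
        k, candidates, key=lambda t: linear_score(w, features[(s, t)])
    )

# Illustrative weights and feature vectors for one source field.
w = [1.0, -0.5]
features = {
    ("s1", "t1"): [0.9, 0.0],
    ("s1", "t2"): [0.4, 1.0],
    ("s1", "t3"): [0.7, 0.0],
}

pruned = top_k_candidates("s1", ["t1", "t2", "t3"], features, w, k=2)

# Unary potentials are the negated scores of the surviving candidates.
psi_u = {("s1", t): -linear_score(w, features[("s1", t)]) for t in pruned}
```

Pruning per source field bounds the number of indicator variables at $k\,|S|$ instead of $|S|\,|T|$, which is what keeps the later optimization tractable.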

Pairwise / structured scoring

Pairwise potentials capture dependencies:

$$\psi_p((s,t),(s',t')) = -p_\theta((s,t),(s',t'))$$

Examples:

  • Cross-encoder: $p_\theta = h_\theta(s,t,s',t')$
  • Structured models (CRF / GNN): $p_\theta = \psi_\theta(\mathcal{G}, (s,t), (s',t'))$

These enforce:

  • Structural alignment
  • Co-occurrence patterns
  • Ontological consistency
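To make the structural-alignment idea concrete, here is a hand-written stand-in for $p_\theta$. It is not a cross-encoder or a CRF/GNN; it is a hypothetical rule that rewards two matches when sibling fields in the source stay siblings in the target, and penalizes them when that relationship is broken. The `bonus` and `penalty` parameters are assumptions of the sketch.

```python
def parent(path):
    """Parent of a dotted field path, or None for a top-level field."""
    head, sep, _ = path.rpartition(".")
    return head if sep else None

def p_theta(pair_a, pair_b, bonus=1.0, penalty=1.0):
    """Toy pairwise score: positive when sibling structure is preserved."""
    (s, t), (s2, t2) = pair_a, pair_b
    siblings_in_source = parent(s) == parent(s2)
    siblings_in_target = parent(t) == parent(t2)
    if siblings_in_source and siblings_in_target:
        return bonus       # siblings map to siblings
    if siblings_in_source != siblings_in_target:
        return -penalty    # sibling relationship broken by the mapping
    return 0.0             # unrelated on both sides: no evidence either way

def psi_p(pair_a, pair_b):
    """Pairwise potential is the negated pairwise score."""
    return -p_theta(pair_a, pair_b)

aligned = p_theta(("user.name", "customer.name"), ("user.email", "customer.email"))
broken = p_theta(("user.name", "customer.name"), ("user.email", "billing.email"))
```

A learned model would replace the rule with a score predicted from the two pairs and their graph context, but it would plug into the energy in the same way.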

Constrained optimization

The MAP problem can be written as an integer quadratic program.

Using $\psi_u(s,t) = -f_\theta(\mathbf{x}_{s,t})$ and $\psi_p((s,t),(s',t')) = -p_\theta((s,t),(s',t'))$, the MAP objective

$$R^* = \arg\min_R E(R)$$

becomes (equivalently)

$$\max_{x\in\{0,1\}^{|S|\times|T|}} \; \sum_{(s,t)} f_\theta(\mathbf{x}_{s,t})\,x_{s,t} + \sum_{(s,t)\neq(s',t')} p_\theta\big((s,t),(s',t')\big)\,x_{s,t}x_{s',t'}$$

subject to:

One-to-one constraints

$$\sum_{t} x_{s,t} \le 1 \quad \forall s \qquad \sum_{s} x_{s,t} \le 1 \quad \forall t$$

Type / ontology constraints

$$x_{s,t} = 0 \quad \text{if } \text{incompatible}(s,t)$$

Structural constraints

  • Mutual exclusion: $x_{s,t} + x_{s',t'} \le 1$
  • Hierarchical consistency: $x_{s,t} \le x_{\text{parent}(s), \text{parent}(t)}$

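With the quadratic pairwise term, this is a binary quadratic program. Replacing each product $x_{s,t}x_{s',t'}$ with an auxiliary binary variable and linear linking constraints turns it into an ILP.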
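For very small instances, the constrained problem can be solved by exhaustive search, which is a convenient way to check a real solver's output. The sketch below is such a brute-force reference, not ARI's solver: `solve_map`, `incompatible`, and `parent_pair` (a map from a child pair to the parent pair it requires) are hypothetical names, and the generic mutual-exclusion constraint is left out for brevity.

```python
from itertools import combinations

def solve_map(candidates, f, p, incompatible=frozenset(), parent_pair=None):
    """Maximize sum of f plus sum of p over feasible subsets (exponential time)."""
    parent_pair = parent_pair or {}
    # Type / ontology constraint: incompatible pairs are fixed to 0.
    allowed = [c for c in candidates if c not in incompatible]
    best, best_score = frozenset(), 0.0  # the empty mapping is always feasible
    for r in range(1, len(allowed) + 1):
        for subset in combinations(allowed, r):
            chosen = frozenset(subset)
            # One-to-one: no source and no target may appear twice.
            if len({s for s, _ in subset}) < r or len({t for _, t in subset}) < r:
                continue
            # Hierarchical consistency: a child pair needs its parent pair.
            if any(c in parent_pair and parent_pair[c] not in chosen for c in subset):
                continue
            score = sum(f[c] for c in subset)
            score += sum(p.get((a, b), 0.0) for a in subset for b in subset if a != b)
            if score > best_score:
                best, best_score = chosen, score
    return best, best_score

# Illustrative scores for a 2 x 2 problem.
candidates = [("a", "A"), ("a", "B"), ("b", "A"), ("b", "B")]
f = {("a", "A"): 2.0, ("a", "B"): 0.5, ("b", "A"): 1.2, ("b", "B"): 1.5}

mapping, score = solve_map(candidates, f, p={})
restricted, _ = solve_map(candidates, f, p={}, incompatible={("a", "A")})
```

The search visits every subset of candidates, so it is only usable on a handful of pairs; realistic schemas need an ILP or QP solver, or an approximate inference method.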

Solution

The optimal mapping is read off the optimizer $x^*$, restricted to the pruned candidates:

$$R^* = \{(s,t) \mid t \in C_k(s),\; x^*_{s,t} = 1\}$$

Training objective

The models are trained over heterogeneous datasets:

$$D = \alpha D_{\text{pre}} + \beta D_{\text{gold}} + \gamma D_{\text{feedback}} + \delta D_{\text{neg}}$$
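One plausible reading of the weighted sum, and it is only an assumption of this sketch, is that the coefficients act as sampling weights when training batches are drawn from the four sources. The dataset contents and the `sample_batch` helper below are hypothetical.

```python
import random

def sample_batch(datasets, weights, batch_size, rng):
    """Draw a batch where each example's source is chosen in proportion to its weight."""
    names = list(datasets)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(name, rng.choice(datasets[name])) for name in picks]

# Placeholder examples; real entries would be labeled field-pair records.
datasets = {
    "pre": ["pre-0", "pre-1"],
    "gold": ["gold-0"],
    "feedback": ["fb-0", "fb-1"],
    "neg": ["neg-0"],
}
# alpha, beta, gamma, delta as sampling weights (illustrative values).
weights = {"pre": 0.5, "gold": 0.3, "feedback": 0.2, "neg": 0.0}

batch = sample_batch(datasets, weights, batch_size=8, rng=random.Random(0))
```

Another equally valid reading is that the coefficients weight each source's contribution to the loss; the formula alone does not settle which is meant.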

Optimize:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{contrastive}} + \lambda_2 \mathcal{L}_{\text{hard-neg}} + \lambda_3 \mathcal{L}_{\text{ranking}}$$
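The three terms are not defined further here, so the sketch below fills them in with common choices that should be read as assumptions: an InfoNCE-style contrastive loss, a hinge on the hardest negative, and a mean pairwise hinge for ranking. Scores are plain floats standing in for model outputs on one positive and several negative candidates.

```python
import math

def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """InfoNCE-style loss: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [n / temperature for n in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

def hard_negative_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss against the highest-scoring (hardest) negative only."""
    return max(0.0, margin - pos_score + max(neg_scores))

def ranking_loss(pos_score, neg_scores, margin=1.0):
    """Mean pairwise hinge loss over all negatives."""
    return sum(max(0.0, margin - pos_score + n) for n in neg_scores) / len(neg_scores)

def total_loss(pos_score, neg_scores, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms with weights lambda_1, lambda_2, lambda_3."""
    l1, l2, l3 = lambdas
    return (
        l1 * contrastive_loss(pos_score, neg_scores)
        + l2 * hard_negative_loss(pos_score, neg_scores)
        + l3 * ranking_loss(pos_score, neg_scores)
    )

loss = total_loss(2.0, [0.0, 1.5])
```

In practice these would be computed on batched tensors in an autodiff framework so that gradients flow back to $\theta$; the scalar version only shows how the terms combine.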

References

Cohesive ARI Architecture