9. Advanced - How to Define Custom Gates

In an earlier example notebook, we showed you the basics of defining custom gates. The approach demonstrated there may already be enough for your purposes, and potentially already as efficient as possible. In this notebook, we will discuss the considerations that go into making high-performing gates.

using Revise
using PauliPropagation

The SWAP example

Let us again consider the SWAP gate example

struct CustomSWAPGate <: StaticGate
    qinds::Tuple{Int, Int}  # The two sites to be swapped
end

Again define the action,

function PauliPropagation.apply(gate::CustomSWAPGate, pstr, coeff; kwargs...)
    # get the Pauli on the first site
    pauli1 = getpauli(pstr, gate.qinds[1])
    # get the Pauli on the second site
    pauli2 = getpauli(pstr, gate.qinds[2])
    
    # set the Pauli on the first site to the second Pauli
    pstr = setpauli(pstr, pauli2, gate.qinds[1])
    # set the Pauli on the second site to the first Pauli
    pstr = setpauli(pstr, pauli1, gate.qinds[2])

    # apply() is always expected to return a tuple of (pstr, coeff) tuples
    return tuple((pstr, coeff))
end

Now set up a bigger simulation with 25 qubits on a 5 by 5 grid.

nx = 5
ny = 5
nq = nx * ny

topology = rectangletopology(nx, ny)

nl layers of a circuit consisting of RX and RZZ Pauli rotations, but insert a layer of swaps in between.

trotter_layer = tfitrottercircuit(nq, 1; topology=topology);

nl = 3
ourSWAP_circuit = Gate[]
append!(ourSWAP_circuit, trotter_layer)
for _ in 2:nl
    append!(ourSWAP_circuit, (CustomSWAPGate(pair) for pair in topology))
    append!(ourSWAP_circuit, trotter_layer)
end
nparams = countparameters(ourSWAP_circuit)
195

Define our observable as $Z_7 Z_{13}$.

pstr = PauliString(nq, [:Z, :Z], [7, 13])
PauliString(nqubits: 25, 1.0 * IIIIIIZIIIIIZIIIIIII...)

Circuit parameters with a random seed.

using Random
Random.seed!(42)
thetas = randn(nparams)

For this notebook, we will use a minimum coefficient threshold. The results are still almost exact in this simple case.

min_abs_coeff = 1e-3
0.001

Run the circuit

@time ourSWAP_psum = propagate(ourSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)
  3.075229 seconds (2.46 M allocations: 121.152 MiB, 0.77% gc time, 97.81% compilation time)





PauliSum(nqubits: 25, 13198 Pauli terms:
 -0.013537 * IIZZIZXXIIIIYZIIIZII...
 0.0025156 * IIIIIIXYIIIIZIIIIXZI...
 -0.0016705 * IIIIIIYYXIIXYZIIZZII...
 0.002661 * IIIZIIZYYIIZYXIIIZZI...
 -0.0027996 * IIXZIIYYIIIIYZIIIYZI...
 -0.001725 * IIZZIIYYZIIIZYIIIIZI...
 0.0013354 * IIZZIYYYIIIIXIIIIXII...
 0.001438 * IIIIIIXYZIIZIYIIIIZZ...
 0.0015348 * IZIIIIXIIIIXYIIIIXZI...
 -0.0018285 * IZIIIZYIIIIYYIIIIYZI...
 0.0034953 * IIIIIIZYIIIIZIIIIZII...
 0.0027765 * IIIIIIYIZIIXXXZIZYZI...
 -0.0010834 * IIZIIZXYZIIIYYZIIZZZ...
 -0.0030723 * IIZZIYYZZIIIYYIIIYII...
 -0.0011065 * IZIZIIXZIIIIYZIIIXZI...
 0.0025227 * IIXIIZZIIIIZXIIIIYZI...
 0.001622 * IZIIIYYYIIIXXIIIZZII...
 -0.0020525 * IIIIIYYYIIIXYIIIZZII...
 -0.0025499 * IIIIIIXIIIIXXIIIZXZI...
 -0.0066057 * IIYZIZZIIIIZXIIIIZII...
  ⋮)

Overlap with the zero-state:

overlapwithzero(ourSWAP_psum)
0.11159958430219122

We already mentioned that PauliPropagation.jl contains a CliffordGate implementation of SWAP. Let's implement the same thing and compare performance.

cliffSWAP_circuit = Gate[]
append!(cliffSWAP_circuit, trotter_layer)
for _ in 2:nl
    append!(cliffSWAP_circuit, (CliffordGate(:SWAP, pair) for pair in topology))
    append!(cliffSWAP_circuit, trotter_layer)
end
@time cliffSWAP_psum = propagate(cliffSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)
  0.308252 seconds (110.23 k allocations: 6.868 MiB, 79.18% compilation time)

Are the results the same?

overlapwithzero(cliffSWAP_psum)
0.11159958430219122
cliffSWAP_psum == ourSWAP_psum
true

Yes! We can also benchmark the performance.

using BenchmarkTools
@btime propagate($ourSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff)
  44.344 ms

 (1355 allocations: 1.83 MiB)
@btime propagate($cliffSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff)
  41.943 ms

 (1355 allocations: 1.83 MiB)

No downside at all from defining our custom gate. How? This is because the apply function for this gate is type stable! Type stability is absolutely crucial in Julia, and codes live and die by it.

@code_warntype apply(CustomSWAPGate((1, 1)), pstr.term, 1.0)
MethodInstance for apply(::CustomSWAPGate, ::PauliPropagation.UInt56, ::Float64)


  from apply(

gate::CustomSWAPGate, pstr, coeff; kwargs...) @ Main In[3]:1
Arguments


  #self#::Core.Const(PauliPropagation.PropagationBase.apply)


  gate::CustomSWAPGate


  pstr::PauliPropagation.UInt56
  coeff::Float64
Body::Tuple{Tuple{PauliPropagation.UInt56, Float64}}
1 ─

 %1 = Main

.:(var"#apply#1")::Core.Const(Main.var"#apply#1")
│   %2 = Core

.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%

2)::Core.Const(Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())


│   %4 = (%1)(%3, #self#, gate, pstr, coeff)::Tuple{Tuple{PauliPropagation.UInt56, Float64}}


└──      return %4

All blue means that everything is great! If correctly implemented, apply will be type stable if it returns a known number of Pauli and coefficient pairs. Here it is just 1 because it is a Clifford gate.

A gate that branches into more than one Pauli string

Onto an example of a gate that can split a Pauli string into two: The T gate.

struct CustomTGate <: StaticGate
    qind::Int
end

A T gate is a non-Clifford gate that commutes with I and Z, splits X into cos(π/4)X - sin(π/4)Y, and Y into cos(π/4)Y + sin(π/4)X (in the Heisenberg picture).

Let's write the code for that.

function PauliPropagation.apply(gate::CustomTGate, pstr, coeff; kwargs...)
    # get the Pauli on the site `gate.qind`
    pauli = getpauli(pstr, gate.qind)
    
    if pauli == 0 || pauli == 3  # I or Z commute
        # return a tuple of one (pstr, coeff) tuple
        return tuple((pstr, coeff))     
    end
    
    if pauli == 1 # X goes to X, -Y
        new_pauli = 2  # Y
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = -1 * coeff * sin(π/4)
        
    else # Y goes to Y, X
        new_pauli = 1  # X
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = coeff * sin(π/4)
    end

    updated_coeff = coeff * cos(π/4)

    # return a tuple of two (pstr, coeff) tuples
    return tuple((pstr, updated_coeff), (new_pstr, new_coeff))
    
end

Insert a layer of TGates after each Trotter layer.

ourT_circuit = Gate[]
for _ in 1:nl
    append!(ourT_circuit, trotter_layer)
    append!(ourT_circuit, (CustomTGate(qind) for qind in 1:nq)) 
end

And run:

@time ourT_psum = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)
  0.370778 seconds (119.49 k allocations: 7.802 MiB, 75.85% compilation time)





PauliSum(nqubits: 25, 15787 Pauli terms:
 -0.0032424 * IXIIIIZXXIIZXZIIIZII...
 0.0027962 * IZZIIIIYIIIZXZIIIZII...
 -0.0014795 * IIZIIIZXZIIZZXZIIIZI...
 0.0037487 * IZIIIYXZIIIXXIIIZZII...
 0.0014359 * IZIIIIYZIIIXYIIIZXZI...
 -0.0017245 * IYYIIIXXXIIIXIIIIZII...
 -0.0029594 * IXIIIIXXIIIIYZIIIZII...
 0.0010417 * XXZIIZXYIIIIYZIIIZII...
 0.0025217 * IIIIIZXIZIIIYYIIZXII...
 -0.0018537 * IXIIIIYIIIIZXZIIIYZI...
 -0.0010255 * IZIIIZYZIIIXYZIIZYII...
 -0.0037346 * IZIIIYXYIIIXYZIIZZII...
 -0.0010858 * IZIIIIYIIIIZXZIIZYZI...
 -0.0017078 * IZZIIYXYZIIZXZIIIZII...
 0.0013272 * IYIIIYYXXIIIYZIIIZII...
 0.0010339 * IIIIIIZIZZIIYXYIIYIZ...
 0.0011072 * IZIIIIXIIIIXYIIIIXZI...
 -0.0010679 * IIIZIIZIYIIIYZZIIZII...
 -0.0010275 * IXZIIYXYZIIZXZIIIZII...
 0.0012996 * IYIIIYYIIIIZZIIIIZII...
  ⋮)
overlapwithzero(ourT_psum)
0.3146654070299996

But did it work? Again, we have an implementation of a TGate in our library. In case you are interested, we currently implement T gates as Pauli Z rotations at an angle of π/4. Let's compare to that.

libraryT_circuit = Gate[]
for _ in 1:nl
    append!(libraryT_circuit, trotter_layer)
    append!(libraryT_circuit, (TGate(qind) for qind in 1:nq)) 
end

If you call PauliGate(:Z, qind, parameter), this will create a so-called FrozenGate wrapping the parametrized PauliGate, with a fixed parameter at the time of circuit construction.

Run it and compare

@time libraryT_psum = propagate(libraryT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)
  0.100648 seconds (15.11 k allocations: 2.890 MiB, 16.24% compilation time)
overlapwithzero(libraryT_psum)
0.3146654070299997
libraryT_psum == ourT_psum
true

It works! But is it optimal?

using BenchmarkTools
@btime propagate($ourT_circuit, $pstr, $thetas;min_abs_coeff=$min_abs_coeff)
  65.407 ms

 (1351 allocations: 2.22 MiB)
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff)
  54.233 ms

 (1656 allocations: 2.24 MiB)

No, because apply for the CustomTGate is not type-stable.

@code_warntype apply(CustomTGate(1), pstr.term, 0.0)
MethodInstance for apply(::CustomTGate, ::PauliPropagation.UInt56, ::Float64)
  from apply(gate::CustomTGate, pstr, coeff; kwargs...) @ Main In[21]:1
Arguments
  #self#::Core.Const(PauliPropagation.PropagationBase.apply)
  gate::CustomTGate
  pstr::PauliPropagation.UInt56
  coeff::Float64
Body::Union{Tuple{Tuple{PauliPropagation.UInt56, Float64}}, Tuple{Tuple{PauliPropagation.UInt56, Float64}, Tuple{PauliPropagation.UInt56, Float64}}}
1 ─ %1 = Main.:(var"#apply#8")::Core.Const(Main.var"#apply#8")
│   %2 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%2)::Core.Const(Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())
│   %4 = (%1)(%3, #self#, gate, pstr, coeff)::Union{Tuple{Tuple{PauliPropagation.UInt56, Float64}}, Tuple{Tuple{PauliPropagation.UInt56, Float64}, Tuple{PauliPropagation.UInt56, Float64}}}
└──      return %4

It either returns a tuple of one tuple Tuple{Tuple{UInt64, Float64}} or a tuple of two tuples Tuple{Tuple{UInt64, Float64}, Tuple{UInt64, Float64}}. Yellow @code_warntype output means it might be okay (it is not that much slower after all), but be wary of red. When this is the case, you may want to define some more involved functions above apply() for optimal performance. This is how we would do it. To avoid such type instabilities, we can overload a higher level function applytoall!(). This is also usefull because the runtime of the T-gate simulation is dominated by commutation (because the Pauli I is very comon for local observables), we could leave those commuting Pauli strings where they are -> in their original Pauli sum. For this, we can overload the function applytoall!(), which differs in that one performs the loop over the Pauli strings in the Pauli sum here, and one can thus use the old Pauli sum more flexibly. You will receive a AbstractPauliPropagationCache object (here a PauliPropagationCache that wraps two PauliSum objects. Our convention is that anything left in the main psum or the auxiliary aux_psum is later merged back into psum. Thus, we can simply skip the commuting Pauli strings, and edit the coefficient of Pauli strings in-place. See this version of the function:

function PauliPropagation.applytoall!(gate::CustomTGate, prop_cache::AbstractPauliPropagationCache; kwargs...)
    psum = mainsum(prop_cache)
    aux_psum = auxsum(prop_cache)
    
    for (pstr, coeff) in psum 
        # the content of the previous apply() function:
        pauli = getpauli(pstr, gate.qind)

        if pauli == 0 || pauli == 3  # I or Z commute
            # do nothing
            continue
        end

        if pauli == 1 # X goes to X, -Y
            new_pauli = 2  # Y
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = -1 * coeff * sin(π/4)
        else # Y goes to Y, X
            new_pauli = 1  # X
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = coeff * sin(π/4)
        end

        updated_coeff = coeff * cos(π/4)

        # you can use add!() if the Pauli sums may already contain that string 
        # here we know they don't, so we use set!() for minorly better performance
        set!(psum, pstr, updated_coeff)
        set!(aux_psum, new_pstr, new_coeff)
    end
    return prop_cache
end
@time ourT_psum2 = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)
  0.208963 seconds (46.88 k allocations: 4.395 MiB, 46.51% compilation time: 100% of which was recompilation)
overlapwithzero(ourT_psum2)
0.3146654070299997
ourT_psum == ourT_psum2
true

And check the performance. It is equivalent for all intents and purposes.

@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff)
  54.878 ms

 (1356 allocations: 2.23 MiB)
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff)
  53.936 ms

 (1656 allocations: 2.24 MiB)

Enjoy defining custom and high-performance gates!