8. An example of Automatic Differentiation
Pauli propagation has promise in optimizing variational models because its elementary operations are very simple and automatically differentiable. Additionally, it may be the case that strongly truncated simulations still yield final parameters that translate to high-quality quantum circuits on, for example, a quantum computer. This means that we could (perhaps) very cheaply train/optimize, and then deploy the learned model on a different classical or quantum method. There are some initial indications of this being possible, but they require further exploration.
In this notebook, we demonstrate how to calculate fast automatic gradients of a quantum circuit using Pauli propagation.
using PauliPropagationLet us set up a simulation:
nq = 64
topology = bricklayertopology(nq)We define a transverse field Hamiltonian, whose derivative we will compute. This could be used within a variational energy minimization routine to find its ground state.
The Hamiltonian here reads $H = \sum_{i}X_i + \sum_{\langle i, j\rangle}Z_iZ_j$ where $\langle i, j\rangle$ denotes neighbors on the topology.
H = PauliSum(nq)
for qind in 1:nq
add!(H, :X, qind, 1.0)
end
for pair in topology
add!(H, [:Z, :Z], collect(pair), 1.0)
end
HPauliSum(nqubits: 64, 127 Pauli terms:
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IZZIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIXIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIXIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIXIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIXIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIXIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
1.0 * IIIIIIIIIIIIIIIIIIII...
⋮)Define some generic quantum circuit
nl = 4
# if we begin with the ZZ layer, it commutes with the |0> state and gives zero gradients
circuit = tfitrottercircuit(nq, nl; topology=topology, start_with_ZZ=false)
nparams = countparameters(circuit)508Importantly, we need to set our truncations. Depending on which package and which method you are using to compute your gradients, you can use different truncations.
ReverseDiff.jl for example is a sophisticated package for automatic reverse-mode differentiation (others may or may not work better). It will build a computational graph that it then differentiates using the chain rule. This is how large-scale neural networks are trained, and is commonly referred to as gradient backpropagation. The challenge here is that the fastest gradients are achieved with a compiled version of the gradient function where the graph for the chain rule is computed once (to the best of our knowledge). In that case, only truncations during the initial computation will be respected. Truncations that we think work well here are max_weight, max_freq, and max_sins, as they do not depend on the particular parameters of the quantum circuit. On the other hand, which paths are explore with truncations such as min_abs_coeff will not be updated (again, to the best of our knowledge) as the gradients are computed. If you want to use such parameter-dependent truncations, you may need to use un-compiled gradient functions.
Mooncake.jl is an up-and-coming package written fully in Julia that differentiates the code itself. It allows for reverse-mode differentiation with truncations like min_abs_coeff working and being differentiated as intended. We hope to provide a tutorial for this library soon.
ForwardDiff.jl is a straight-up improvement over manual numerical differentiation, both in terms of speed and accuracy. Compared to reverse-mode differentiation libraries it is slower for circuits with more than several dozen parameters, but it requires very little additional memory. We will use it for this notebook. Generate some generic parameters:
using Random
Random.seed!(42)
thetas = randn(nparams)One expectation evaluation
min_abs_coeff = 1e-4
max_weight = 5
@time psum = propagate(circuit, H, thetas; min_abs_coeff, max_weight);
overlapwithzero(psum) 2.738709 seconds (2.16 M allocations: 105.062 MiB, 0.84% gc time, 99.44% compilation time)
7.26631983654383Note: To make these following functionsfaster, you should look into let blocks for local variable namespaces. You will want to define closures and refrain from using global variables.
What does not work:
Now wrap these steps into a function that takes only thetas as argument. (keep in mind our comment about using let blocks)
This loss function does not work because the ForwardDiff package needs to propagate its custom coefficient type. But H is already strictly typed. So the following loss function would not be automatically differentiable.
function naivelossfunction(thetas)
# this function is not fully type-stable because min_abs_coeff and max_weight are global variables
psum = propagate(circuit, H, thetas; min_abs_coeff, max_weight);
return overlapwithzero(psum)
endnaivelossfunction (generic function with 1 method)The loss works
@time naivelossfunction(thetas) 2.331813 seconds (6.49 M allocations: 317.200 MiB, 12.13% gc time, 98.95% compilation time)
7.26631983654383But the gradient would break with a cryptic error message.
# using Pkg; Pkg.add("ForwardDiff")
using ForwardDiff: gradient
## this errors
# gradient(naivelossfunction, thetas)What works:
To create a loss function that indeed works, we need the propagating Hamiltonian to have the correct coefficient type within the loss function. What tends to work is to type the coefficients in the Pauli sum to the element type of thetas, but this will not always work. In this case, it will make everything differentiable.
Then we need to use the propagate!() function because otherwise the ForwardDiff library errors or yields zero gradients because we copy the Pauli sum while that library is accumulating the derrivative. Annoying... we agree! We will keep an eye out for better options.
For maximal performance, we will want to make all functions type stable, which can be achieved by a function closure or function factory: A function that outputs a function. In this case, we want to output a loss function that has all arguments other than thetas pre-loaded.
function generate_lossfunction(circuit, H, min_abs_coeff, max_weight)
function lossfunction(thetas)
# differentiation libraries use custom types to trace through the computation
# we need to make all of our objects typed like that so that nothing breaks
CoeffType = eltype(thetas)
# convert H to use the correct coefficient type
# otherwise it would error like above
grad_ready_H = convertcoefftype(CoeffType, H)
# be also need to run the in-place version with `!`, because by default we copy the Pauli sum
output_H = propagate!(circuit, grad_ready_H, thetas; min_abs_coeff, max_weight);
return overlapwithzero(output_H)
end
return lossfunction
endgenerate_lossfunction (generic function with 1 method)# define the loss function that works with AD
lossfunction = generate_lossfunction(circuit, H, min_abs_coeff, max_weight)@time lossfunction(thetas) 0.037470 seconds (22.22 k allocations: 1.156 MiB, 67.65% compilation time)
7.26631983654383And gradients work.
# the gradients of the first ZZ layer is
@time gradient(lossfunction, thetas) 4.212837 seconds (3.71 M allocations: 210.103 MiB, 1.20% gc time, 66.83% compilation time)
508-element Vector{Float64}:
-0.6772329930789089
-0.22543783632035608
-0.42292925319813446
-0.4243076836337615
-0.7523379732310989
-0.07792066987114322
-0.8515536079185417
0.01990979942877011
0.8974187316400088
-0.25922854058847183
0.506894173618708
-0.30613329539066253
-0.8354345999607842
⋮
1.0546803431129719
-0.22261031482081436
0.4973379719992734
0.8133558792992438
0.24969214893578057
0.6969899813204755
0.17190365988833814
0.6511464180672394
0.03797818611225978
0.3563370395924771
-0.21643500323976758
0.046996169676417704You are now ready to go! Plug this gradient function into your favorite optimizer and see what happens.
Remember:
- These gradients are approximate when using truncation.
- Forward-mode differentiation scales linearly with the number of parameters in addition to the extra runtime of simulating a deeper circuit.
- Reverse-mode differentiation can be faster, and drastically so when the gradient is compiled at the cost of large memory overheads.
- You can also propagate each Pauli term in the Hamiltonian individually and add up their gradients. That can be easily multi-threaded.