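Backpropagation uses the chain rule to compute the derivative of the error with respect to the weights in every layer, and those gradients are then used to adjust the weights so the network's output better matches the target value. I guess one proof is that it works, but it's always nice to know why something works.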
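To make the chain rule part a bit more concrete, here is a minimal sketch in plain Python. Everything in it is my own assumption for illustration and not taken from the linked article: a toy network with one hidden unit, `y = w2 * tanh(w1 * x)`, a squared-error loss, and the made-up function names `forward`, `loss`, `backprop_grads`, `numeric_grads` and `train`. It computes the gradients by hand with the chain rule, and includes a finite-difference estimate you can compare them against.

```python
import math


def forward(w1, w2, x):
    """Toy two-layer network (assumed for illustration): y = w2 * tanh(w1 * x)."""
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # network output
    return h, y


def loss(w1, w2, x, target):
    """Squared error between the network output and the target value."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop_grads(w1, w2, x, target):
    """Gradients of the loss w.r.t. w1 and w2, built up with the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target                   # dL/dy for L = 0.5 * (y - target)^2
    dL_dw2 = dL_dy * h                   # dy/dw2 = h
    dL_dh = dL_dy * w2                   # dy/dh = w2, pass the error back a layer
    dL_dw1 = dL_dh * (1.0 - h * h) * x   # d tanh(u)/du = 1 - tanh(u)^2, du/dw1 = x
    return dL_dw1, dL_dw2


def numeric_grads(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to sanity-check backprop_grads."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


def train(w1, w2, x, target, lr=0.1, steps=200):
    """Plain gradient descent: step each weight against its gradient."""
    for _ in range(steps):
        g1, g2 = backprop_grads(w1, w2, x, target)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2
```

For example, comparing `backprop_grads(0.5, -0.3, 1.0, 0.8)` with `numeric_grads(0.5, -0.3, 1.0, 0.8)` should give nearly identical pairs, and the weights returned by `train(0.5, -0.3, 1.0, 0.8)` should give a lower `loss` than the starting weights. That is the "why" in miniature: each weight is moved in the direction that reduces the error, and the chain rule is just the bookkeeping that says how much each weight contributed.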
https://towardsdatascience.com/understan...c509ca9d0/