James Martens has been publishing a bag of tips and tricks for the large-scale non-convex optimization problems that arise in training deep neural networks and recurrent neural networks (RNNs). With these techniques he was able to train deep networks without pretraining to results better than the state of the art, and to train RNNs far more effectively than backpropagation through time. The core idea is to use second-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. The approach is "Hessian-free" because CG only needs Hessian-vector products, i.e., the Hessian evaluated along a single direction, which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function is repeatedly approximated locally by a quadratic, which is then minimized. Some of the tricks are:
- Use linear conjugate gradient for the inner quadratic minimization, rather than other quasi-Newton methods such as L-BFGS or nonlinear conjugate gradient.
- Use the Gauss-Newton approximation. For non-convex problems the Hessian can have negative eigenvalues, which leads to erratic behavior of CG, since CG assumes a positive-definite matrix. Hence they use the Gauss-Newton approximation, which discards the second-order derivative terms and is guaranteed to be positive semidefinite: for a least-squares objective f(θ) = ½‖r(θ)‖², the Hessian is H = JᵀJ + Σᵢ rᵢ ∇²rᵢ, and Gauss-Newton simply ignores the second term.
- Use the fraction of improvement in the quadratic objective as the termination condition for CG (instead of the usual residual-norm condition).
- Add regularization (damping) to the Hessian (or its approximation), and update the damping/trust-region parameter via a Levenberg-Marquardt style heuristic.
- Do semi-online, mini-batch updates.
- For training RNNs, use structural damping, which penalizes changes to highly sensitive parameters, i.e., parameters whose change would cause large changes in the hidden-state sequence.
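To make the matrix-free idea above concrete, here is a minimal NumPy sketch: a Hessian-vector product via finite differences of the gradient, and a linear CG loop that minimizes the local quadratic q(p) = ½pᵀHp − bᵀp, stopping on a relative-improvement test in the spirit of Martens' criterion. The function names, the finite-difference trick, and the specific constants are illustrative assumptions, not the papers' exact procedure (Martens computes exact curvature-vector products with the R-operator).

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-6):
    # Finite-difference approximation Hv ~ (grad(theta + eps*v) - grad(theta)) / eps:
    # two gradient evaluations, never the full Hessian.  (Illustrative only;
    # exact products via forward-mode differentiation are preferred in practice.)
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

def cg_minimize(hv_fn, b, max_iters=250, eps=5e-4):
    # Minimize q(p) = 0.5 p'Hp - b'p with linear CG.  Terminates on the
    # fraction-of-improvement of q over a trailing window (cf. Martens)
    # instead of the usual residual-norm test; constants are illustrative.
    p = np.zeros_like(b)
    r = b.copy()                      # residual b - Hp (p = 0 initially)
    d = r.copy()
    rr = r @ r
    phi = [0.0]                       # q(0) = 0
    for i in range(1, max_iters + 1):
        Hd = hv_fn(d)
        alpha = rr / (d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        # q(p) without an extra product, using Hp = b - r:
        phi.append(-0.5 * p @ (b + r))
        rr_new = r @ r
        if rr_new < 1e-20:            # numerical safeguard: fully converged
            break
        k = max(10, i // 10)
        if i > k and phi[-1] < 0 and (phi[-1] - phi[-1 - k]) / phi[-1] < k * eps:
            break                     # relative improvement over last k steps is tiny
        d = r + (rr_new / rr) * d
        rr = rr_new
    return p
```

In the full method, b would be the negative gradient of the training loss at the current parameters, and the returned p would be the (damped) update direction.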
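The Gauss-Newton trick also fits the matrix-free pattern: the curvature-vector product Gv = Jᵀ(Jv) needs only two Jacobian-vector products, and the resulting matrix is positive semidefinite by construction. The sketch below uses an explicit Jacobian for clarity; for a real network, Jv and Jᵀu would be computed with forward- and reverse-mode differentiation instead.

```python
import numpy as np

def gauss_newton_vector_product(J, v):
    # For f(theta) = 0.5 ||r(theta)||^2 the Hessian splits as
    # H = J'J + sum_i r_i * Hess(r_i).  The Gauss-Newton matrix G = J'J
    # keeps only the first term; v'Gv = ||Jv||^2 >= 0, so G is positive
    # semidefinite and CG behaves well.  G itself is never formed.
    return J.T @ (J @ v)
```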
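The damping update can be sketched in a few lines. The reduction ratio rho compares the actual decrease in the objective with the decrease predicted by the quadratic model; the thresholds 1/4 and 3/4 and the factor 3/2 are the ones reported in Martens (2010), though they should be treated as tunable.

```python
def update_damping(lmbda, rho):
    # Levenberg-Marquardt style heuristic for the damping parameter lambda.
    # rho = (actual decrease in f) / (decrease predicted by the quadratic).
    if rho > 0.75:
        return lmbda * 2.0 / 3.0    # model is trustworthy: damp less
    if rho < 0.25:
        return lmbda * 3.0 / 2.0    # model over-promised: damp more
    return lmbda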
References:

James Martens. Deep learning via Hessianfree optimization. ICML 2010

James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with HessianFree Optimization. ICML 2011