| > Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.
And working with these people is so painful. |
| I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model? |
| Discrete linear optimisation is far harder than continuous linear optimisation: the former is NP-complete, while the latter is in P. |
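For a concrete sense of the gap, here is a rough sketch in R (the lpSolve package and the toy numbers are my own choices, not from the comment) of the same small allocation problem solved as a continuous LP and then with an integrality constraint added:

```r
# Toy problem: maximise value subject to a single weight budget.
library(lpSolve)

obj  <- c(8, 11, 6, 4)                   # value of each activity
cons <- matrix(c(5, 7, 4, 3), nrow = 1)  # weight used by each activity
rhs  <- 14                               # total weight budget

# Continuous relaxation: fractional amounts allowed; solvable in polynomial time.
lp_relax <- lp("max", obj, cons, "<=", rhs)
lp_relax$solution; lp_relax$objval

# Discrete version: whole units only. Integer programming is NP-hard in general,
# so the solver falls back on branch-and-bound style search.
lp_int <- lp("max", obj, cons, "<=", rhs, all.int = TRUE)
lp_int$solution; lp_int$objval
```

The only difference between the two calls is `all.int = TRUE`, yet that is exactly the change that moves the general problem from P into NP-hard territory.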
| They actually deal with non-additive "low-order" interactions quite well. In R's mgcv, for example, let's say you had data from many years of temperature readings across a wide geographic area, so your data are (lat, long, year, temperature). mgcv lets you fit a model like `temperature ~ s(lat, long) + s(year) + ti(lat, long, year)`,
where you have (1) a nonlinear two-way interaction (i.e. a smooth surface) across the two spatial dimensions, (2) a univariate nonlinear effect of time, and (3) a three-way nonlinear interaction, i.e. "does the pattern of temperature distributions shift over time?"

You still can't do arbitrary high-order interactions like you can get out of tree-based methods (xgboost & friends), but that's a small price to pay for valid confidence intervals and p-values. For example, the model above will give you a p-value for the ti() term, which you can use as formal statistical evidence that -- at a stated level of confidence -- a spatiotemporal trend exists. This Rmarkdown file (not rendered, sadly) shows how to do this and other tricks: https://github.com/eric-pedersen/mgcv-esa-workshop/blob/mast... |
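A minimal sketch of that kind of fit, on simulated data (the data-generating step, the variable names, and the `d = c(2, 1)` marginal-dimension choice in `ti()` are my assumptions, not from the comment):

```r
library(mgcv)

# Simulated stand-in for the (lat, long, year, temperature) data described above.
set.seed(1)
n <- 2000
temps <- data.frame(lat  = runif(n, 40, 50),
                    long = runif(n, -10, 10),
                    year = sample(1990:2020, n, replace = TRUE))
temps$temperature <- with(temps,
  15 - 0.5 * (lat - 45) + 0.05 * (year - 1990) +   # spatial gradient + warming trend
  0.02 * (lat - 45) * (year - 1990) + rnorm(n))    # the trend itself varies with latitude

m <- gam(temperature ~ s(lat, long) +                    # (1) smooth spatial surface
                       s(year) +                         # (2) smooth effect of time
                       ti(lat, long, year, d = c(2, 1)), # (3) space-time interaction
         data = temps, method = "REML")

summary(m)  # the ti(...) row reports an approximate p-value for the spatiotemporal interaction
```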
| ReLU networks have the nice property of being piecewise linear, but during training they also optimise their own non-linear transformation of the inputs. |
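A toy illustration of the piecewise-linear part: a randomly initialised one-hidden-layer ReLU network in base R, evaluated on a grid (the weights, grid, and tolerance are arbitrary choices of mine):

```r
relu <- function(z) pmax(z, 0)

set.seed(42)
W1 <- runif(5, -2, 2); b1 <- runif(5, -1, 1)  # 5 hidden units, scalar input
W2 <- runif(5, -1, 1); b2 <- 0.1

f <- function(x) sapply(x, function(xi) sum(W2 * relu(W1 * xi + b1)) + b2)

x  <- seq(-3, 3, by = 0.01)
y  <- f(x)
d2 <- diff(diff(y))      # discrete second differences
mean(abs(d2) < 1e-8)     # close to 1: the output is exactly linear almost everywhere,
sum(abs(d2) >= 1e-8)     # apart from the handful of kinks the ReLU units introduce
```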
| When statisticians talk about linear models, they mean linear in the parameters, not in the variables x_0..x_n. So y = a*sin(x) + b is a linear model, because y is linear in a and b. |
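A quick check of that point in R, on simulated data (the true values a = 3 and b = 2 are arbitrary):

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- 3 * sin(x) + 2 + rnorm(200, sd = 0.3)

fit <- lm(y ~ sin(x))  # ordinary linear regression: the design matrix is just [1, sin(x)]
coef(fit)              # recovers the intercept b (about 2) and the coefficient a on sin(x) (about 3)
```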
| Do you have a useful reference for "3)"?
A common problem I encounter in the literature is authors over-interpreting the slopes of a model with quadratic terms (e.g. Y = age + age^2) at the lowest and highest ages. Invariably the plot (not the confidence intervals) will seem to indicate declines (for example) at the oldest ages (random example off the internet: [1]), when really the apparent negative slope is due to quadratic models not being able to model an asymptote.

The approach I've used (when I do not have a theoretically driven choice to work with) is fractional polynomials [2], e.g. x^s where s = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, combined with a strategy for selecting the best-fitting polynomial while avoiding overfitting. It's not a bad technique. I've tried others like piecewise polynomial regression, splines with knots, etc. [3], but I could not figure out how to test (for example) for a group interaction between two knotted splines. Also additive models.

[1] https://www.researchgate.net/figure/Scatter-plot-of-the-quad...
[2] https://journal.r-project.org/articles/RN-2005-017/RN-2005-0...
[3] https://bookdown.org/ssjackson300/Machine-Learning-Lecture-N... |
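A rough sketch of that first-degree fractional-polynomial search on simulated data (the variable names, the simulated log-shaped curve, and the use of AIC as the selection criterion are illustrative choices; the mfp package automates a more careful version of the procedure in [2]):

```r
# Fit x^s for each power in the usual FP set, with the convention x^0 = log(x),
# and pick the transformation by AIC.
set.seed(2)
age <- runif(300, 20, 90)
y   <- 10 + 8 * log(age) + rnorm(300)  # a curve that flattens out rather than turning down

powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)
fits <- lapply(powers, function(s) {
  xs <- if (s == 0) log(age) else age^s
  lm(y ~ xs)
})
aics <- sapply(fits, AIC)
data.frame(power = powers, AIC = round(aics, 1))
powers[which.min(aics)]  # should favour s = 0 (the log) here, rather than forcing a quadratic
```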
| It sounds like you are assuming the joint distribution of returns in the future is equal to that of the past, and assuming away potential time dependence.
These may be valid assumptions, but even if they are, "sample size" is always relative to the between-sample-unit variance, and that variance can be quite large for financial data. In some cases it is even infinite! On the relativity of sample size, see e.g. this upcoming article: https://two-wrongs.com/sample-unit-engineering |
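As a small illustration of the variance point (simulated data, not from the comment): with an infinite-variance distribution the sample variance never settles down, so a nominally large sample buys far less precision than the raw count suggests.

```r
set.seed(3)
n <- 50000
benign <- rnorm(n)       # finite variance, the benign benchmark
heavy  <- rt(n, df = 2)  # Student t with 2 df: infinite variance, heavy tails

checkpoints <- c(1000, 5000, 10000, 25000, 50000)
round(sapply(checkpoints, function(k) var(benign[1:k])), 2)  # hovers near 1
round(sapply(checkpoints, function(k) var(heavy[1:k])), 2)   # drifts and jumps as new extremes arrive
```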
| They may have been referring to (for example) reported financial results or news events, which are much rarer but can have an outsized impact on market prices. |
| Are there some papers in particular that you're referring to? Does the second descent happen after the model becomes overparameterized, like with neural nets? What kind of regularization? |
| [Submitted on 24 Mar 2023]
Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle
Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo
https://arxiv.org/abs/2303.14151

Double descent is a surprising phenomenon in machine learning, in which as the number of model parameters grows relative to the number of data, test error drops as models grow ever larger into the highly overparameterized (data undersampled) regime. This drop in test error flies against classical learning theory on overfitting and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data, the dimensionality of the data and the number of model parameters. Here, we briefly describe double descent, then provide an explanation of why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when simultaneously all present, together create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that double descent does not occur when any of the three factors are ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available |
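A self-contained sketch of the linear-regression version of the phenomenon; the synthetic setup (400 true features, 50 training points, fitting only the first p features with a minimum-norm least-squares solution) is my own illustration, not the paper's actual experiments:

```r
set.seed(4)
D <- 400; n_train <- 50; n_test <- 2000
beta <- rnorm(D) / sqrt(D)                 # true signal spread across all D features

X_tr <- matrix(rnorm(n_train * D), n_train, D)
X_te <- matrix(rnorm(n_test  * D), n_test,  D)
y_tr <- X_tr %*% beta + rnorm(n_train, sd = 0.1)
y_te <- X_te %*% beta + rnorm(n_test,  sd = 0.1)

min_norm_fit <- function(X, y) {           # pseudoinverse solution: plain OLS when
  s <- svd(X)                              # p < n, minimum-norm interpolator when p > n
  keep <- s$d > 1e-10
  s$v[, keep] %*% ((t(s$u[, keep]) %*% y) / s$d[keep])
}

ps <- c(2, 5, 10, 25, 40, 48, 50, 52, 60, 100, 200, 400)
test_mse <- sapply(ps, function(p) {
  b_hat <- min_norm_fit(X_tr[, 1:p, drop = FALSE], y_tr)
  mean((y_te - X_te[, 1:p, drop = FALSE] %*% b_hat)^2)
})
data.frame(p = ps, test_mse = round(test_mse, 3))
# Test error spikes as p approaches n_train (the interpolation threshold) and then
# falls again as p grows well past it: the second descent.
```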