Open, rigorous and reproducible research: A practitioner’s handbook (2021)

Original link: https://stanforddatascience.github.io/best-practices/index.html

This summary outlines “Open, rigorous and reproducible research,” a practical guide to adopting best practices that make scientific research more transparent, more reproducible, and more valuable overall. The handbook offers guidance on designing studies carefully, using sound data-analysis methods, and ensuring that findings are relevant. Key elements include careful study design, transparent data generation, and a reliable statistical analysis plan. It also emphasizes making data and code accessible to everyone and fostering collaboration among researchers. Appendices provide a list of frequently asked questions, discipline-specific considerations, and links to useful resources. Its practical framing makes it easy to follow and suitable for individuals across disciplines. Overall, the guide aims to support the progress of open science and promote trustworthy results by building transparency, thoroughness, and accountability into every stage of the research process.

Summary: In response to recent controversies surrounding Stanford University, particularly the fraudulent conduct attributed to its former president Marc Tessier-Lavigne, many have raised reasonable concerns that drawing on resources associated with the Stanford data science program, including this best-practices handbook, might imply endorsement of, or association with, tainted leadership. Others counter that a university and its academic units should be viewed separately from its administration, and stress the importance of weighing the quality and applicability of the content itself rather than the surrounding politics. Ultimately, while timing and personal associations may shape individual views, making use of high-quality, reliable materials remains essential to advancing one’s field. Even so, potential reputational implications and accountability measures still deserve careful consideration.

Original text

This book starts from the premise that there is a lot we can all do to increase the benefits of research.

Let’s consider the main limitations of research that is not carried out and shared in an open, transparent, and reproducible way:

  • If papers are published in venues that are only available to those who pay for access, the vast majority of the world will not be able to see the output of all the work that went into producing them; this limits the potential reach and benefit to others.

  • Because of the complexity involved in many analyses, it is nearly impossible to describe every detail and choice that went into an analysis in the main paper; without accompanying code, it can be very difficult for others to be certain about exactly what was done.

  • Even if code is made available, there can be additional challenges to reproducing or re-analyzing past work, such as inaccessible data or deprecated software.

  • If others are not able to easily re-analyze past work, that limits the ability of the community to explore other analysis pathways, combine datasets, attempt to generalize experiments to new settings, etc.

  • If experiments are carried out without proper care in experiment design and analysis, there are likely to be more erroneous findings in the literature, making it harder for everyone to make sense of the object of study.

  • The more that new researchers have to wade through results that may not be credible, the more they are delayed from making genuine advances.

Of course, there are numerous reasons why people don’t put more effort into making their work open, transparent, and reproducible:

  • Perhaps most importantly, doing so does require some additional work, and current incentive structures do not necessarily reward these efforts; however, this is changing in many fields, and certain communities place a lot of value on such things. Moreover, the cost of mistakes can be high, and this sort of openness helps to avoid them.

  • Some data is legitimately not possible to share, due to concerns about privacy, copyright, or other considerations. Work that uses such data will generally be less useful to the world than work based on more open data, but some research will of course require it. However, there are still things that can be done to avoid the worst problems, including being transparent about the analyses carried out and the protocol used to collect the data, as well as techniques such as pre-registration, which can bolster people’s confidence in a piece of work.

  • Many people worry that making their data and code open to the world will expose them to risk or ridicule, either because they fear they have made mistakes, or they think it will reveal them to be a poor coder. This is understandable, but generally misplaced. It is better to catch errors early. Moreover, most people will be happy if you share any code, no matter how bad it is, and doing so is one of the best ways to improve, especially if you begin with the end in mind.

  • Finally, many people don’t know where to start. Most guides to open science and reproducibility take the form of complete books or courses, and try to teach an entire philosophy and comprehensive approach to research, which can be overwhelming.

In this document, we take a different approach. Our main goal here is to show how there are many ways to make your research more open, transparent, and reproducible on the margin, and that each step in that direction may bring some benefit. While there will always be nuances and requirements specific to each field, in general there is a great deal that we can learn from each other, and most ideas can be applied to any domain.

In summary, this handbook is a guide to making science more open, transparent, and reproducible by presenting best practices in a way that is:

  • modular: individual ideas can be used separately or combined
  • practical: focused on the most tractable and impactful practices
  • general: applicable to any field that works with data and statistical analysis
  • concise: aimed at busy scientists who don’t have time to take a full course right now

We break this guide down into three main sections. Each section contains many modular components, each of which can be considered and used independently or in combination with the others:

  • Section 1: Careful study design to help ensure and demonstrate that results and conclusions are valid and useful:

    • Thoughtful determination of experimental parameters, such as using power analysis to estimate an appropriate sample size (a brief sketch follows this outline)
    • Distinguishing between exploratory and confirmatory research
    • Pre-analysis planning of statistical analyses
    • Ensuring that all relevant data is collected in order to be comparable with past work
    • Additional considerations, such as pre-registration, planning for potential problems, and consideration of ethical implications.
  • Section 2: Adopting best practices in analyzing data and reporting results:

    • Preliminary: decisions and considerations before working with any data.
    • Statistical analysis plan: plan your analytic approach beforehand.
    • Data generation: generate an appropriate set of data.
    • Data preparation: transparently prepare your data for data analysis.
    • Data visualization: visualize all data using informative visualizations.
    • Data summarization: summarize all data using appropriate statistics.
    • Data analysis: analyze all data and avoid common blunders.
    • Data analysis - medicine: a few more considerations for medical research.
    • Statistical analysis report: report transparently and comprehensively.
    • Examples: published literature exemplifying principles of this manual.
  • Section 3: Making relevant research materials available to all:

    • Open Data: making the raw data available for further research and replication
    • Open Source Code: making the analysis pipeline transparent and available for others to borrow or verify
    • Reproducible Environments: making not just the data and code available to others, but also making it easy for them to re-run the analysis in a reproducible manner (a brief sketch follows this outline)
    • Open Publication Models: publishing such that anyone can see the scholarly output associated with the work
    • Documenting Processes and Decisions: making it clear to interested parties not only what was done and how, but also why, by mechanisms such as open lab notebooks
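
To make the power-analysis item in Section 1 above a bit more concrete, here is a minimal sketch of estimating a per-group sample size for a two-sample t-test using Python’s statsmodels library; the handbook itself does not prescribe this code, and the effect size, significance level, and target power below are illustrative placeholders rather than recommendations.

```python
# Minimal sketch of a power analysis for a two-group comparison.
# All numbers below are illustrative placeholders, not recommendations.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # assumed standardized effect size (Cohen's d)
    alpha=0.05,               # significance level
    power=0.80,               # desired probability of detecting the effect
    ratio=1.0,                # equal group sizes
    alternative="two-sided",
)
print(f"Approximate sample size required per group: {n_per_group:.0f}")
```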
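
Similarly, for the Reproducible Environments item in Section 3, one small step in that direction is to record the exact software versions and random seed used in an analysis. The snippet below is a hypothetical illustration, not a mechanism described in the handbook; the package list and output file name are placeholders.

```python
# Minimal sketch: snapshot the software environment and fix a random seed
# so that an analysis can be re-run under comparable conditions.
import json
import platform
import random
import sys
from importlib import metadata

SEED = 12345  # fixed seed so any stochastic steps are repeatable
random.seed(SEED)

def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

snapshot = {
    "python": sys.version,
    "platform": platform.platform(),
    "seed": SEED,
    # Illustrative placeholder list of packages the analysis depends on.
    "packages": {name: installed_version(name)
                 for name in ["numpy", "pandas", "statsmodels"]},
}

# Save alongside the analysis outputs so others can reconstruct the setup.
with open("environment_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```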

In addition, appendices cover frequently asked questions, discipline-specific considerations, and links to additional resources (of which there are plenty!).

Authors

Dallas Card is a postdoctoral scholar with the Stanford NLP Group and the Stanford Data Science Institute. He received his Ph.D. from the Machine Learning Department at Carnegie Mellon University.

Yan Min is a Ph.D. candidate working with Mike Baiocchi as part of the Department of Epidemiology and Population Health at Stanford University. She has previously completed her medical studies in China.

Stylianos (Stelios) Serghiou is an AI Resident at Google Health, working on using modern methods of data science to empower patients, doctors, and clinical researchers. He received his Ph.D. in Epidemiology and Clinical Research and his Master’s in Statistics from Stanford University, where he was advised by John Ioannidis. He previously completed his medical training at the University of Edinburgh, UK.

Acknowledgements

We would like to thank all early readers of this work, whose feedback we sincerely appreciate. We are especially thankful to the Stanford Data Science Initiative community, Russ Poldrack, John Chambers and Steve Goodman.
