# Peer review and gender bias: A study on 145 scholarly journals – Science Advances

## Abstract

Scholarly journals are often blamed for a gender gap in publication rates, but it is unclear whether peer review and editorial processes contribute to it. This article examines gender bias in peer review with data for 145 journals in various fields of research, including about 1.7 million authors and 740,000 referees. We reconstructed three possible sources of bias, i.e., the editorial selection of referees, referee recommendations, and editorial decisions, and examined all their possible relationships. Results showed that manuscripts written by women as solo authors or coauthored by women were treated even more favorably by referees and editors. Although there were some differences between fields of research, our findings suggest that peer review and editorial processes do not penalize manuscripts by women. However, increasing gender diversity in editorial teams and referee pools could help journals inform potential authors about their attention to these factors and so stimulate participation by women.

## INTRODUCTION

The academic publishing system shows a systematic underrepresentation of women as authors, referees, and editors (1). This underrepresentation is persistent (2, 3) and well documented in various fields of research (4–6). While some previous studies have found no substantial productivity gap in specific fields, in more recent cohorts of academics or when using less biased output measures (7, 8), a recent study of 1.5 million academics suggested that the relative increase of participation of women in science, technology, engineering, and mathematics (STEM) fields over the past 60 years has not reduced the gap in women’s academic productivity and impact (9). Even in fields such as the humanities, psychology, and the social sciences, where the gender composition of the community has been more favorable to women for decades, men still publish more manuscripts and in more prestigious journals (10, 11). In the current hypercompetitive academic environment, such a publication gap could explain why women often have a higher probability of dropout from academia, fewer grants, lower salaries, and less prestigious careers (1, 12).

In this context, scholarly journals are often blamed for this gender gap (13, 14). However, whether peer review and journal editorial processes are the root cause of these gender penalties is disputed (15). On the one hand, recent reports from journals in specific fields, especially in political science, suggest that editorial processes do not discriminate against women (16–19). Similarly, a recent study of four leading journals in economics found negligible effects of gender on the assessment of manuscripts (20). On the other hand, recent research in other fields, such as ecology, found that manuscripts submitted by women as first authors received slightly worse peer review scores and were more likely to be rejected after peer review (21). While the publication gap between men and women is generally explained by persistent differences in submission rates by women in almost all fields of research, it is unclear whether peer review and editorial processes contribute to it.

Furthermore, the fact that women are systematically less involved in peer review and are rarely appointed to prestigious editorial positions (13, 14, 22) could influence women’s perceptions of their adequacy and potential success as authors. For instance, recent research suggests that women would submit fewer manuscripts, of comparably higher quality than those written by men, because they anticipate possible editorial bias and invest more in their manuscripts (23). A recent survey of a sample of 2440 American Political Science Association members revealed that women prefer not to target certain journals as they perceive that they will have lower chances than men with similar expertise (24).

Unfortunately, establishing whether peer review and editorial processes have any direct or indirect effect on the lower rate of publications by women is difficult (21, 25). It is likely that previous research did not achieve a consensus in findings because data were either case specific or could not capture all the internal steps at journals that might reveal potential bias. The fact that research has never been performed at a scale sufficient to provide insights in different fields of research and journal contexts has made comparison difficult and has not helped to understand whether specific models of peer review, e.g., single versus double blind, could trigger gender bias (26).

Our study aims to fill this gap by providing the first in-depth analysis of peer review and editorial processes in a large sample of scholarly journals in different fields of research that considers editorial processes as a set of interlinked decisions. We concentrated on three possible sources of bias, i.e., the editorial selection of referees, referee recommendations, and editorial decisions, and examined all their possible relationships while controlling for important confounding factors such as journals’ field of research, impact factor, and single- versus double-blind peer review. Because of an agreement on data sharing with some of the largest scholarly publishers (27), we collected complete and fully comparable temporal data on 145 scholarly journals, including almost 350,000 submissions by about 1.7 million authors and more than 760,000 reviews performed by about 740,000 referees (see Materials and Methods). To the best of our knowledge, this is the first study that includes data on manuscripts and reviewer scores across journals from different publishers and fields of research of sufficient depth to assess whether peer review and editorial process contribute to the gender gap in publications.

## RESULTS

Table 1 shows the distribution of journals by fields of research in our sample, the proportion of women among authors, and other summary statistics. Our data confirmed previous research on gender disparities in manuscript submissions and peer reviewing (13, 14, 18, 28), with 75% of men among submission authors and 79% of men among referees. As expected, we found differences between journals from different research fields, with the greatest gender gap in the rate of women among authors and referees in physics (only 19% women as authors). In addition, women are less involved in peer review compared to their authorship rate in all domains except for social sciences (38% women as authors and referees; Table 1). While this could reflect the different rate of adoption of diversity and inclusion policies in some journals, it is more probable that these distortions simply reflect differences in the gender composition of the potential pool of authors and referees, which is impossible to estimate.

Table 1 Number of journals and frequency distribution of selected sample characteristics by field of research.

Figure 1 shows an overview of the distribution of the final editorial decisions on manuscripts by gender of the first and last author and field of research. This picture suggests a certain degree of diversity among fields, e.g., manuscripts by women would be accepted more frequently in biomedicine, health science, and social science journals, and less frequently in life science journals. However, these descriptive statistics do not allow us to consider the potential effect of important covariates such as the journal’s impact factor, the number of coauthors, and the review scores, which would be essential in untangling potential sources of bias during the editorial and peer review process. Note that data on desk rejections were not consistently available, and so, we concentrated on manuscripts that were not desk-rejected by editors.

To examine these processes more systematically, we performed robust statistical analysis within a Bayesian framework and estimated different models on the dataset (see Materials and Methods). We first looked at the editorial process by considering each of the following steps separately: (i) the editorial selection of referees, (ii) the referee recommendations, and (iii) the editorial decision on the manuscript. All these steps included specific actions performed by either referees or editors that could reveal a bias. Following previous research and based on data availability, we considered both the position of women in the author list (i.e., whether they were first or last authors) and the proportion of women among the authors as main predictors (10, 13, 21) while controlling for the proportion of women among the referees, the impact factor of the journal, the number of authors in each manuscript, and the type of peer review adopted by the journal (29, 30).

Given that the effect of many of these variables, and crucially of the gender of first and last authors, is likely to be different in each field of research, we estimated separate models for each field. This allowed us to consider field specificities, including the journal prestige and potential diversity of evaluation standards, through in-depth data that have never been available before in this type of research (15). We then built a Bayesian-learning network model (31) to estimate the effect of complex interactions more systematically and understand the extent and persistence of gender bias across all steps of the editorial process (see Materials and Methods).

Regarding the editorial selection of referees (step i), we found that in all fields of research, manuscripts with a higher proportion of women among the authors were more often reviewed by women referees (see table S1). This is consistent with previous research (13) and was confirmed after controlling for the number of authors in the manuscript, the journal’s impact factor, and the type of peer review model (single versus double blind). Whether such author-referee gender matching is due to any intentional preference or deliberate practice of journal editors or simply reflects an unequal distribution of men and women in expertise and fields of research is beyond the scope of this study. Our findings here simply indicate that manuscripts by women were not treated differently because of being predominantly reviewed by men.


Furthermore (step ii), we found that manuscripts by women received systematically more positive reviews in biomedicine and health sciences, as well as in social sciences, whereas they were less positively treated in life sciences (weak statistical effect) and physical sciences journals (strong statistical effect). Women tended to provide more positive recommendations than men in all fields but physical sciences. This effect was consistent after controlling for all other variables and can therefore not be explained by the gender matching between referees and authors or other factors (see table S2). The fact that our model could only explain a small fraction of the outcome variance (between 4 and 11%, depending on the field of research), though many model coefficients were significant, suggests that other manuscript characteristics that we could not measure, such as its quality and content, had the strongest effect on referee recommendations. This effect was independent of any editorial matching or referee selection options.

To check whether our results were robust in taking into consideration alternative specifications of our gender variable, we estimated two further models that considered (i) whether a woman was first or last author of a manuscript (see table S3) and (ii) whether effects were different for our five mutually exclusive groups of authors: a man sole author, a woman sole author, all-men teams, all-women teams, and co-ed teams of authors (see table S4) (10). In general, our results show that the author gender did not have a consistent effect, although we found the emergence of complex patterns of interaction when the field of research of journals and the specific composition of author groups were taken into account (for a more systematic analysis of these complex interactions, see the Bayesian-learning network below).

Regarding the final editorial decisions (step iii), we found that manuscripts with a higher proportion of women among authors were accepted more frequently in biomedical, health sciences, and physical sciences journals (strong statistical effect), whereas no evidence of any effect of the gender variable was found in life sciences and social sciences journals. Note that in case of biomedical and physical sciences journals, the positive effect was robust across variation of contexts and controlling for the referee recommendations and the journal’s field of research (Table 2). Furthermore, considering the review scores (for details on referee recommendations, see Materials and Methods), our models were able to explain over 80% of the outcome variance.

Table 2 Logistic mixed-effects models on the final editorial decision (accept) by field of research using the gender ratio as predictor.

Mean estimate, 95% CI, and Bayes factor (β > 0) are reported for each variable.

Alternative specifications of the gender variable did not lead to any systematic difference in the gender effects mentioned above, although they produced less clear-cut results than the previous models (Table 3). When we considered the gender of the first author, we found that manuscripts by women were more favorably treated in physical sciences journals (strong statistical effect) and less in life sciences journals (weak statistical effect). Being the last author had no significant effect on acceptance, except for a weak negative effect in case of biomedical and health sciences journals. We did not find any systematic bias against manuscripts submitted by women across journals and disciplines when considering the author groups mentioned above (see table S5).

Table 3 Logistic mixed-effects models on the final editorial decision (accept) by field of research using the first and last author’s gender as predictors.

Mean estimate, 95% CI, and Bayes factor (β > 0) are reported for each variable.

Last, to consider the whole editorial process in which indirect opportunities for bias may exist and assuming that complex interactions among variables could affect editorial decisions, we estimated a Bayesian-learning network including all the previous steps of the analysis. After learning coefficients and conditional probabilities through maximum likelihood estimation, our model was able to predict with 82% accuracy whether or not a manuscript would ultimately be accepted by the editor (see Materials and Methods). Figure 2 shows that after controlling for all direct and indirect effects of all variables, the effect of authors’ gender on referee recommendations depended on the field of research. While manuscripts with a higher proportion of women among authors received slightly more positive recommendations in journals from social sciences and biomedical and health sciences, referee recommendations were slightly more negative for manuscripts submitted to life and physical sciences journals. However, even when comparing the extreme cases where manuscripts were authored exclusively by women or men, our model predicted a change in review scores by less than 4%, showing that these effects were minimal.

Note that while the directionality of paths is necessary to estimate path coefficients in Bayesian networks, the direction of arrows does not necessarily imply causation (32). Variables on a path between two other variables are equivalent to mediating/moderating variables in statistics. For instance, the Bayesian network identified certain paths systematically leading to a higher probability of manuscript acceptance: While, as expected, the highest path coefficients for the prediction of an acceptance were the review score, a higher proportion of women as referees in interaction with a high proportion of women as authors also predicted whether a manuscript was accepted.

Tables S10 and S11 show further statistical tests on some interactions shown in the Bayesian network. We found that manuscripts written by women received better reviews when reviewed by other women in all scientific fields, although the effect was weak in case of manuscripts submitted to journals in life sciences. Manuscripts by women generally received worse reviews in social science journals using single-blind peer review (see table S10), but these journals are the minority in a field typically dominated by double-blind peer review. We also examined whether manuscripts written by women needed to be of higher quality to be published, by checking whether there was a negative interaction effect between authors’ gender and the review score on the editorial decision. We found that such an interaction exists only in case of journals in biomedical and health sciences, while we found only weak effects in the case of journals in social sciences (see table S11).

Although we could not directly estimate the intrinsic quality of manuscripts (if this is possible even in theory), we used the recommendations of referees as a control variable for quality to identify bias in the editorial decision. Our results indicated no statistical gender gap in acceptance rates. The Bayesian-learning model found that, after controlling for all other variables (including the recommendations), manuscripts by women were more likely to be accepted in journals of all disciplines except social sciences, where we did not find any significant gender difference. To quantify the effect of gender, we used the model to predict the final acceptance of all manuscripts in our dataset under the hypothetical scenarios that all authors were either men or women. In case of biomedical and health sciences journals, manuscripts written by women were predicted to be 5% more likely to be accepted than manuscripts written by men (manuscripts by women were predicted to be accepted in 45% of cases). In the case of life and physical sciences journals, this difference decreased to 1.5% (for women, the prediction was 53% in both fields), and in the case of social sciences journals, it was close to zero (with a predicted overall acceptance of 38% of manuscripts). This suggests that the advantage for manuscripts by women is smallest in the field of research where the ratio of women among authors is the highest (38% in social sciences versus 19% in physical sciences). Figure 3 shows the predicted editor decisions by authors’ gender, controlling for different review scores. Last, the Bayesian-learning network further confirmed a systematic effect of gender on the match of authors and referees.

Given that peer review typically includes multiple rounds of revision, we also looked at the extent to which the length of the revision process could be influenced by the gender of authors and referees. Table 4 shows the estimates of a Poisson regression, which predicted the number of revision rounds that any manuscript eventually underwent before publication. We did not find any effect of gender on the number of required rounds of revision before publication. With the exception of journals in social sciences, the more women among the reviewer team, the higher the probability of more rounds of revisions before publication.

## CONCLUSIONS

Although we could not perform a large-scale, multi-journal randomized experiment and worked only on existing journal data, our findings indicate that manuscripts submitted by women or coauthored by women are generally not penalized during the peer review process. We found that manuscripts by all-women or cross-gender teams of authors had an even higher probability of success in many cases. This is especially so in journals in biomedicine, health, and physical sciences, thereby confirming previous research (16, 18, 22).

However, given that we did not have an objective or predefined estimation of the quality of manuscripts (if any) and could use only referee recommendations as an indication, this positive inclination by referees and editors could simply reflect some intrinsic characteristics of the manuscripts. Previous research suggests that women could be inclined to invest more in their manuscripts to prevent expected editorial bias (10, 33), which could also explain why they submit fewer manuscripts (18, 23, 24, 28). In this respect, the fact that manuscripts by cross-gender teams of authors received systematically more positive treatments in our sample could even reveal an exploitation opportunity by men, who benefit from collaborating with women colleagues.

Unfortunately, while the potential positive effect of higher inclusion of women in scientific networks has also been found in other studies (10, 34), our dataset did not permit us to control any possible distortions in the potential pool of authors and referees available in each journal, age cohorts, or other (institutional/personal) status characteristics. Therefore, it is impossible to understand whether these potentially positive effects penalize older women and/or authors from less prestigious institutions (14). This also applies to the gender matching of authors and referees, which is in line with previous research (13). Rather than reflecting any editorial bias, this could simply reveal a gendered concentration of expertise in specific fields or a downstream effect of gendered patterns of citations (e.g., women/men authors citing in their manuscripts more references from women/men, who are possibly used by editors for referee selection).

It is worth noting that besides the lack of an objective measure of the quality of manuscripts, which is problematic and probably even impossible to establish consistently across fields, there are potentially important factors that are not included in our dataset. Some of them could be at least potentially minimized with extensive data search, such as the effect of authors’ academic affiliation; others are impossible to capture, such as the role of authors’ seniority and reputation, especially considering the scale and the cross-discipline nature of our dataset. For instance, it is extremely difficult to estimate the gender composition of various communities to calculate the potential pool of authors and referees in each journal, while we do not have robust proxies of authors’ investment in manuscripts to estimate gender differences in submissions and volume of output (23).

In any case, our findings do not mean that peer review and journals are free from biases. For instance, the reputation of certain authors and the institutional prestige of their academic affiliation, not to mention authors’ ethnicity or the type of research submitted, could influence the process, and these factors could also have gender implications (30, 35). Here, data on the demographic composition of each disciplinary community and data on the invitation and acceptance to review at the journal level could help to complete our picture. On the other hand, these distortions could reflect built-in gendered norms and expectations, which could then persist and be reproduced either consciously or not, even when their expected “true” effects have disappeared (33). Considering the persistent and usually non-acknowledged obstacles that women still face in hyper-competitive academia (36), these expectations would be consistent even if the editorial processes of a set of journals were not objectively biased against women (24).

Our findings suggest that promoting more gender diversity in editorial teams and pools of referees could help scholarly journals to inform potential authors and referees about their attention to these factors and to stimulate the inclusion and participation of women (24, 37, 38). While diversity is beneficial for science and innovation per se (37), in this case, it would also be a signal that could contribute to reshaping the social construction of gender categories in academia and help scholarly journals to increase submission rates by women. Unfortunately, our research could not examine these complex expectations and norms characterizing academic life across all its spectrum, including academic choices of priorities and specialties (5, 39), and educational stereotypes (40).

As previously stated, our aim was to concentrate on peer review, which is an important process determining the quantity and prestige of scholars’ publications, while contributing to shaping their reputation in the community. However, studies capable of combining academic standards of promotion and the effect of author prestige and institutional affiliation on editorial process in scholarly journals are required to examine the complex nexus of gender discrimination (and even other sources of bias) in academia (33), including reconstructing the gender gap–gender bias link in a comprehensive manner. This, in turn, raises the problem of data availability (26). While data sharing on editorial processes of journals should be encouraged more systematically on a large scale with collaboration between publishers and independent research groups (27, 41, 42), examining structural mechanisms that determine academic opportunities requires data integration from various sources (i.e., funding agencies, academic institutions, and scholarly citation databases). Only collaborative efforts on data sharing by various stakeholders will help us to grasp all the pieces of this gender puzzle.

## MATERIALS AND METHODS

### Data overview

Our dataset included internal data for 157 scholarly journals between 2010 and 2016, of which 61 were in biomedicine and health, 50 in physical sciences (including engineering and computer science), 24 in life sciences, and 22 in social sciences and humanities. Details on journal selection and the protocol for data sharing are provided in the Supplementary Materials. Data consisted of all actions or events performed by one of the journal editors, such as inviting referees, receiving reviews, or deciding about manuscripts. They included 753,909 submitted manuscripts, of which 389,431 (51.7%) were sent out to referees.

To ensure better comparability of peer review and editorial standards, in our analyses, we only considered journals included in the Journal Citation Report based on the Web of Science (WoS) and with an impact factor (98% of our observations, see fig. S1). The resulting dataset included 145 journals and 348,223 submissions. Because of a few missing observations in the data, the actual number of complete observations used in the analysis was 348,118 (Table 1). These included a total of 1,689,944 authors and 745,693 referees, with an average of 2.1 completed reviews per manuscript.

The dataset includes the following variables: Manuscript ID, unique manuscript identifier; SubmissionDate, initial submission date; JournalID, unique journal identifier; ScientificArea, journal’s field of research (scientific area); PRType, peer review type; IFRounded, journal’s impact factor rounded to integer (this was to ensure journal’s anonymity); nAuthors, number of authors; NumRounds, number of review rounds; Agreement, referee agreement score; nRev, number of referees; RevScore, review score; AutRatFem, ratio of women authors; RevRatFem, ratio of women referees; FirstAuthorGender, gender of the first author; LastAuthorGender, gender of the last author; FinalDecision, final editorial decision.

The number of manuscripts reviewed by these journals was approximately constant over time, with about 50,000 editorial decisions per year, and a majority of records from physics and biomedicine and health journals (see fig. S2). Given that we aimed to focus on the peer review process, we considered each submitted manuscript as our unit of analysis. Statistics showed that the proportion of accepted papers varied across scientific fields, from 51.9% in life sciences to 37.7% in social sciences (see fig. S3).

Referee recommendations were combined so that a review score and an agreement score were calculated for each manuscript (29). While in (29) the review score was bounded in the [0,1] interval, we multiplied it by 100 to make estimates in the tables more informative. The review score was calculated independently of the number of referees, with higher values reflecting more positive referee recommendations. Following (29), the agreement score was calculated in the same interval, with higher values meaning a stronger agreement between referee recommendations.

More specifically, to calculate review scores, we first recoded each referee recommendation (which sometimes appeared as nonstandard expressions in our database) in a standard ordinal scale: reject, major revisions, minor revisions, accept. We then derived the set of all possible unique combinations of recommendations for each manuscript (from now on, the “potential recommendation set”). Using this set, we counted the number of combinations that were clearly less favorable (#worse) or more favorable (#better) than that actually received by the manuscript (e.g.,{accept, accept} was clearly better than {reject, reject}). Last, we calculated the score of each manuscript as follows

$\text{reviewScore} = \frac{\#\text{worse}}{\#\text{better} + \#\text{worse}} \qquad (1)$
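The counting procedure behind Eq. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `review_score`, the numeric coding in `SCALE`, and in particular the reading of "clearly less/more favorable" as element-wise dominance between sorted combinations are assumptions made for illustration. The score is returned in [0, 1]; the text above rescales it by 100.

```python
from itertools import combinations_with_replacement

# Ordinal coding of the standard scale (higher = more favorable).
# The numeric values are an illustrative assumption.
SCALE = {"reject": 0, "major revisions": 1, "minor revisions": 2, "accept": 3}


def review_score(recommendations):
    """Return #worse / (#better + #worse) for one manuscript, in [0, 1]."""
    if not recommendations:
        raise ValueError("at least one referee recommendation is required")
    actual = tuple(sorted(SCALE[r] for r in recommendations))
    worse = better = 0
    # Potential recommendation set: every unordered combination that the
    # same number of referees could have produced.
    for combo in combinations_with_replacement(range(len(SCALE)), len(actual)):
        if combo == actual:
            continue
        # ASSUMPTION: a combination is "clearly" worse (better) only if it is
        # element-wise <= (>=) the actual one after sorting; combinations
        # that are better on one referee and worse on another count for neither.
        if all(c <= a for c, a in zip(combo, actual)):
            worse += 1
        elif all(c >= a for c, a in zip(combo, actual)):
            better += 1
    return worse / (better + worse)
```

Under this reading, `review_score(["accept", "accept"])` is the maximum of the scale and `review_score(["reject", "reject"])` the minimum, matching the example given in the text.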

Note that while (29) calculated a disagreement score, here, we used an agreement score for each manuscript, i.e., one minus the number of referee recommendations that should be changed to reach a perfect agreement between referees divided by the number of referees assigned to the manuscript. This permitted full comparability between manuscripts receiving a different number of reviews.
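As a rough illustration of this definition, the sketch below computes the agreement score in Python. The function name `agreement_score` is hypothetical, and the sketch assumes that "the number of recommendations that should be changed" means the minimum number, i.e., the referees who did not give the most common recommendation.

```python
from collections import Counter


def agreement_score(recommendations):
    """Return 1 - (minimum changes needed for unanimity) / (number of referees)."""
    n = len(recommendations)
    if n == 0:
        raise ValueError("at least one referee recommendation is required")
    # ASSUMPTION: the fewest changes are obtained by moving every referee
    # to the most frequent recommendation.
    changes = n - max(Counter(recommendations).values())
    return 1 - changes / n
```

For example, two referees who disagree give a score of 0.5, while any unanimous set gives 1 regardless of how many referees there are.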

### Statistical analysis

We estimated our mixed-effects models using the R 3.6.1 platform (43). Our plots were generated using the ggplot2 package on the same platform. In all linear and logistic mixed-effect models, we included random effects for journals. We tested all model specifications including nested random effects for journals by considering the potential distortions due to sampling by publishers and found no effect on results. To comply with the data sharing protocol, we did not report details here to avoid journal identification. Mixed-effects models were estimated using the brms package (44) and are the outcome of four independent chains, each including 10,000 iterations (5000 burn-in + 5000 sampling). To ensure that the estimates are reliable, we checked that all scale reduction factors ($\hat{R}$) (45) were below 1.01. In each table, we reported the coefficients’ mean estimates, 95% credible intervals (CIs), and the Bayes factor corresponding to the hypothesis β > 0. The interpretation of Bayes factors was done following the recommendations in (46). To compute the proportion of variance explained by the models (pseudo-$R^2$), we used the approach proposed in (47). All models used flat priors with a zero mean for all model parameters.
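To make the convergence check concrete, the following Python sketch computes the classic potential scale reduction factor from the post-burn-in draws of several chains for one parameter. It is only an illustration: the function name `gelman_rubin_rhat` is hypothetical, and the exact variant computed by the software used in the study may differ (for instance, split-chain or rank-normalized versions).

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for one parameter.

    `chains` is a list of equally long lists of post-burn-in draws.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    # W: average of the within-chain sample variances.
    within = mean(variance(c) for c in chains)
    # B / n: sample variance of the chain means.
    between_over_n = variance([mean(c) for c in chains])
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * within + between_over_n
    return sqrt(var_plus / within)
```

Chains that explore the same distribution give a value close to 1, whereas chains stuck in different regions inflate the between-chain term and push the value well above a threshold such as 1.01.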

### Bayesian network

Our analysis followed a previous study on network effects on peer review in four journals (29). Building a Bayesian network was pivotal in modeling complex interactions between variables and potential indirect paths of bias (31). We selected this method over alternative machine learning techniques (e.g., neural networks) as it allowed us to generate a directed acyclic graph that was more appropriate to examine the structure of relations characterizing the editorial process. Furthermore, this graph permitted us to calculate the probability of an event (e.g., a rejection) depending on the value of other variables of interest (e.g., all authors being men).

The Bayesian network was estimated using the bnlearn package. We first trained the network on a random sample of 80% of all available manuscripts, while the other 20% were used as independent test data for model validation. Note that all nodes corresponded to the variables used in the statistical models presented in the main text. The structure of the Bayesian network and the direction of influence were learned through various constraint- and score-based structure learning algorithms. All algorithms resulted in structurally similar graphs, which were then aggregated in one network by including all links learned by at least 70% of structure learning algorithms. Figure 2 shows the resulting network. Note that we only imposed restrictions on the structure learning algorithms such that links pointing from the review score and the editorial decision to any of the other nodes were not allowed, as were any links that were chronologically impossible.
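The train/test split and the aggregation rule described above can be sketched in Python as follows. This is not the bnlearn API (bnlearn is an R package) and not the authors' code: the function names, the representation of a graph as a set of `(parent, child)` pairs, and the way forbidden links are passed in as a `blacklist` are all assumptions made for illustration.

```python
import random
from collections import Counter


def train_test_split(ids, train_fraction=0.8, seed=0):
    """Randomly split manuscript IDs into training and test sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def aggregate_edges(learned_graphs, threshold=0.7, blacklist=frozenset()):
    """Keep directed edges found by at least `threshold` of the algorithms.

    `learned_graphs` holds one set of (parent, child) edges per structure
    learning algorithm; edges in `blacklist` are never kept.
    """
    if not learned_graphs:
        raise ValueError("need at least one learned graph")
    votes = Counter()
    for edges in learned_graphs:
        votes.update(set(edges) - set(blacklist))
    n = len(learned_graphs)
    return {edge for edge, count in votes.items() if count / n >= threshold}


# Illustrative input only: these edges are NOT results from the study.
graphs = [
    {("AutRatFem", "RevRatFem"), ("RevScore", "FinalDecision")},
    {("AutRatFem", "RevRatFem"), ("RevScore", "FinalDecision")},
    {("RevScore", "FinalDecision"), ("FinalDecision", "RevRatFem")},
]
forbidden = {("FinalDecision", "RevRatFem")}  # e.g., chronologically impossible
consensus = aggregate_edges(graphs, threshold=0.7, blacklist=forbidden)
```

With three hypothetical algorithms and a 70% threshold, only edges found by all three survive, so `consensus` here contains the single `RevScore` to `FinalDecision` link.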

It is worth noting that our data were imbalanced with respect to certain variables considered in the Bayesian network, namely the lower number of women among submission authors and the overrepresentation of manuscripts from the physical sciences. In principle, this implies that the learned structure of the network cannot be fully generalized to all manuscripts. However, all model diagnostics showed that these imbalances did not affect our results (see table S6). Therefore, we decided not to rebalance the data manually, which would have been difficult given the number of variables characterizing our dataset and, in any case, would have led to a loss of information.

### Gender guessing

The method used for gender guessing was inspired by previous research (1, 13, 48) and prioritized accuracy above other considerations (49). We followed a standard disambiguation algorithm recently validated on a dataset of scientist names extracted from the WoS database and tested with the same time window used in our study (50).

Gender was estimated for each individual record following a multistage gender inference procedure consisting of three steps, in order of priority. First, we performed preliminary gender guessing using, when available, gendered salutations (i.e., Mr., Mrs., Ms., etc.). Second, we queried the Python package gender-guesser about the extracted first names and country of origin, if any, to corroborate our procedure. To maximize accuracy, we did not follow gender-guesser for names classified as mostly_male, mostly_female, andy (androgynous), or unknown (name not found). Previous research shows that gender-guesser achieves the lowest misclassification rate and minimizes bias (50). Third, we queried the best-performing gender inference service, Gender API (https://gender-api.com/), and used the returned gender whenever we found a minimum of 62 samples with at least 57% accuracy. These confidence parameters for Gender API permitted us to comply with the optimal values ensuring that the rate of misclassified names did not exceed 5% [see Benchmark 2 in (50)].
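
The three-step priority order can be sketched as a simple cascade. This is a hedged illustration rather than the code used in the study: the salutation mapping, the function name, and the `(gender, samples, accuracy)` tuple standing in for a Gender API response are assumptions, and the gender-guesser label is passed in as a plain string so that the sketch runs without external packages or network access.

```python
# Assumed mapping from salutations to gender labels (illustrative only).
SALUTATIONS = {"mr": "male", "mrs": "female", "ms": "female"}
# Only unambiguous labels are accepted; mostly_*, andy, and unknown are not.
CONFIDENT_LABELS = {"male", "female"}


def infer_gender(salutation=None, guesser_label=None, api_result=None,
                 min_samples=62, min_accuracy=57):
    """Return (gender, source) following the three steps in priority order.

    guesser_label: label as returned by gender-guesser for the first name.
    api_result: hypothetical (gender, samples, accuracy) tuple, or None.
    """
    # Step 1: salutation, when available.
    if salutation:
        key = salutation.strip().rstrip(".").lower()
        if key in SALUTATIONS:
            return SALUTATIONS[key], "salutation"
    # Step 2: gender-guesser, accepted only for unambiguous labels.
    if guesser_label in CONFIDENT_LABELS:
        return guesser_label, "gender-guesser"
    # Step 3: Gender API, accepted only above both confidence thresholds.
    if api_result is not None:
        gender, samples, accuracy = api_result
        if (gender in CONFIDENT_LABELS and samples >= min_samples
                and accuracy >= min_accuracy):
            return gender, "gender-api"
    return "unknown", None


# An ambiguous gender-guesser label falls through to the third step.
result = infer_gender(guesser_label="mostly_male", api_result=("male", 100, 90))
```

Returning the source alongside the label makes it straightforward to tabulate how many records were resolved at each step.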

As a result, we were able to estimate the gender of 82% of referees and 77% of authors (table S7). The remaining scientists were assigned an unknown gender, a proportion that is in line with up-to-date nonclassification rates for names of scientists found in the literature (50). This approach is robust in that a human coder would hardly be able to resolve these uncertain cases and, if involved, could introduce further bias.

Our three-step gender guessing procedure was mostly based on gender-guesser (table S8), which is currently the best tool to assign gender to names by origin. We estimated the gender of 57% of authors and 63% of referees from this library, which also showed a fraction of misclassification under 5% [see table 6 in (50)]. Note that the validation performed by (50) limited misclassification to 1.5% for European names, 3.6% for African names, and 6.4% for Asian names [see table 5 in (50)]. We followed Gender API to assign the gender to 13% of referees and 16% of authors. The percentage of misclassification of this gender service was 2.1% for European names, 4.7% for African names, and 11.2% for Asian names [see table S5 in (50)]. Last, salutation was used to guess the gender of 4% of authors and 6% of referees.

## REFERENCES AND NOTES

1. E. Hengel, Publishing while female, in Women in Economics, S. Lundberg, Ed. (CEPR Press, 2020), pp. 80–90.

2. K. B. Korb, A. E. Nicholson, The causal interpretation of Bayesian networks, in Innovations in Bayesian Networks (Springer, 2008), pp. 83–116.

3. R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2018).

4. F. Karimi, C. Wagner, F. Lemmerich, M. Jadidi, M. Strohmaier, Inferring gender from names on the web: A comparative evaluation of gender detection methods, in Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion (International World Wide Web Conferences Steering Committee, 2016), pp. 53–54.

Acknowledgments: We would like to thank the IT office staff of all partners for support on initial data extraction. The analysis was carried out exploiting the Linnaeus University Centre for Data Intensive Sciences and Applications high-performance computing facility. Access to data was possible thanks to the “PEERE Protocol for Data Sharing,” co-signed by all involved partners on 1 March 2017.

Funding: This work was supported by the TD1306 COST Action “New Frontiers of Peer Review.” This work was also partially supported by the Spanish Ministry of Science, Innovation and Universities (MCIU), the Spanish State Research Agency (AEI), and the European Regional Development Fund (ERDF) under project RTI2018-095820-B-I00. A preliminary version of the manuscript received confidential comments from J. Marsh and A. Marengoni.

Author contributions: F.S. designed the study and wrote and revised the manuscript. F.G. coordinated data collection and wrote and revised the manuscript. P.D. collected and prepared data. G.B. designed and performed the analysis and wrote and revised the manuscript. M.F. designed the analysis and wrote and revised the manuscript. A.M. contributed to the study design and wrote and revised the manuscript. M.W. and B.M. contributed to the study design, provided data, and revised the manuscript. A.B. contributed to the study design and revised the manuscript.

Competing interests: B.M. declares a competing interest, being currently employed as Reviewer Experience Lead at Elsevier. A.B. declares a competing interest, being currently employed as Executive Editor in the Computer Science team at Springer Nature. M.W. declares a competing interest, being currently employed as Researcher Advocate, Content Peer Review at John Wiley & Sons and having stock options in John Wiley & Sons, his employer. None of them had access to the database, processed any version of the dataset, or was involved in data analysis. The authors declare no other competing interests.

Data and materials availability: Our dataset is made available at https://dataverse.harvard.edu/privateurl.xhtml?token=7b70ab08-b062-4589-b024-4584a130ed06 with all records required to rerun our analysis.