Does adding data increase citation?

Written on February 2, 2017 at 2:31 pm, by

There is a strongly held view that open access to articles and data will lead to more effective research. To this end, it would be useful if the public availability of research data could be linked to benefits for authors, such as increased citation of their work, as this would drive a virtuous cycle.

Piwowar et al (2007) examined the citation history of 85 cancer microarray clinical trial publications [*] and its correlation to the availability of their data. They found that the 48% of trials with publicly available microarray data received 85% of the aggregate citations. They claim that publicly available data was associated with a 69% increase in citations, independent of confounding factors such as journal impact factor, date of publication, and author country of origin.

The graph shows the number of citations received by each of the trials 24 months after publication, ranked by the impact factor of the journal they were published in. The graph also shows which trials had publicly available data and which did not.
Citation rates for 85 articles referencing array data
There is evidently a strong correlation between a journal’s impact factor and its data deposition policy, both of which correlate strongly with citation. But what is the causal relationship here?

[*] The 85 cancer microarray trials were published before early 2003 and identified in a systematic review by Ntzani and Ioannidis: Ntzani EE, Ioannidis JP (2003) Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. Lancet 362: 1439-1444.

Brazil’s CAPES meets to discuss the internationalization of national research journals

Written on November 22, 2014 at 11:49 am, by

The purpose of the CAPES* meeting held 29 October 2014 was to discuss proposals for contracting international publishers to edit Brazilian scientific journals – as is already happening in China, Japan and other countries.

“We want a model in which Brazilian research journals attain an international standard and visibility, without losing their identity” said President of CAPES, Jorge Guimarães.

During the meeting, the publishers Elsevier, Emerald, Springer, Taylor & Francis, and Wiley presented their proposals to a selected group of journal editors.

For CAPES, the initiative is part of a larger project aimed at the internationalization of Brazilian universities and their scientific output.

“No Brazilian university has provided to internationalize entirely, but it can internationalize by sectors,” added Guimarães.

Subsequently, in a joint communication, Sigmar de Mello Rode, President of ABEC*, and Abel Packer, Program Director for SciELO*, expressed concern that the CAPES initiative would favour only a small selection of the 280 national journals published on the SciELO platform, which together represent 25% of the country’s scientific output.

De Mello Rode and Packer argue that the CAPES program should do more to support the growth of an indigenous publishing industry. The CAPES announcement did not make clear what role it foresaw for SciELO in the future.

Note:

CAPES (the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) is the Brazilian Federal Agency for the Support and Evaluation of Graduate Education.

SciELO (The Scientific Electronic Library Online) is an electronic library covering a selected collection of Brazilian scientific journals. The library is an integral part of a project being developed by FAPESP – Fundação de Amparo à Pesquisa do Estado de São Paulo, in partnership with BIREME – the Latin American and Caribbean Center on Health Sciences Information.

ABEC is the Brazilian Association of Scientific Editors.

Impact Factor shock horror…!

Written on October 30, 2014 at 2:55 pm, by

If Impact Factors are not a good proxy for assessing the performance of publishing authors then what are they good for? If you plot out the number of citations received by a given journal over a specific period of time then you end up with a distribution that is severely skewed to the right. It is easy to imagine that the very high values are outliers that can’t be captured by any more formal description. Furthermore, if you compare the citation distributions for different journals then more sources of variation begin to appear. Journals with high impact factors are clearly humped, whereas journals with low ratings begin to look more like a simple exponential distribution with a lot of articles never receiving any cites at all. A figure from Dwight J. Kravitz and Chris I. Baker’s paper in Front. Comput. Neurosci. 2011 shows you what I am getting at here.

Citation plots

A couple of months ago, when I was working in Brazil, I had some spare time between classes and access to a range of information resources, including Scopus, so I wondered if there was a way of normalising these citation distributions. I tried binning and ranking the data as percentiles so that the extent of the y-axes would be comparable, but this just left me with a curve that looked like a negative exponential, so I tried logging it to see if I got a straight line. Well, almost! The curve curled up at the high end and curled down at the low end, but in between it looked fairly straight. Curiosity piqued, I decided to compare a group of life science journals with widely differing Impact Factors, publication volumes and peer-review standards. The result is shown in the next graph. [Each curve represents the journal's annual output of research papers (not reviews) ranked and sampled 150 times.]
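For readers who want to try the same normalisation, here is a minimal Python sketch of the rank, sample and log steps. It is not the code I used at the time: the function name, the +1 offset for uncited papers and the sample data are all assumptions made for illustration.

```python
import math

def percentile_profile(citations, n_points=150):
    """Rank papers from most to least cited, sample n_points evenly spaced
    positions along the ranking, and return (percentile, log10(cites + 1))."""
    ranked = sorted(citations, reverse=True)
    n = len(ranked)
    profile = []
    for i in range(n_points):
        idx = min(n - 1, (i * n) // n_points)  # rank position for this sample
        percentile = 100.0 * i / n_points
        # The +1 offset is an assumption so that uncited papers can be logged.
        profile.append((percentile, math.log10(ranked[idx] + 1)))
    return profile

# Made-up citation counts for two imaginary journals.
high = percentile_profile([250, 120, 80, 60, 45, 30, 22, 15, 9, 4], n_points=5)
low = percentile_profile([25, 12, 8, 6, 4, 3, 2, 1, 0, 0], n_points=5)
for (pct, high_val), (_, low_val) in zip(high, low):
    print(f"{pct:5.1f}%  high={high_val:.2f}  low={low_val:.2f}")
```

Because every journal is reduced to the same number of sample points, curves for journals of very different sizes can be laid over one another on a single logged axis.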

Journals

To my amazement each of the curves I got was broadly similar – the main difference being the positioning of the curves relative to the logged y-axis.

Why should I not have been surprised? Lotka’s Law, formulated in 1926, states that the number of authors making r contributions is about 1/r^a of those making one contribution, where a nearly always equals two. In other words, the number of authors publishing a certain number of articles is a fixed ratio to the number of authors publishing a single article. As the number of articles published increases, authors producing that many publications become less frequent. There are 1/4 as many authors publishing two articles within a specified time period as there are single-publication authors, 1/9 as many publishing three articles, 1/16 as many publishing four articles, etc. Though the law itself covers many disciplines, the actual ratios involved (as a function of ‘a’) are very discipline-specific. Subsequent work has shown that empirically the formula A(N+1-r)^b/r^a (where A is a scaling factor) is a better fit, as shown in the next and final graph. So all of these different citation distributions belong to a single family of curves differentiated by a single parameter, the average number of citations per percentile.
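The two formulas are easy to check numerically. The sketch below is illustrative only: the function names are mine, and in the second formula I am assuming that r is a rank running from 1 to N, which the text above does not spell out.

```python
def lotka_ratio(r, a=2.0):
    """Authors with r papers, as a fraction of authors with one paper."""
    return 1.0 / r ** a

def generalized_curve(r, n, A, a, b):
    """A * (N + 1 - r)**b / r**a, assuming r is a rank between 1 and N."""
    return A * (n + 1 - r) ** b / r ** a

# With a = 2 this gives the ratios quoted above for two, three and four articles.
for r in (2, 3, 4):
    print(r, lotka_ratio(r))

# With b = 0 the generalized form collapses back to A / r**a.
print(generalized_curve(3, n=150, A=1.0, a=2.0, b=0.0))
```

The extra (N+1-r)^b term only matters near the bottom of the ranking, where r approaches N, which may be one reason the simple power law bends at the low end.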

Lotka

What does this all mean? Two things, I think. First, journals do select and publish a sample of articles that reflect a defined range of citation values. Second, the actual citation numbers are generated according to a random process post-publication. In other words, the Impact Factor is a reflection of the selectivity of the peer review process, but the actual number of citations generated per article has no close relation to quality.

It is also interesting that PLoS ONE has the same profile as other low Impact journals such as Brain Research. Perhaps the new open access mega-journal isn’t so different from the old subscription-style mega-journals such as Brain Research and BBA after all?

Do metrics have a future?

Written on May 20, 2013 at 3:44 pm, by

The publication of the San Francisco Declaration on Research Assessment provides a welcome point of focus from which to debate the value of metrics that attempt to measure the volume and quality of contributions made to scientific progress by a country, funding agency, institution or individual researcher.

A major stimulus for this multi-publisher/agency review was the growing animosity of academics and their funders towards Thomson Reuters’ Journal Impact Factor (IF), a statistic based on averaged citations originally designed to help librarians identify journals for purchase, but increasingly used as a quantitative indicator of a journal’s quality, and, by implication, the papers published therein and their authors. In reality, the citation contributions of individual articles range over several orders of magnitude (see graphic), so whilst they are de facto a good predictor of the IF, the reverse is not true.

Nature cites3

The pressure to cut the influence of the IF back down to size comes not because metrics are generally a bad thing, but because, with a growth in scientific production (papers, datasets, etc), funders, institutions and researchers need some form of objective framework to assess and manage performance. The mistake, perhaps, has been to assume that estimators of quantity can also be used to assess quality, in this case contribution to scientific progress.

Open access publishing models and digitization of STM content generally have stimulated a growing number of alternative metrics whose variety and usefulness are championed by organisations such as ImpactStory and Altmetric. These new metrics include downloads, data from social media sites such as Twitter and Facebook, and information from online reference managers such as Mendeley. So there is a lot to choose from. But although the Declaration makes a number of recommendations about how the quality of research output should not be evaluated, it fails to put its finger on what it is that should be measured and how.

For example, its advice to funding agencies is to “consider the value and impact of all research outputs (including datasets and software) in addition to research publications, and consider a broad range of impact measures including qualitative indicators of research impact, such as influence on policy and practice”, but not to use journal-based metrics, such as Journal Impact Factors, as a surrogate measure of the quality of individual research articles. But what are these new impact measures to be?

Let’s start with citation statistics. Surely the number of times an article appears in a reference list provides some indication of the value of the cited article to the scientific community?

Well, the answer may be not to use statistical metrics at all. Citation may be a powerful form of social communication but it is not an impartial method of scholarly assessment. The distortions in article citation statistics include bias, amplification, and invention. Thus, according to Steven Greenberg, a neurologist who studies the meaning of citation networks, citation often ends up being used to support unfounded claims which can mislead researchers for decades. More to the point, it is possible to identify the evolution of our knowledge about a problem by characterising individual citations according to whether they present original authoritative ideas, are supportive, critical or unfounded.

Qualitative citation typing ontologies, such as the one just mentioned, could be created as part of the funding review process and/or as part of a formal program assessment process to review the contribution of individual projects. Once captured the data could be added to and graphed using open access bibliometric databases such as PubMed.

By making it clearer what the purpose of the measurement is, we stand a better chance of coming up with new metrics that work.

 

eLife: Nothing new under the eSun?

Written on February 28, 2013 at 4:05 pm, by

I did my PhD at UCL in the early seventies when neuroscience was only just beginning to exist as a subject in its own right. It wasn’t until 1980 that the first neuroscience department was established in the UK, so to do neuroscience I had to spend most of my time hopping between the physiology and anatomy departments (with occasional trips to zoology and biophysics to hear the likes of JZ Young and Bernard Katz lecturing.)

Not surprisingly then, one of my favorite reads was the Journal of Comparative Neurology, published by the Wistar Institute of Anatomy and Biology in Philadelphia, which had been transformed into a major communications channel for this new integrated discipline by its managing editor, W Maxwell Cowan.

When I became editor of the magazine Trends in Neurosciences (TINS) in 1979, I was lucky to find Max on my editorial board. He and other founding fathers of neuroscience, such as Louis Sokoloff, Eric Kandel and Ed Kravitz, were fierce critics of TINS in its early days, so Max and I spent a great deal of time figuring out how it could be improved. I was fortunate because Max had become the editor-in-chief responsible for the launch of The Journal of Neuroscience, the Society for Neuroscience’s flagship journal. We gave a lot of attention to streamlining the design of the editorial review processes and came up with a set of principles that, as well as working well for primary journals, provided a firm basis from which to manage review journals such as Trends too.

This process wasn’t a million miles away from the outline presented recently by eLife’s Editor-in-Chief, Randy Schekman (last slide).

eLife’s formula includes a swift triage process using editorial board members as reviewers, keeping review decisions simple, and requesting only essential revision requirements with quick assessment of revisions. So whilst it is a compelling editorial vision, it is neither new nor unique.

Furthermore, it has a tendency to break down as the journal becomes more successful and publication volumes increase: the simple “hockey stick” relationship that you can often see in eigenfactor.org’s animation plots of the evolving relationships between article impact and journal size.

So let’s wait and see how eLife’s quality metrics develop. They should be good, given the resources available. But will they become sustainably exceptional?

PLoS article level metrics

Written on February 25, 2013 at 3:37 pm, by

PLoS publishes a regular report covering a wide range of metrics for all of its journals. This is easy to download as a .csv file, from which you can quickly start to create graphical summaries using Excel or STATA. One problem with the way the data is currently presented is that each number represents a cumulative total over time, so causal hypotheses, such as the relationship between citations and downloads, are difficult to test.

Many of the metrics included in the report occur fairly infrequently, i.e. there are lots of zeros, and so these variables are probably telling us more about how widespread the use of the different channels is currently, rather than about features of the underlying content. Exceptions include Mendeley, Facebook and Twitter.
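A quick way to see how sparse each channel is would be to count the share of articles with a zero in each column of the downloaded file. The sketch below uses only Python's standard library. The column names and the four sample rows are made up for illustration and do not reflect the actual layout of the PLoS report.

```python
import csv
import io

def zero_fraction(rows, columns):
    """Fraction of articles with a zero (or blank) count in each metric column."""
    rows = list(rows)
    if not rows:
        return {}
    result = {}
    for col in columns:
        zeros = sum(1 for row in rows if int(row[col] or 0) == 0)
        result[col] = zeros / len(rows)
    return result

# Hypothetical sample standing in for the downloaded report.
SAMPLE = """doi,mendeley,facebook,twitter
10.0000/a,12,0,0
10.0000/b,7,0,3
10.0000/c,0,0,0
10.0000/d,25,40,0
"""

fractions = zero_fraction(csv.DictReader(io.StringIO(SAMPLE)),
                          ["mendeley", "facebook", "twitter"])
for metric, share in fractions.items():
    print(f"{metric}: {share:.0%} of articles have no activity")
```

To run this against the real file, replace the `io.StringIO(SAMPLE)` object with an open file handle and adjust the column names to match the header row.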

The box plots on the right show how the distributions of article references differ by journal (panels A, C & D) and over time (panel B). The three by-journal plots use papers published in 2012 as a reference point. As can be seen in panel B, Mendeley readers continue to add papers for a couple of years, whereas responses on Twitter and Facebook are more immediate (data not shown).

The Mendeley data generally have a well-defined, non-zero median, whereas Facebook and Twitter are largely made up of outliers – PLoS Medicine is the exception here.

Outlying PLoS ONE articles published in 2012 and selected by Mendeley readers tend to be review-like or methods-based…

  • An In Silico Study of the Differential Effect of Oxidation on Two Biologically Relevant G-Quadruplexes: Possible Implications in Oncogene Expression
  • Equilibrium of Global Amphibian Species Distributions with Climate
  • Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample
  • Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species

…whereas articles frequently posted on Facebook or tweeted seem to have a more popular appeal:

  • The Power of Kawaii: Viewing Cute Images Promotes a Careful Behavior and Narrows Attentional Focus
  • The Eyes Don’t Have It: Lie Detection and Neuro-Linguistic Programming
  • Lesula: A New Species of Cercopithecus Monkey Endemic to the Democratic Republic of Congo and Implications for Conservation of Congo’s Central Basin
  • Why Most Biomedical Findings Echoed by Newspapers Turn Out to be False: The Case of Attention Deficit Hyperactivity Disorder

 
Associations between these different metrics require more analysis once the data can be analysed as a time series, but there do appear to be some interesting relationships between citation rates and Mendeley readership numbers.

Mapping Social Sciences publishers

Written on February 15, 2013 at 2:25 pm, by

There is an increasing amount of data about STM publishing available freely on the web. Publisher/imprint/journal/ISSN relationships as developed for Elsevier’s Scopus database can be found here. Publicly accessible resources derived from Scopus and Thomson Reuters’ Web of Science can be viewed at Scimago and Eigenfactor. Scimago is especially useful as it contains journal statistics, such as the number of documents published, which can be downloaded as a CSV file. Elsewhere you can find information about journal pricing.

More detailed document level data can be downloaded from PubMed. This obviously is far more restricted in its field coverage, but it does contain information about article accessibility, useful for tracking the penetration of different open access business models.

SScombo

And if you want to know more about the value of article-level metrics, then head for PLoS.

Most of these different data resources can be zipped together using the journal title, ISSN and document identifiers (DOI) to link specific records.
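As a rough illustration of the linking step, here is a minimal Python sketch that joins two record sets on ISSN. The field names, the placeholder ISSNs and the sample values are all hypothetical. Real exports format ISSNs inconsistently (with or without the hyphen), which is why the key is normalised first.

```python
def normalize_issn(issn):
    """Strip the hyphen and surrounding spaces so '1234-5678' matches '12345678'."""
    return issn.replace("-", "").strip().upper()

def join_on_issn(left, right):
    """Inner join of two lists of dicts that each carry an 'issn' field."""
    index = {normalize_issn(rec["issn"]): rec for rec in right}
    merged = []
    for rec in left:
        key = normalize_issn(rec["issn"])
        match = index.get(key)
        if match is not None:
            # Fields from the left record win if both sides share a name.
            merged.append({**match, **rec, "issn": key})
    return merged

# Hypothetical rows: publication volumes on one side, pricing on the other.
volumes = [{"issn": "1234-5678", "documents": 120},
           {"issn": "0000-0000", "documents": 5}]
prices = [{"issn": "12345678", "price": 900}]

print(join_on_issn(volumes, prices))
```

Journal titles make a poorer key than ISSNs or DOIs because of abbreviations and renamings, so I would only fall back on them when no identifier is available.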

The two panels alongside demonstrate the sorts of things that can be achieved. I have grouped individual journal publication volumes across publishers and subject areas, calculated growth rates and citation ratios. The database takes a few hours to set up and debug, but then it is easy to formulate complex queries in a few minutes – for example, how does the citation impact and publication volume of a large society publisher depend on its output of Proceedings journals?

This example here is taken from the social sciences and maps publishers in terms of overall size, growth and citation impact and provides a quick reminder of who the major players in this journal ecosystem are.

The size of the circles in the upper panel correlates with the Scimago Journal Ranking scores (an algorithm similar to Google’s PageRank), which are presented in more detail (median, 95th percentiles, outliers) in the box plots below. It would be a straightforward matter to look below at the journal level and to chart changes over time (a topic for a future post).

This picture highlights the wealth of acquisition targets in the discipline (there are many more slightly smaller companies not shown in this view) all of which may shortly be challenged with implementing new open access options for their authors. Alternatively there is a great opportunity for a PLoS ONE clone here, though the experience of SAGE Open seems to suggest that the field still feels uncomfortable with open access ethics and may well choose ultimately to take the green route as the lesser of two evils. This will change…

Do major funding agencies support better research?

Written on January 23, 2013 at 12:54 pm, by

Too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding, conclude Joshua M. Nicholson and John P. A. Ioannidis in a recent issue of Nature. The authors concluded that three out of five of a carefully selected sample of highly cited principal investigators eligible for NIH support did not in fact receive it.

Not perhaps surprisingly, their claim turned out to be highly controversial (see, for example, the rebuttal by Santangelo and Lipman). Nevertheless, the question posed by Nicholson and Ioannidis is an extremely important one, and deserves further analysis.

NDBslide

Getting funding for research represents a key stage in the research cycle (See earlier blog) and one that is becoming more and more prescriptive in terms of determining what research is actually to be carried out and how. For example, the NIH receives many more grant applications proposing outstanding scientific projects than its budget can support. The overall success rate for grants during fiscal year 2011 was 18% – an historic low (See chart on right). This highly selective vetting of research intentions is becoming the norm.

Most HHMI grants are awarded through competitions that have a formal invitation and review process. Because HHMI awards are intended to achieve specific objectives through clearly established programs, it does not encourage and rarely funds unsolicited grant proposals. Does this process impact on the quality of the science subsequently published?

We took a look at the citation data generated by articles published in Nature during 2009 and asked whether the levels of citation achieved two years later, in 2011, for work supported by several top flight funding agencies (NSF, NIH, HHMI and Wellcome) differed at all from other work published in 2009. The box plot shown here summarizes these results.

The shaded boxes enclose the 25th and 75th percentiles and the ends of the whiskers indicate the 5th and 95th percentiles. The lines within the boxes represent the individual medians and the red line indicates the overall median for the 2011 citation sample. Each of the citation distributions is highly skewed, but the medians for the four funding agencies consistently exceed the overall median.

Biology is likely to be a confounding factor here (biomedical papers tend to be cited more than non-biomedical ones), but the approach can be further refined by using additional metrics, such as how sustained paper citations are over time, an indication of whether a paper has been of consistent value.
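For anyone wanting to reproduce the box plot statistics described above, the five numbers can be computed with Python's standard library. This is only a sketch: the function name is mine, and the choice of the "inclusive" interpolation method is an assumption, since plotting packages differ in how they estimate percentiles.

```python
import statistics

def box_summary(values):
    """5th, 25th, 50th, 75th and 95th percentiles of a list of citation counts."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "p5": cuts[0],
        "p25": cuts[4],
        "median": cuts[9],
        "p75": cuts[14],
        "p95": cuts[18],
    }

# Illustrative input only: the integers 0 to 100, not real citation data.
print(box_summary(list(range(101))))
```

Comparing each agency's "median" value with the median of the whole sample gives the same reading as the red line in the plot.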

Impact of Open Access in different biomedical areas

Written on January 18, 2013 at 11:19 am, by

Following my last post, I have looked more closely at two of the biomedical areas used in the last analysis. I have used the search terms “epilepsy” and “genomics” to pull off all of the records in PubMed published in these separate areas since 2000.

GenEpi

Then I consolidated these data so that I could compare the numbers of articles published in different journals and use PubMed’s “full text” and “free full text” limits to compare how much of this published material is available on some form of open access basis, either via PubMed Central or via the publisher’s site.

The results are displayed in the Table above which shows the top 10 journals in each field ranked by the number of field-specific articles they published in 2012 and comparing what proportion of this is available without the need for a subscription.

As predicted, in a well-established area such as epilepsy, commercial publishers like Elsevier and Wiley dominate, whereas in genomics several open access publishers are already prominent. A second very important factor is the proportion of articles published by authors funded by NIH and other mandating agencies.

By looking in this way at the open access statistics from the viewpoint of specific types of end-user it becomes much easier to see how quickly (or slowly) open access business models may disrupt existing journal ecosystems.

Open access – have we reached the tipping point?

Written on January 15, 2013 at 10:38 am, by

“Open access” only really gets interesting when users and product developers have access to a critical mass of content. I’ll leave the definition of “critical mass” vague at this point, but it probably means something like 80-90% of the most cited content in a field such as biomedical research. But for this to happen the funding agencies across the discipline need to take hold of the reins and shape the conversion process.

PMCOA

Last year a number of key European funding sources did give their full backing to open access publishing. The UK government announced that it would require much of the country’s taxpayer-funded research to be publicly available from April 2013 onwards, and the European Commission said that it would require all work funded by its Horizon 2020 research program to be freely available. Can we now begin to imagine an STM publishing environment in which open access is the dominant business model?

As is often the case, the future is a little easier to see in the US where since 2008 the NIH’s Public Access Policy has required scientists to submit final peer-reviewed manuscripts derived from work supported by NIH funds to the digital archive PubMed Central. Initially compliance was an issue, but from Spring 2013 NIH will delay processing of non-competing continuation awards if publications arising from grant awards are not in compliance with the Public Access Policy.

So how much free full text access is there? The graphic above compares the number of articles available as free full text as a proportion of the total volume of full text content available over time on PubMed. Simple key words such as “epilepsy” and “muscular dystrophy” were used to select a particular corpus and the total numbers of articles computed using the “Free full text[sb]” and “Full text[sb]” limiters.
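The same counts can be requested programmatically from NCBI's E-utilities ESearch service. The sketch below only builds the query URLs and computes the ratio, and it makes no network calls. Restricting each query to a single publication year with the [dp] tag is my assumption about how to get a per-year series, and the helper names and the example counts are illustrative.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_url(keyword, year, subset):
    """ESearch URL asking only for the number of PubMed records that match
    a keyword, a publication year and a subset such as 'free full text'."""
    term = f"{keyword} AND {year}[dp] AND {subset}[sb]"
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "rettype": "count"})

def free_fraction(free_count, full_count):
    """Share of full text records that are free; 0.0 if there are none."""
    return free_count / full_count if full_count else 0.0

print(count_url("epilepsy", 2010, "free full text"))
print(count_url("epilepsy", 2010, "full text"))
# After fetching both URLs and reading the counts, for example:
print(free_fraction(30, 100))
```

If you do automate the requests, keep within NCBI's published usage limits for E-utilities.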

The fraction of available content that is free varies quite significantly across the five fields, being highest for “genomics” and the orphan disease “leishmaniasis”, and lowest for “epilepsy”. All of the trajectories increase gradually over time, reaching levels of 30-50% by 2011. The rapid fall-off after that is caused by the embargo periods of 6-12 months imposed by many publishers contributing OA content.

Why the differences between the fields? This question needs more research, but from visual inspection of the PubMed result lists it would seem that new areas such as genomics are driven more by younger journals that have fully embraced the open access model, whereas fields such as epilepsy are still dominated by society and larger commercial publishers that haven’t.

If that turns out to be the case, then the future dynamics of conversion to open access (at least in biomedicine) will be determined by the interaction between the leading funding agencies and top publishers.

This will have two interesting consequences. Firstly, as the funding agencies’ OA mandates become actively enforced, authors will turn increasingly to OA publishers in order to meet these requirements with a minimum of hassle, and journals such as PLoS ONE and repositories like Europe PubMed Central will flourish. Secondly, open access “contagion” will spill over more rapidly into other disciplines such as engineering and chemistry. After all, the Horizon 2020 program is not just about biomedicine!

So, have we reached the tipping point? This analysis suggests that the 80-90% target cannot be reached by incremental growth. Something more has to give.