Introduction

Research institutions build their research agendas in an increasingly international, dynamic, and diverse landscape. Many of these efforts are increasingly cross-disciplinary, either in collaboration or competition. Various institutions are motivated to inform their activities through the analysis of available bibliometric data. Understanding a specific research landscape is critical to steering global and multidisciplinary research agendas (Wallace & Ràfols, 2015, 2018). Different approaches tend to emerge in research areas with an international scope, gathering varied research interests from diverse origins and disciplinary backgrounds. One way of observing these complex groupings is to assess emerging trends over time (Ràfols et al., 2010; Zitt & Bassecoulard, 1994). It is thus possible to describe the emerging tendencies and follow their evolution.

However, such an interpretative approach may reveal not so much the strategic choices of authors and affiliated institutions as the limitations experienced by researchers in their academic environment (Cruz-Castro & Sanz-Menéndez, 2018). Institutional structures tend to reflect the history of the scientific endeavor (Trow, 1999, p. 4). These questions are important when considering the emergence and circulation of new themes. Local interests may differ from those appearing in the global arenas, but research in other languages tends to be overlooked in English literature (Chi, 2012). How global transformations of research affect local scientific choices is not only a challenging and interesting question; it is also central to understanding the growing globalization processes (O’Brien & Arvanitis, 2019).

We present the work done at the Global Research Institute of Paris (GRIP), a new French institute seeking to map, from an institutional perspective, research areas with a global scope. In particular, GRIP wanted to investigate the possible distance between its own strategic choices and those found in the French and international literature. We have chosen an extended period, from 1980 to 2020, to better grasp the knowledge aggregation process behind these issues.

GRIP is an interdisciplinary institute focusing on globalization beyond its usual economic dimensions. GRIP builds upon the broad and diverse social science landscape inside the perimeter of the Université de Paris. Its interpretation of the concept of “Global Research” is bound to a vast array of research units scattered over Paris and relates to the broad landscape of social sciences in France. The multilingual publishing practices toward diverse audiences in social science research become a privileged standpoint of analysis (Kulczycki et al., 2020).

The differences in language use in science have long been raised (Garfield, 1990). However, growing technical capabilities are now making it possible to develop a discussion of this kind empirically. As international indexes limit the inclusion of many sources published in languages other than English, sources in other languages can become a suitable option to understand alternative approaches to similar subjects. Studies on other European countries show that documents not indexed in citation indexes are, indeed, a valuable and often untapped source of information (Chi, 2014).

Languages spoken in countries with larger scientific communities can support debates somewhat detached from mainstream preoccupations, a particularly notable feature, since different languages imply distinct circulation channels for published texts. Nevertheless, it has been stressed that the English language is an important vehicle for international recognition (Meneghini & Packer, 2007). Research has shown that, contrary to common assumptions about the globalization process, national research landscapes have strengthened their collaboration since the 2000s (Maisonobe et al., 2016). In matters of Global Research, such as those behind GRIP’s interpretation of the subject, international acceptance goes beyond prestige; it is also necessary to establish truly global scientific dialogue and collaboration.

GRIP chose to concentrate on three thematic areas: Global Urbanities; Circulations; and Technologies, Markets and Vulnerabilities. These topics were chosen through exchanges among academics located in the University, taking into account their interests, research trajectories, and previous projects, but with no previous strategic analysis. It was thus interesting to see whether the Institute’s topics diverged from or were coherent with the available literature, by comparing them with topics appearing in the international ‘mainstream’ publications and those in French publication repositories.

Materials and methods

This article charts literature related to the Global Research themes that inspired GRIP’s development in two different languages (French and English). To better explain the analytical operations conducted, a detailed account is included in Fig. 1, which numbers each of the steps followed. Specific queries were built to capture documents involving each of the Institute’s thematic areas. As it remains crucial to situate these thematic choices, our research focuses exclusively on French authors. To represent GRIP’s scientific objects and interests, we shared our research queries with a sample of researchers involved in the Institute’s thematic areas. In this way, we focused on refining a precise query for each thematic area. After collecting and examining our first harvest of documents, the feedback from these same researchers became an input to fine-tune our queries.

Fig. 1. Main operations conducted during the analysis

As a result, six different queries for information retrieval were written, one for each research axis in each of the two chosen sources. For English documents, information was retrieved from the Web of Science (WoS) database. Items written in French were accessed from the HAL-SHS collection using the API made available by the platform (CCSD, 2021). HAL-SHS is a mandatory repository for French researchers that has been poorly explored and remains highly relevant, since regional databases are an essential source to study the research output of the social sciences (Engels et al., 2012). In both cases, the fields we retrieved were the titles, abstracts, keywords, authors, and publication years. In WoS, the search strategy was based on the documents’ topics. In HAL-SHS, full texts were available and searched; we excluded doctoral theses and other teaching-related documents. We refined and improved our search strategy through consecutive iterations, analyzing the results in a continuous dialog with the researchers, who are experts in the field. As a result, we established the query lines for each language shown in the “Appendix” (Table 1).
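The retrieval step for the French corpus can be sketched as a request URL for the public HAL search API. This is only an illustration under stated assumptions: the actual query lines are those in the “Appendix” (Table 1) and are not reproduced here, so the search term below is a hypothetical placeholder. The endpoint and the Solr-style field names (`title_s`, `abstract_s`, `keyword_s`, `authFullName_s`, `producedDateY_i`, `language_s`, `docType_s`) follow our reading of the HAL API documentation and should be checked against it; the restriction to the HAL-SHS collection is left out of the sketch.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the HAL search API (verify against the HAL documentation).
HAL_SEARCH_ENDPOINT = "https://api.archives-ouvertes.fr/search/"


def build_hal_query(terms, year_from=1980, year_to=2020, rows=100):
    """Build a search URL for French-language records in a given year range.

    `terms` is a free-text query string; the value used below is a
    hypothetical placeholder, not one of the study's query lines.
    """
    params = [
        ("q", terms),
        ("fq", "language_s:fr"),                                # documents written in French
        ("fq", f"producedDateY_i:[{year_from} TO {year_to}]"),  # publication year range
        ("fq", "-docType_s:THESE"),                             # leave out doctoral theses
        ("fl", "title_s,abstract_s,keyword_s,authFullName_s,producedDateY_i"),
        ("rows", rows),
        ("wt", "json"),
    ]
    return HAL_SEARCH_ENDPOINT + "?" + urlencode(params)


# Hypothetical example term, for illustration only.
url = build_hal_query('"ville globale"')
print(url)
```

The function only assembles the URL, so it can be inspected before any request is sent; fetching and paging through results would be a separate step.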

In this process, we gathered 1798 English documents from the Web of Science (WoS) and 4545 French documents from HAL-SHS, both between 1980 and 2020. The query targeted articles, books, book chapters, and research communications in both sources. We processed the two resulting collections with the help of the CorText platform, using NLP techniques to obtain a fine-grained, meaningful set of 750 terms describing the main semantic features of both collections. These lists of terms were manually curated (see Footnote 1) to eliminate noise and shallow terms in both French and English.

A Period Detector script was used to capture the semantic structure. We analyzed the frequency distribution of significant terms over the years and computed the optimal partition into successive periods. We selected only the terms with a minimum frequency of eight: 376 terms in French and 351 in English. We can thus represent the distance between the semantic composition of our data for every pair of time steps on a scale from most similar (0) to most dissimilar (1). Periods were computed using the gap statistic method (Tibshirani et al., 2001). As a result, we could improve the stability of differences among subsequent years. In this way, we gain access to the underlying knowledge aggregation process reflected in these documents. This method also allows us to grasp the inner coherence of each detected period, as the distance between all pairs of years included within each period has been computed and presented using the same scale (from 0 to 1).
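To make the year-to-year distance and the inner coherence of a period more concrete, the sketch below computes both on toy data. It rests on assumptions not stated in the text: the distance is taken to be cosine distance between yearly term-count profiles, and inner coherence is taken to be the mean pairwise distance within a period. The measure actually used by the Period Detector script may differ, and the term counts are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two term-count dicts.

    0 means identical term profiles, 1 means no shared terms.
    """
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def inner_coherence(year_profiles, years):
    """Mean pairwise distance between all years of one period (lower = more coherent)."""
    pairs = list(combinations(years, 2))
    if not pairs:
        return 0.0
    total = sum(cosine_distance(year_profiles[a], year_profiles[b]) for a, b in pairs)
    return total / len(pairs)


# Invented term counts per year, for illustration only.
profiles = {
    2005: {"sustainable development": 4, "food security": 2},
    2006: {"sustainable development": 2, "food security": 1},
    2007: {"global city": 3},
}

d_similar = cosine_distance(profiles[2005], profiles[2006])     # proportional profiles
d_dissimilar = cosine_distance(profiles[2005], profiles[2007])  # no shared terms
coherence = inner_coherence(profiles, [2005, 2006, 2007])
print(round(d_similar, 3), round(d_dissimilar, 3), round(coherence, 3))
```

In this toy case the first two years use the same terms in the same proportions, so their distance is close to 0, while the third year shares no term with them and sits at distance 1.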

Textual analysis is a mapping technique used to assess research landscapes and identify thematic domains (Barbier et al., 2012). In order to describe the topical structure of the analyzed documents in French and English, we produced a co-occurrence network with the same terms used for period detection. We used the Louvain algorithm (Blondel et al., 2008), a bottom-up modularity-based clustering algorithm, to detect emerging semantic clusters. Given the non-deterministic nature of this process, slight variations can occur between computations. Nevertheless, given the size of the calculated networks, we computed additional runs, which yielded a community structure consistent with the one reported here.
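As a rough illustration of how a term co-occurrence network can be assembled before clustering, the sketch below counts, for each pair of terms, the number of documents in which both appear. Raw document-level counts are an assumption made for simplicity; the proximity measure applied on the CorText platform is not specified in the text and may be different. The documents and terms are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, min_weight=1):
    """Return {(term_a, term_b): weight} for unordered term pairs.

    The weight is the number of documents containing both terms; pairs
    below `min_weight` are dropped. Pairs are stored in sorted order.
    """
    counts = Counter()
    for terms in documents:
        # Deduplicate terms within a document, then count each pair once.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return {pair: w for pair, w in counts.items() if w >= min_weight}


# Invented documents, each reduced to its list of detected terms.
docs = [
    ["global city", "value chain", "innovation"],
    ["global city", "value chain"],
    ["food security", "rural development"],
]

edges = cooccurrence_edges(docs)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

The resulting weighted edge list is the kind of input a community detection routine such as Louvain expects; the clustering step itself is not reproduced here.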

We interpreted these clusters as the thematic structure of the documents in our corpus. To reduce the complexity of the resulting network and optimize the modularity of its partition, we defined a resolution parameter (Lambiotte et al., 2014). For this optimization, we used an algorithm (Aynaud, 2020) to select a resolution value of 2 from a range of 0.1 to 4.9.
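The role of a resolution parameter can be illustrated by computing the modularity of a given partition at different resolution values. The sketch below uses one common formulation, in which the resolution multiplies the expected-edge term; this is an assumption made for illustration, and the cited implementation may parametrize resolution differently (for instance, as a time scale), so the numbers are not comparable with the value of 2 reported above. The toy graph is invented and assumed to have no self-loops.

```python
def modularity(edges, partition, resolution=1.0):
    """Modularity of `partition` on an undirected weighted graph.

    edges: {(u, v): weight}, each edge listed once, no self-loops.
    partition: {node: community_id}.
    Computes sum over communities of L_c / m - resolution * (d_c / 2m)^2,
    where L_c is the internal edge weight and d_c the total degree.
    """
    m = sum(edges.values())
    internal = {}
    degree = {}
    for (u, v), w in edges.items():
        cu, cv = partition[u], partition[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / m - resolution * (d / (2 * m)) ** 2
        for c, d in degree.items()
    )


# Toy graph: two triangles joined by a single bridge edge (c, d).
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_merged = modularity(toy_edges, one_group)
q_split_res2 = modularity(toy_edges, two_groups, resolution=2.0)
print(round(q_split, 4), round(q_merged, 4), round(q_split_res2, 4))
```

At the default resolution, splitting the toy graph into its two triangles scores higher than keeping it as one group, which is the kind of comparison a modularity-based algorithm relies on when choosing a partition.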

Clusters were manually labeled according to their structure and contents, using the terms’ weight and centrality within the cluster as a benchmark. French clusters were labeled in English to make them more accessible to readers. Although the detected French terms were kept in their original language in Fig. 5, they were translated into English by the authors and included in the “Appendix” (Table 3).

We studied the overall structure of the clusters, considering the relevance of each term in the context of GRIP and its connection structure within the cluster. The resulting terms were treated taking into account criticism of co-word analysis and their relation to the context of use (Leydesdorff & Hellsten, 2006). In the term detection phase, our NLP approach allowed us to grasp a more complex set of terms (i.e., noun phrases) that better describe expressions in a language-in-use setting, detecting lexical variations in a broader context. For clarity, we included a detailed description of the terms composing each cluster and the overall density of each topical domain, to show how well connected the terms inside each cluster are.
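The density of a cluster can be read as the share of possible term pairs that are actually linked. The sketch below uses the standard unweighted graph density, 2E / (n(n - 1)); whether the reported densities were computed this way (for example, with or without edge weights) is an assumption, and the cluster shown is invented.

```python
def cluster_density(nodes, edges):
    """Unweighted density of the subgraph induced by `nodes`.

    edges: iterable of (u, v) pairs, each undirected edge listed once,
    no self-loops. Returns a value between 0 (no links) and 1 (all pairs linked).
    """
    node_set = set(nodes)
    n = len(node_set)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges if u in node_set and v in node_set)
    return 2 * internal / (n * (n - 1))


# Invented cluster of three terms with two of the three possible links present;
# the fourth edge leaves the cluster and is ignored.
cluster_terms = ["global city", "value chain", "innovation"]
network_edges = [
    ("global city", "value chain"),
    ("global city", "innovation"),
    ("innovation", "food security"),
]

density = cluster_density(cluster_terms, network_edges)
print(round(density, 3))
```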

We compared the overall structure of the terms describing the French and English corpora. During our analysis, we stressed two structural features to characterize the discussions undertaken by French authors when comparing both languages: first, noun phrases related to theoretical approaches and objects of study; second, those pointing towards the geographical scope of the research and the stakeholders involved.

Findings

Publication growth over time

This analysis describes the themes related to the scientific interests of GRIP in publications by French authors in two different languages. When looking at the total number of documents complying with the criteria shown in the “Appendix” (Table 1), we found that, within these issues as defined above, more documents were published before 2005 in French (393 documents) than in English (115 documents), and publications grew earlier and faster in French than in English (Fig. 2). When comparing both curves, there is a significant time lag between French and English documents. The volume of publications around these issues picked up in French between 2002 and 2003, but only six years later in English, between 2008 and 2009. The growth in French seems to feed English-language papers, but with a significant time lag. Finally, we observe that the number of published English-language documents stagnates from 2018 onward.

Fig. 2. Number of documents representing the Institute’s thematic interests in English and French

Detected lexical period structure

The underlying semantic structure, summarized in Fig. 3 (see Footnote 2), allows us to understand how this growth unfolds. It charts how closely the terms used in one year resemble those used in the other years of the series, using a scale from 0 (green) to 1 (white). The higher the value, the more dissimilar those two years are. Dissimilarity is represented by whiter cells and similarity by darker blue-green cells. On the upper right of each graphic, the detected periods of greatest similarity are represented using the same color scale. The resemblance of the pairs of years within each detected period (i.e., inner coherence) is calculated and presented on the same scale (from 0 to 1). This analysis allows us to better understand how consistent and consolidated each period is.

Fig. 3. Period detection and significant term correlation between years in French and English

In French, three short periods, between four and eight years in duration, are detected. Going from 1982 to 1990, the first one has a low inner coherence (0.88; values closer to 1 indicate greater dissimilarity). The second (1991–1997) and the third (1998–2004) periods deal with a large variety of issues; hence they also show low inner coherence (0.91 and 0.92, respectively). The fourth and last period is the longest, from 2005 to 2020. It is also the one whose documents resemble each other the most (0.65). Periods relate to each other, showing a steady knowledge aggregation process when compared to this last period.

In English, the similarity between the papers over the years is more irregular and weaker. The inner coherence of the first five periods illustrates this: 1980 to 1992 (0.85), 1993 to 1995 (0.98), 1996 to 1999 (0.92), 2000 to 2003 (0.91), and 2004 to 2007 (0.90). Here, the detected periods are more numerous (six in total) and also shorter. Shorter periods mean that topics remain stable for a shorter time. Also, the resemblance between periods is not as strong as in the French publications, and the aggregation process is weaker, showing that some of the issues addressed do not recur over time. In general, the debate in English appears looser and more erratic. It is also important to stress that it is not until 2008 that the English debate starts showing stronger internal cohesion (0.71).

The feeding process observed in the publication growth over the years is consistent with these results. In the final detected period, the debate is better structured in French than in English (0.65 against 0.71), and this consolidation begins earlier: in 2005 in the French debate and in 2008 in the English one.

Overall thematic features

The most salient characteristics of documents published by French authors in English on topics related to GRIP’s interests are summarized in Fig. 4, while a detailed list of the terms that constitute each thematic domain is given in Table 2, available in the “Appendix”. When applying our parametrization to the set of curated terms describing the retrieved documents, we obtained six meaningful clusters representing the main thematic domains. A first cluster (c1eng), labeled Global Cities & collaboration networks, focuses on global and world cities. Many perspectives arise on the subject, such as global value chains, innovation studies, or regional studies. Regarding key locations emerging from this analysis, only one city (New York) and two countries (United Kingdom, South Africa) appear in our outline.

Fig. 4. English terms co-occurrence network

A second cluster (c2eng) relates to energy consumption & policy implications. It centers mostly on the financial, manufacturing, and trade (policy) implications of the current energy matrix. It appears oriented chiefly towards developed economies. Notably, the World Bank is a relevant stakeholder connected to this discussion. A third thematic domain (c3eng) links to corporate governance & sustainability performance. Here, we have detected close relations between corporate governance, corporate responsibility activities, and specific performance areas, such as chain management and business practices. With a particular emphasis on sustainability, its main focus is on emerging economies.

Most notably, science and technology issues arise in connection with socio-historical approaches. A specific set of preoccupations thus emerges around science and technology & history of science (c4eng). It seems notably oriented towards the European Union and the higher education institutions within its boundaries. Health issues that emerge in public health & global health (c5eng) relate primarily to chronic diseases, reproductive health, and mental health problems. The World Health Organization (WHO) appears as a relevant actor for these matters, as do social movements. These actors tend to be mentioned separately, in relation to either reproductive health (social movements) or public health (WHO).

Finally, sustainable development & natural resource management (c6eng) shows sustainable development concerns associated principally with land-related and agricultural issues. Local actors appear as the relevant stakeholders; West Africa is linked to food security and Latin America to rural development.

French literature on the same topics shows a different configuration. Here, seven clusters result from our analysis. A detailed list of the detected clusters and their composing terms is available in the “Appendix” (Table 3), while a graphic representation of this research landscape is available in Fig. 5.

Fig. 5. French terms co-occurrence network

A first cluster (c1fr) stresses the interest of French authors in Africa when addressing global research issues. Under the umbrella of Africa & contemporary societies, Africa relates mainly to sanitary risks and waste management. Daily life descriptions in urban and rural settings characterize much of the research done in France about Africa. We did not identify any particular actors in this cluster, but it relates to specific places on the continent, such as Sub-Saharan or Central Africa.

In production systems & sustainable development (c2fr), co-occurring terms show a structure similar to that of c6eng and c3eng. For French authors, this domain shows detailed accounts of processes such as food production and commercialization within a global scope. Corporate Social Responsibility is present, but not as markedly as in its English counterpart. No global actors are mentioned in this cluster; only mentions of the rural world, food systems, and global markets are worth stressing. They give a sense of the orientation of French-published research on the matter.

Also related to food production, common agricultural policy & price volatility (c3fr) gathers emergent relations between different terms describing European agricultural policies. The most relevant actor here is the European Union, but there are also minor mentions of West Africa and France.

European rural settlements from prehistory to the Middle Ages, and historical approaches to land use, are the central preoccupation in rural settlement & historical approaches (c4fr). A narrower perspective appears in mid-term & urban areas (c5fr), stressing relations between urban policies, economic activities, and citizen rights in the city. Attention to the development of cities here stresses mid-term implications rather than deep historical roots. Local communities stand out as the most relevant actors. At the same time, various geographical settings emerge, such as Latin and North America, or specific places and countries such as Rio de Janeiro and Ivory Coast/Côte d’Ivoire.

A particular set of interests gathers here in new stages & new era (c6fr) in tight connection to understanding processes of change and revolution. It points toward relations between historical time—Middle Ages, XVII, XVIII, and XIX centuries—and future scenarios—XXI century. It focuses on objects such as the French language, urban space, tourism, scientific research, and the educational system. No relevant stakeholders worth mentioning are detected here.

Finally, social space & humanities and social sciences (c7fr) arises as perhaps the most diverse cluster in the French-published documents. It contains a mostly sociological set of interests, building strong bonds between different descriptions of social structure. It uses primarily qualitative approaches towards culture, political order, new digital technologies, and discourses. It also focuses mainly on the local scale and on urban settings. Urban research emerges in relation to small and medium-sized towns, without any particular city being mentioned. New technologies appear as a relevant subject in relation to social transformations. Regarding geographical entities, France and Southeast Asia are the main locations for this kind of research.

Comparison between languages

We compared the thematic domains resulting from the queries that represent each thematic interest of GRIP in English and French to identify convergence and divergence in the two bodies of literature. The most notable shared interest is sustainable development, which in French relates to production systems (c2fr), while in English it relates to three different clusters focusing on energy consumption (c2eng), natural resources (c6eng), and corporate governance (c3eng). Similarly, science and technology research is a self-standing interest in English (c4eng), while it is subsumed in a larger cluster in French (c7fr). Interest in Global Cities is also clearly identified in English (c1eng), while in French it relates to the broader concept of social space (c7fr). English publications are neatly divided into specialized clusters, whereas in French the same concerns are embedded into larger conceptual categories.

Our semi-automatic approach to text analysis has allowed us to observe a rather holistic approach towards Global Research in the French language. Themes such as social space (c7fr) or sustainable development (c2fr) do not appear as singular objects but rather as connected to a diverse collection of research objects within the cluster, indicating tighter linguistic and semantic connections within the French corpus.

Descriptors of French documents show two other traits worth mentioning: the predominance of historical approaches, and the preeminence of Francophone countries and Latin America as research locations.

Finally, a clear difference between the two bodies of literature is the scale of the objects mentioned. For instance, when discussing Global Cities in English (c1eng), New York arises as the only relevant city, whereas the focus in French is on small and medium-sized towns. Something similar happens with the actors mentioned in the clusters: the English-language literature mentions large international governance bodies (World Health Organization, World Bank, United Nations), while French documents emphasize local communities and regional actors. Also, no specific actors arise as relevant terms in French; they are instead referred to within a larger concept such as the rural world or food systems.

Discussion

Results have shown how similar topics can be addressed at different paces and with distinct thematic orientations in two different languages. As the volume of documents rises first in French and, later, in English, it is natural to assume that many of the ideas developed in one language get translated and adapted to the other. Our results on the semantic structure of the debate in each language confirm this “natural” pattern: the production in French has a tighter, firmer, and stronger underlying structure. Topics appear over a longer time frame, a fact we could relate to a long and sustained tradition on these subjects, with a steady number of documents published in the first of the three periods. The last detected period of 15 years (2005–2020) speaks of the stabilization of the researchers’ main concerns as reflected in the French documents. The delay between French and English-language publications on the same topics may be explained by a publication strategy in which publishing English-language articles or books takes place after a maturation period that is visible in the previously published French-language literature. It has been suggested that French journals might be less selective than English-speaking journals. We very much doubt this assumption. Given the growth in the number of English-language journals worldwide, including the so-called predatory journals, we tend to believe that French-speaking journals, particularly in the social sciences, are as selective as English-speaking ones (see Footnote 3). But the cost of access to a French-speaking journal is lower for a French scientist, simply because of language. Moreover, the geographical analysis of global science collaborations through co-authorships shows that inter-institutional collaborations at the national level have strengthened (Maisonobe et al., 2016). Thus, the preference for French as the first publishing language, followed by English as a second choice, corresponds to a strategy: matters of interest are first widely discussed in French and then published in English.

We could then use the metaphor of ‘carry publishing’ (as we talk of carry trading in the financial investment business), since we can consider the articles in French as a local investment that matures over time and pays off in the long term through publication in English in the international arena. French science publications in all domains, including the social sciences, have grown all over France as a result of active policies followed over the last 40 years to strengthen smaller cities, local universities, and regional research centers rather than the larger metropolises (Milard & Grossetti, 2019). These new research units also tend to promote “international” publications, mainly in English, which correspond to the suggested ‘carry publications’, from a growing research population scattered over a wider range of cities and regions. Finally, in the promotion and research evaluation processes of the social sciences in France, the contention about English-language journals has been very intense, to the point that the national research evaluation authority (HCERES), as well as several evaluation commissions in national research organizations, have abandoned the use of pre-established lists of journals in several social sciences, as well as the use of bibliometric indicators.

This has implications for a better understanding of knowledge circulation dynamics and of the interaction between English and non-English scientific literatures (Kulczycki et al., 2018). This process takes place over the long term (Gordin, 2015) and relates to both strategic scientific choices and structural social constraints in the production of scientific knowledge (Hanafi & Arvanitis, 2014). The flat slope in the publication rate from 2018 onward allows us to assume that these ‘carry publishing’ strategies have limits. Here, novelty appears as an inherent and solid constraint, since international interest will decrease over time.

When looking at the thematic orientation in French documents against those published in English, it becomes clear that these publishing practices have practical implications for the kind of research published in each language. The investment of knowledge and capabilities accumulated in French is not directly translated to English. Authors make changes when they present their research to an international audience reading in English and do not simply translate their work into English. Different objects are addressed at different scales in English, establishing a particular set of connections towards specialization rather than comprehensive approaches, as seen in French.

Regarding our data sources, some limitations are worth mentioning. We have resorted to a novel data source, HAL-SHS, for the French part of our analysis. Although this is a rapidly evolving open science initiative, many normalization tasks are still lacking. These restrictions limit the possibilities of running traditional bibliometric analyses based on journals, authors, or institutions, as this information is not yet available or complete for records within HAL-SHS.

Further research should investigate these publication practices among non-English speakers; the particular case of French is of specific interest since it goes well beyond France (e.g., French-speaking Africa and North African countries publish more in French than French scientists do). How does the authorship structure develop in each language? What role do authors’ choices play in this process? How do research institutions, evaluation standards, and international competition (as reflected in university rankings) impact the publication process?

Conclusions

This research has shown that, on issues of global relevance, language matters. Specific interests can develop in a particular way within specific linguistic and geographic boundaries. The locally developed perspectives in French eventually migrate to the English language although serving different purposes, and not as simple translations from an original French-written version. Our results indicate a maturation process of thematic choices over time and different relative positions of these thematics when migrating from a local audience to an international English-speaking readership. This confirms previous observations showing that publishing in the English language in core-publishing journals confers a different value to academic knowledge than when it is published in the “local” language (Hanafi, 2011; Keim, 2016). Our case of French-speaking social sciences in a large University might convey a general pattern, but we cannot generalize to all social sciences in France, or in French-speaking countries, nor to other languages that are even more minority languages. To our knowledge, beyond observing that social sciences publish more heavily in their “national” languages than other scientific domains, little research has been done on this fundamental issue of knowledge circulation in different languages and its impact on thematic choices (Gordin, 2015; Kulczycki et al., 2018, 2020; Ortiz, 2009).

Additionally, we suggest an empirical method that allows observing this circulation process from conceptualization in the local context in French and further dissemination through English-speaking publications. A strong interaction along the analytical process is needed to obtain fine-grained results and build query lines in the data-mining process. The interaction with researchers to build strategic-oriented query lines for information retrieval has shown a promising potential to map rising strategic research landscapes. Nevertheless, incorporating feedback can be complicated in the subsequent stages of the analysis. Since informants may not be fully aware of these language-processing techniques, they might have difficulties providing precise feedback when examining a resulting semantic map.

Lastly, the local perspective in our analysis allowed us to link the scientific debates at the national level with those appearing in English-speaking journals. It has been possible in the case of France, since local data were available, and because French social sciences also circulate beyond the national boundaries (see, e.g., ‘French theory’ in the USA). But as national repositories and databases are growing everywhere (see the databases such as Latindex in Latin America, or the new repositories in Asia, China, Russia, India), studying differences between locally published research in non-English speaking contexts and English-speaking international authors will be feasible and will probably reveal specific determinants that go beyond the need to diffuse more widely research results.