Each network builder in bibnets
(author_network(), keyword_network(),
reference_network(), document_network(),
source_network(), country_network(),
institution_network(), conetwork()) requires a
data frame with a fixed set of columns and a small number of
list-columns holding multi-valued fields. The readers convert
source-specific exports (Scopus CSV, Web of Science plaintext, OpenAlex
flat or nested, Dimensions, Lens.org, BibTeX, RIS, Crossref) into this
common representation.
The standard schema returned by every reader is:
| Column | Type | Meaning |
|---|---|---|
| id | chr | Document identifier (EID, OpenAlex W-ID, DOI, etc.) |
| title | chr | Document title |
| year | int | Publication year |
| journal | chr | Source / journal / venue name |
| doi | chr | DOI without the https://doi.org/ prefix |
| cited_by_count | int | Citations received (as reported by source) |
| abstract | chr | Abstract text; NA for sources that do not expose it |
| type | chr | Document type (article, review, book-chapter, …) |
| authors | list | Character vector of author names per row |
| references | list | Character vector of cited references per row |
| keywords | list | Character vector of keywords per row |

Source-specific extras (e.g. index_keywords,
keywords_plus, affiliations,
countries, language) follow the standard
columns. The contract that downstream functions such as
build_bipartite() and author_network() rely on
is that each multi-valued field is stored as a list-column whose
elements are character vectors.
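
The list-column contract can be checked in a few lines of base R before a frame is handed to a network builder. The sketch below is illustrative only: check_list_columns() is a hypothetical helper, not a bibnets function, and it assumes the three standard list-column names from the table above.

```r
# Hypothetical helper (not part of bibnets): reports, per column, whether
# it is a list-column whose elements are all character vectors.
check_list_columns <- function(df, cols = c("authors", "references", "keywords")) {
  present <- intersect(cols, names(df))
  vapply(present, function(col) {
    x <- df[[col]]
    is.list(x) && all(vapply(x, is.character, logical(1)))
  }, logical(1))
}

df <- data.frame(id = c("p1", "p2"), stringsAsFactors = FALSE)
df$authors  <- list(c("ALICE", "BOB"), character(0))  # valid list-column
df$keywords <- c("graph; network", "embedding")       # plain character column

res <- check_list_columns(df)
res  # authors is TRUE, keywords is FALSE
```

A FALSE entry usually means the field is still a delimited string and needs to be split before use.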
This vignette covers all nine readers, the read_biblio()
entry point, the generic-CSV path, the split_field()
helper, and the manual construction of bibnets-compatible data
frames.
read_biblio()

read_biblio() accepts a single file, a vector of file
paths, or a directory. When format = "auto" (the default)
it detects the format from the contents of the file:
data <- read_biblio("export.csv") # auto-detect format
data <- read_biblio("scopus_dir/") # entire directory, rbind'd
data <- read_biblio(c("a.csv", "b.csv")) # multiple files, rbind'd
data <- read_biblio("file.csv", format = "scopus") # force a format

When given a directory, read_biblio() collects every
.csv, .txt, .bib,
.ris, .xls, and .xlsx file in it,
reads each one, and combines the results with rbind(). For
more than one file a summary message is emitted:
Read 3 files: 1247 rows total
Format detection is performed on the first non-empty line of the file:
- A line starting with @ is read as BibTeX.
- A line starting with TY - is read as RIS.
- A line starting with FN or PT is read as Web of Science plaintext.
- Otherwise the file is treated as CSV. If the first line is the Dimensions preamble ("About the data: ..."), line 2 is used instead. Header tokens determine the format: eid for Scopus, lens id for Lens.org, publication id or dimensions url for Dimensions, authorships.author.display_name for the OpenAlex flat CSV.

If detection fails, read_biblio() raises an error that
lists the supported formats and indicates how to pass
format explicitly or use format = "generic"
with actors.
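
Signature-based detection of this kind can be sketched in base R. The code below is not the read_biblio() implementation: sniff_format() and the format labels it returns (other than "scopus") are hypothetical, and the mapping of @, TY - and FN/PT to BibTeX, RIS and WoS plaintext is an assumption based on the usual conventions of those formats.

```r
# Hypothetical sketch of first-line format sniffing; the order of checks
# and the returned labels are assumptions, not the package's behaviour.
sniff_format <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]            # drop empty lines
  if (length(lines) == 0) stop("file is empty")
  first <- lines[1]
  if (startsWith(first, "@")) return("bibtex")
  if (grepl("^TY\\s+-", first, perl = TRUE)) return("ris")
  if (grepl("^(FN|PT)\\s", first, perl = TRUE)) return("wos")
  # Dimensions preamble: the column header sits on the next line
  if (grepl("About the data:", first, fixed = TRUE) && length(lines) > 1) {
    first <- lines[2]
  }
  header <- tolower(first)
  if (grepl("authorships.author.display_name", header, fixed = TRUE)) return("openalex_csv")
  if (grepl("lens id", header, fixed = TRUE)) return("lens")
  if (grepl("publication id", header, fixed = TRUE) ||
      grepl("dimensions url", header, fixed = TRUE)) return("dimensions")
  if (grepl("\\beid\\b", header, perl = TRUE)) return("scopus")
  stop("Could not detect file format")
}

sniff_format(c("", "@article{key,"))
sniff_format(c("\"About the data: ...\"", "Rank,Publication ID,DOI"))
sniff_format("Authors,Title,Year,EID")
```

Checking the most specific tokens first keeps a generic token such as eid from shadowing a more distinctive header.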
Two readers are not dispatched by read_biblio():
- read_openalex() accepts an in-memory tibble from openalexR::oa_fetch(), not a file path.
- read_crossref() accepts the data element of rcrossref::cr_works().

Both take R objects rather than files and are called directly.
The package includes a 30-row OpenAlex flat CSV at
inst/extdata/openalex_works.csv, corresponding to the
export produced by downloading “Works” results from the OpenAlex web
interface. Multi-valued fields use | as the delimiter.
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
oa <- read_openalex_csv(f)
str(oa, max.level = 1)
#> 'data.frame': 30 obs. of 13 variables:
#> $ id : chr "W2769342982" "W2264893711" "W2612059685" "W3118164373" ...
#> $ title : chr "Open University Learning Analytics dataset" "Educational Data Mining and Learning Analytics in Programming" "Predicting Student Performance using Advanced Learning Analytics" "Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review" ...
#> $ year : int 2017 2015 2017 2020 2022 2016 2020 2024 2016 2020 ...
#> $ journal : chr "Scientific Data" "" "" "Applied Sciences" ...
#> $ doi : chr "10.1038/sdata.2017.171" "10.1145/2858796.2858798" "10.1145/3041021.3054164" "10.3390/app11010237" ...
#> $ cited_by_count: int 432 312 235 417 247 163 122 133 131 177 ...
#> $ abstract : chr NA NA NA NA ...
#> $ type : chr "article" "article" "article" "article" ...
#> $ authors :List of 30
#> $ references :List of 30
#> $ keywords :List of 30
#> $ affiliations :List of 30
#> $ countries :List of 30
head(oa[, c("id", "title", "year", "journal", "type")], 5)
#> id
#> 1 W2769342982
#> 2 W2264893711
#> 3 W2612059685
#> 4 W3118164373
#> 5 W4300484403
#> title
#> 1 Open University Learning Analytics dataset
#> 2 Educational Data Mining and Learning Analytics in Programming
#> 3 Predicting Student Performance using Advanced Learning Analytics
#> 4 Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review
#> 5 Artificial Intelligence and Learning Analytics in Teacher Education: A Systematic Review
#> year journal type
#> 1 2017 Scientific Data article
#> 2 2015 article
#> 3 2017 article
#> 4 2020 Applied Sciences article
#> 5 2022 Education Sciences review

The list-columns:
oa$authors[[1]]
#> [1] "Jakub Kužílek" "Martin Hlosta" "Zdeněk Zdráhal"
oa$affiliations[[1]]
#> [1] "The Open University"
#> [2] "Czech Technical University in Prague"
#> [3] "The Open University"
#> [4] "The Open University"
#> [5] "Czech Technical University in Prague"
oa$countries[[1]]
#> [1] "CZ" "GB" "GB" "CZ" "GB"

Two limitations of the flat export should be noted:
all(vapply(oa$references, function(x) length(x) == 0 || all(is.na(x)), logical(1)))
#> [1] TRUE
all(is.na(oa$abstract))
#> [1] TRUE

references is empty and abstract is
NA because these fields are not included in the OpenAlex
web download. Cited references and abstracts from OpenAlex are obtained
via openalexR::oa_fetch() and processed with
read_openalex() (described in section 6).
The remaining fields support several network constructions that do not require references: co-authorship, country, institution, keyword, source, and document networks.
read_scopus() ingests the standard Scopus CSV export
(File -> Export -> CSV from the Scopus search UI).
Mappings from Scopus columns to the bibnets schema:
| Scopus column | Standard column |
|---|---|
| EID (or Article No.) | id |
| Title | title |
| Year | year |
| Source title | journal |
| DOI | doi (prefix stripped) |
| Cited by | cited_by_count |
| Abstract | abstract |
| Document Type | type |
| Authors (;-delimited) | authors (list) |
| References (;-delimited) | references (list) |
| Author Keywords (;-delimited) | keywords (list) |
| Index Keywords (;-delimited) | index_keywords (list, extra) |
| Affiliations (;-delimited) | affiliations (list, extra) |
| Language of Original Document | language (extra) |

Scopus stores all cited references of a document as one semicolon-delimited
string in a single cell. read_scopus() splits on ;
and applies standardize_refs() to each entry: uppercasing,
whitespace normalisation, and removal of a trailing DOI where present.
References differing only in case or trailing DOI then resolve to the
same node in co-citation and reference networks.
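
The effect of this normalisation can be illustrated with a base R stand-in. This is not the source of standardize_refs(): normalise_ref() is a hypothetical name, and the trailing-DOI pattern is an assumption about what such a step might match.

```r
# Illustrative stand-in for the normalisation described above; the DOI
# regular expression is an assumption, not the package's actual pattern.
normalise_ref <- function(x) {
  x <- toupper(x)
  x <- gsub("\\s+", " ", trimws(x), perl = TRUE)   # collapse whitespace
  # drop a trailing DOI such as ", DOI: 10.1000/ABC"
  x <- sub("[,;]?\\s*(DOI:?\\s*|HTTPS?://(DX\\.)?DOI\\.ORG/)?10\\.[0-9]{4,9}/\\S+$",
           "", x, perl = TRUE)
  trimws(x)
}

# One Scopus-style cell holding two spellings of the same reference
cell <- "Smith J., Graph methods, (2019), doi: 10.1000/abc; SMITH J.,  Graph Methods, (2019)"
refs <- normalise_ref(strsplit(cell, ";", fixed = TRUE)[[1]])
unique(refs)
#> [1] "SMITH J., GRAPH METHODS, (2019)"
```

Both spellings collapse to one label, which is what lets them count as a single node.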
WoS exports come in two shapes:
wos1 <- read_wos("savedrecs.txt") # plaintext (default)
wos2 <- read_wos("savedrecs.tsv", format = "tab") # tab-delimited

The plaintext format is a tagged record syntax. Each record begins
with a PT (publication type) tag and ends with
ER (end record). Within the record, every field is
introduced by a 2-letter tag at the start of a line, with continuation
lines indented:
| Tag | Field |
|---|---|
| AU | Authors (one per line) |
| TI | Title |
| SO | Source / journal |
| PY | Year |
| DI | DOI |
| TC | Times cited |
| AB | Abstract |
| DT | Document type |
| DE | Author keywords |
| ID | Keywords plus (extra: keywords_plus) |
| CR | Cited references (one per line) |

read_wos() walks the file, splitting on ER
boundaries, and emits one row per record. The tab-delimited variant
carries the same fields in a flat CSV-like grid. Either way the output
schema is identical.
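
The tagged-record walk can be sketched in base R as follows. This is a simplified illustration, not the read_wos() source: parse_tagged() is a hypothetical function, it keeps only a few tags, and it stores every continuation line as a separate element (multi-line titles or abstracts would need to be pasted together instead).

```r
# Minimal sketch of a tagged-record parser (hypothetical, simplified).
parse_tagged <- function(lines, keep = c("AU", "TI", "PY", "CR")) {
  records <- list()
  current <- list()
  tag <- NULL
  for (ln in lines) {
    if (!nzchar(trimws(ln))) next                   # skip blank lines
    head2 <- substr(ln, 1, 2)
    if (head2 == "ER") {                            # end of record: flush
      records[[length(records) + 1]] <- current
      current <- list()
      tag <- NULL
    } else if (grepl("^[A-Z][A-Z0-9] ", ln)) {      # new two-letter tag
      tag <- head2
      if (tag %in% keep) current[[tag]] <- trimws(substring(ln, 4))
    } else if (!is.null(tag) && tag %in% keep) {    # indented continuation
      current[[tag]] <- c(current[[tag]], trimws(ln))
    }
  }
  records
}

lines <- c(
  "FN Clarivate Analytics Web of Science",
  "PT J",
  "AU Smith, J",
  "   Lee, K",
  "TI Graph methods",
  "PY 2019",
  "CR Doe A, 2010, J GRAPH",
  "   Roe B, 2012, NETW SCI",
  "ER",
  "",
  "EF"
)
recs <- parse_tagged(lines)
recs[[1]]$AU
#> [1] "Smith, J" "Lee, K"
```

Each element of recs would then become one row, with AU and CR feeding the authors and references list-columns.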
OpenAlex ships data through two routes that bibnets supports separately.
openalexR

This path is used when references and abstracts are required.
openalexR::oa_fetch() returns a nested tibble with
author, referenced_works,
concepts, and keywords list-columns;
read_openalex() converts it to the standard schema:
library(openalexR)
raw <- oa_fetch(entity = "works", search = "learning analytics", per_page = 200)
data <- read_openalex(raw)

References are returned as OpenAlex Work IDs
(e.g. W2769342982) rather than formatted citation strings.
The IDs are stable identifiers suitable for co-citation and
direct-citation networks; visualisations that need human-readable labels
can join the IDs back to titles in a separate step.
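
One way to do that join is a match() against the id column. The sketch below uses toy data in place of a real read_openalex() result; the edge list and its column names (from, to) are assumptions for illustration.

```r
# Toy frame standing in for read_openalex() output (standard schema subset)
works <- data.frame(
  id    = c("W1", "W2", "W3"),
  title = c("Paper one", "Paper two", "Paper three"),
  stringsAsFactors = FALSE
)

# Hypothetical edge list keyed by Work ID; "W9" is cited but not in the corpus
edges <- data.frame(
  from = c("W2", "W2"),
  to   = c("W3", "W9"),
  stringsAsFactors = FALSE
)

# Look up titles by ID; IDs outside the corpus get NA labels
edges$from_label <- works$title[match(edges$from, works$id)]
edges$to_label   <- works$title[match(edges$to, works$id)]
edges
```

Cited works that are not themselves rows of the corpus have no title available locally, so their labels stay NA unless they are fetched separately.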
The read_openalex_csv() reader, demonstrated in section
3, applies to the file format produced by the OpenAlex web interface.
References and abstracts are not present in this format.
The Dimensions CSV begins with a metadata row of the form
"About the data: This export was generated on YYYY-MM-DD ..."
before the column header. read_dimensions() detects this
preamble and skips it. If the line has been removed (for example, by
manual editing of the file), the reader continues to function because it
identifies the column row by the Dimensions header tokens
Publication ID and Dimensions URL.
Extras returned: affiliations and countries
as list-columns, analogous to the OpenAlex schema.
Key Lens columns and how they map:
| Lens column | Standard column |
|---|---|
| Lens ID | id |
| Title | title |
| Publication Year | year |
| Source Title | journal |
| DOI | doi |
| Cited by Count | cited_by_count |
| Abstract | abstract |
| Publication Type | type |
| Author/s | authors (list) |
| Reference Identifiers | references (list) |
| Keywords | keywords (list) |

read_bibtex() parses
@type{key, field = {value}, ...} blocks.
read_ris() parses tagged TY - ... ER -
blocks; the structure is equivalent to WoS plaintext, but with a
different tag dictionary.
Standard BibTeX and RIS do not contain cited-reference data, so the
references column in the resulting data frame is empty on
every row. These formats are sufficient for co-authorship and keyword
co-occurrence networks. For co-citation, coupling, or direct citation
networks, the appropriate sources are Scopus, Web of Science, OpenAlex
(via oa_fetch()), Dimensions, Lens, or Crossref.
library(rcrossref)
raw <- cr_works(query = "graph neural networks", limit = 100)
data <- read_crossref(raw$data)

read_crossref() accepts the data element of
the cr_works() result (a data frame, not the wrapping
list). The function handles the two field-naming variants Crossref
returns (container.title vs container-title;
is.referenced.by.count vs
is-referenced-by-count) and maps both to the standard
schema.
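
Handling two naming variants amounts to picking the first candidate column that exists. A minimal sketch, assuming a hypothetical helper pick_col() and a toy one-row frame; this is not how read_crossref() is necessarily written.

```r
# Hypothetical helper: return the first candidate column present in df,
# or a vector of NA when none of the candidates exists.
pick_col <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) return(rep(NA_character_, nrow(df)))
  df[[hit[1]]]
}

# Toy frame using the hyphenated variant of the field names
raw <- data.frame(
  `container-title`        = "J Graphs",
  `is-referenced-by-count` = 12L,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

journal <- pick_col(raw, c("container.title", "container-title"))
cited   <- as.integer(pick_col(raw, c("is.referenced.by.count",
                                      "is-referenced-by-count")))
```

The same call works unchanged on a frame that uses the dotted names.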
read_biblio(format = "generic", ...)

For CSV files that do not match any of the recognised signatures
(in-house exports, custom dumps, public datasets), the generic path
provides explicit column-name mapping. The identifier column is named
via id; columns to be treated as list-columns are named via
actors. sep is the delimiter applied inside
those cells.
Hypothetical call:
data <- read_biblio(
"my_data.csv",
format = "generic",
id = "doc_id",
actors = c("Authors", "Keywords"),
sep = ";"
)

Demonstrated on the bundled OpenAlex CSV (which uses |
as the delimiter):
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
generic <- read_biblio(
f,
format = "generic",
id = "id",
actors = c("authorships.author.display_name", "primary_topic.display_name"),
sep = "|"
)
names(generic)[1:6]
#> [1] "id" "apc_list.value"
#> [3] "apc_paid.value" "authorships.author.display_name"
#> [5] "authorships.author.id" "authorships.author.orcid"
generic$authorships.author.display_name[[1]]
#> [1] "Jakub Kužílek" "Martin Hlosta" "Zdeněk Zdráhal"

The named id column is copied to a top-level
id. Each column listed in actors is split on
sep and stored as a list-column. Other columns are retained
unchanged. The resulting frame is therefore not in the standard
schema; it is a wider source-specific table. Network constructors can
either be pointed at the relevant columns directly, or the frame can be
post-processed into the standard schema.
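
Post-processing into the standard schema is mostly a matter of renaming and filling in the missing list-columns. The sketch below runs on a toy frame shaped like the generic output; the scalar column names title and publication_year, and the choice to use primary_topic.display_name as keywords, are assumptions made for illustration.

```r
# Toy frame shaped like generic output (scalar column names are assumed)
generic <- data.frame(
  id               = c("W1", "W2"),
  title            = c("Paper one", "Paper two"),
  publication_year = c(2020L, 2021L),
  stringsAsFactors = FALSE
)
generic$authorships.author.display_name <- list(c("A. One", "B. Two"), "C. Three")
generic$primary_topic.display_name      <- list("Learning analytics", "Data mining")

# Rebuild under the standard column names
std <- data.frame(
  id    = generic$id,
  title = generic$title,
  year  = as.integer(generic$publication_year),
  stringsAsFactors = FALSE
)
std$authors    <- generic$authorships.author.display_name
std$keywords   <- generic$primary_topic.display_name
std$references <- rep(list(character(0)), nrow(std))  # none in this source

str(std, max.level = 1)
```

Filling references with empty character vectors keeps the list-column contract intact even when the source carries no citations.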
When data does not come from any of the supported sources, a bibnets-compatible data frame can be constructed directly. The requirement is: standard scalar columns are character or integer; multi-valued fields are list-columns whose elements are character vectors.
df <- data.frame(
id = c("p1", "p2", "p3"),
title = c("Paper A", "Paper B", "Paper C"),
year = c(2020L, 2021L, 2022L),
stringsAsFactors = FALSE
)
df$authors <- list(
c("ALICE", "BOB"),
c("BOB", "CAROL"),
c("ALICE", "CAROL", "DAVE")
)
df$references <- list(
c("R1", "R2"),
c("R1", "R3"),
c("R2", "R3", "R4")
)
df$keywords <- list(
c("graph", "network"),
c("network", "embedding"),
c("graph", "embedding", "neural")
)
author_network(df, "collaboration")
#> # bibnets network: author_collaboration | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 1 1
#> 2 ALICE CAROL 1 1
#> 3 BOB CAROL 1 1
#> 4 ALICE DAVE 1 1
#> 5 CAROL DAVE 1 1
keyword_network(df)
#> # bibnets network: keyword_co_occurrence | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 EMBEDDING GRAPH 1 1
#> 2 EMBEDDING NETWORK 1 1
#> 3 GRAPH NETWORK 1 1
#> 4 EMBEDDING NEURAL 1 1
#> 5 GRAPH NEURAL 1 1
reference_network(df)
#> # bibnets network: reference_co_citation | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 R1 R2 1 1
#> 2 R1 R3 1 1
#> 3 R2 R3 1 1
#> 4 R2 R4 1 1
#> 5 R3 R4 1 1

build_bipartite() applies
toupper(trimws(...)) to every entity label before
constructing the sparse matrix, so "graph",
"Graph", and "GRAPH" are mapped to the same
node "GRAPH". Tests or comparisons that reference node
names should use uppercase strings.
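
The same normalisation can be reproduced in base R to predict node names, or applied up front to a manually built list-column. A short sketch on toy labels:

```r
# Three spellings that differ only in case and surrounding whitespace
labels <- c("graph", " Graph", "GRAPH ")
unique(toupper(trimws(labels)))
#> [1] "GRAPH"

# Applying the same step to every element of a list-column
kw <- list(c("graph", " Network"), c("network ", "Embedding"))
kw_norm <- lapply(kw, function(x) toupper(trimws(x)))
kw_norm[[2]]
#> [1] "NETWORK"   "EMBEDDING"
```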
split_field() helper

split_field() converts a character column with
semicolon-delimited (or otherwise delimited) values into a list-column
without going through read_biblio(format = "generic"):
split_field(c("Alice; Bob; Carol", "Dave; Eve"))
#> [[1]]
#> [1] "Alice" "Bob" "Carol"
#>
#> [[2]]
#> [1] "Dave" "Eve"
split_field(c("a|b|c", "d|e"), sep = "|")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "d" "e"

This is the same operation that read_scopus() and the
other readers apply internally to multi-valued columns; it is exported
for use in custom pipelines.
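
For readers who want to see what such a split involves, here is a rough base R equivalent. It is a sketch, not the split_field() source: split_sketch() is a hypothetical name, and the treatment of NA and empty cells (both become empty character vectors) is an assumption.

```r
# Rough base R equivalent of a delimiter split (hypothetical, simplified)
split_sketch <- function(x, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)   # fixed = TRUE: "|" is literal
  lapply(parts, function(p) {
    p <- trimws(p)
    p[!is.na(p) & nzchar(p)]                # drop NA and empty pieces
  })
}

out <- split_sketch(c("Alice; Bob; Carol", NA, ""))
out[[1]]
#> [1] "Alice" "Bob"   "Carol"
lengths(out)
#> [1] 3 0 0

split_sketch("a|b|c", sep = "|")[[1]]
#> [1] "a" "b" "c"
```

Matching the delimiter literally matters for | in particular, since it is a regular-expression metacharacter.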
Different readers expose different extras: WoS provides
keywords_plus, Scopus provides index_keywords,
OpenAlex provides countries. To combine sources, restrict
each frame to the standard columns and bind:
common <- c("id", "title", "year", "journal", "doi", "cited_by_count",
"abstract", "type", "authors", "references", "keywords")
data(biblio_data)
b1 <- biblio_data
b2 <- biblio_data
b2$id <- paste0(b2$id, "_dup")
cols <- intersect(common, names(b1))
combined <- rbind(b1[, cols], b2[, cols])
nrow(combined)
#> [1] 20

Two practical notes:
- Source-specific extras (e.g. keywords_plus) should be retained on the per-source frame and merged selectively rather than coerced into the combined frame.

After reading, basic checks on the list-column sizes and the scalar columns help detect silent corruption. Empty list-columns and out-of-range years are common indicators that an export is incomplete.
data(scopus_quantum_cloud)
sc <- scopus_quantum_cloud
range(lengths(sc$authors))
#> [1] 0 40
range(lengths(sc$references))
#> [1] 0 245
range(lengths(sc$keywords))
#> [1] 0 20
head(sort(table(sc$journal), decreasing = TRUE), 5)
#>
#> IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
#> 24
#> IEEE Transactions on Circuits and Systems I: Regular Papers
#> 20
#> IEEE Access
#> 18
#> IEEE Transactions on Very Large Scale Integration (VLSI) Systems
#> 14
#> IEEE Internet of Things Journal
#> 12
range(sc$year, na.rm = TRUE)
#> [1] 2020 2025
table(sc$type)
#>
#> Article Book Book chapter Conference paper
#> 279 1 15 191
#> Conference review Review
#> 3 10

Indicators to check:
- A lengths() of 0 on every row of references for a Scopus or WoS file indicates that the export did not include the references column. Re-export from the source with the references field selected.
- A value of 0 or NA indicates an empty source field.
- A single document type (e.g. "article") is expected for filtered searches; broader mixes are expected for thematic searches.

| Symptom | Cause | Fix |
|---|---|---|
| Could not detect file format | First line doesn’t match any signature | Pass format = "scopus" (etc.) explicitly, or use format = "generic" with actors |
| Empty references list on every row | BibTeX/RIS or OpenAlex flat CSV; these don’t carry citations | Use Scopus/WoS, OpenAlex via oa_fetch(), Dimensions, Lens, or Crossref |
| Invalid multibyte string on read | Wrong encoding | Most readers accept encoding = "latin1"; pass it through read_biblio(..., encoding = "latin1") |
| Author names look like LASTNAME, F.J. not FJ LASTNAME | Default is flip_names = FALSE | The reader returns names as-is from the source. Cluster them by string match downstream, or pass flip_names = TRUE if all names follow Last, First |
| Dimensions file silently fails | “About the data” preamble removed and column header edited | read_dimensions() detects the standard preamble and falls back to header-token detection; the failure mode requires the column header itself to have been edited |
| Co-authorship network contains duplicate nodes (e.g. "Alice" and "ALICE") | Mixed casing in the source | The standard readers and build_bipartite() apply toupper(trimws(...)) to entity labels. Manually constructed frames should apply the same normalisation |

- vignette("bibnets") covers network construction on the in-package datasets.
- The network builders (author_network(), keyword_network(), reference_network(), document_network(), source_network(), country_network(), institution_network(), conetwork()) accept the same set of arguments (type, counting, similarity, threshold, top_n, format), so switching between network types on data already in the standard schema requires only a function-name change.