We are fresh from coffee and Philip Hunter has just introduced our first speaker in this session, who is joining us live via Skype:
Thomas Krichel (Long Island University) – AuthorClaim (via Skype)
My co-author here is Wolfram (?), the Chief Information Officer for Scholarly Information at Bielefeld University; Bielefeld has run the BASE search engine since 2004. BASE is not really attached to any one funded project but is a long-run concern. I too am interested in running things over the long term. I run RePEc and have been involved in repositories since the early 1990s.
The motivation is to make (economics) papers freely available – the full text of those papers – to make information about the papers freely available, and to have self-sustaining infrastructure for these materials.
RePEc is misunderstood as a repository; actually it is a collection of around 1300 institutional (subject) repositories from libraries and research centres with specialist collections. It predates OAI, it has a reduced business model, and it is more tightly interoperable. There are lots of reasons for its success. The business model is decentralised as much as possible, it runs on volunteer power, and RePEc encourages the reuse of RePEc data – we aggressively push out the data we have collected, as we think this is in the best interests of those who have set up these repositories.
The RePEc technical case:
RePEc registers authors with the RePEc Author Service (RAS). We register institutions. And we provide evaluative data for authors and institutions. So what is the relationship with repositories? Well, it's a bibliographic layer over repositories. IRs can/will benefit from a similar layer around them – a free bibliographic layer that places the IR in the wider context. The requirement for such a layer is that it is not dependent on external funding, it is freely reusable instantaneously, and it must be there for the long run.
A RePEc for all disciplines:
- RePEc bibliographic data -> 3lib
- RePEc Author Service -> AuthorClaim
- EDIRC -> ARIW – I won’t talk about this, it’s a topic for another day.
3lib is an initial attempt at building an aggregate of freely available bibliographic data, a project by OLS sponsored by the Open Knowledge Foundation. The data elements are very simple, as it is designed to avoid copyright issues and to serve primarily for author claiming: title; author name expressions; link to item page on provider site; identifier. 3lib is meant to serve AuthorClaim.
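To make the shape of those data elements concrete, here is a minimal sketch of what a 3lib-style record might look like in code. The field names and example values are my own assumptions for illustration, not the actual 3lib schema.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch of a 3lib-style record, based on the data elements listed
# in the talk (title, author name expressions, link, identifier).
# Field names are assumptions, not the actual 3lib schema.
@dataclass
class ThreeLibRecord:
    identifier: str                                          # provider-scoped identifier
    title: str                                               # item title
    author_names: List[str] = field(default_factory=list)    # name expressions as given by the provider
    link: str = ""                                           # link to the item page on the provider site

# hypothetical example record
example = ThreeLibRecord(
    identifier="oai:repository.example.org:1234",
    title="An Example Working Paper",
    author_names=["Krichel, Thomas", "T. Krichel"],
    link="https://repository.example.org/1234",
)
```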
AuthorClaim is an authorship claiming service for 3lib data – http://authorclaim.org/. I started the first author claiming system, for RePEc, in 1999. I set up the system and it was written by Markus J. R. Klink. Author claiming is not the same thing as author identification; the difference is "Klink's Problem". The actual AuthorClaim data is CC0 licensed and available as XML for reuse. The data on refused papers helps the system to build learning models for author names.
IRs and author identification: author identification is generally too large a task to perform for IRs, but IRs are too small to make it meaningful for authors to claim papers in them directly; usually only registration of contributors is required. ORCID offers possibilities here, but doing it for each publisher isn't perfect. AuthorClaim lets you put all papers by an author together, and the task can be completely automated once an AuthorClaim record claims a paper in the IR. You need an incentive for people to actually claim their papers in the first place.
We have formed a partnership with BASE as they already have a centralised collection and can deliver the AuthorClaim data. They constantly monitor the OAI-PMH world, they normalise the data, and they provide an API – REST, SOAP and rsync – for AuthorClaim. The BASE data used in AuthorClaim is restricted to records which include author, title, link and identifier. AuthorClaim discards some IRs that contain student work, digitised old material, link collections, or primary research data – though in principle it could be extended to data etc. There are also some minor manual exclusions (e.g. UK PubMed Central, as it is already in PubMed).
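A minimal sketch of the selection rule just described: keep only records that carry author, title, link and identifier, and drop whole repositories that are out of scope. The record structure, field names and exclusion list are illustrative assumptions, not the actual AuthorClaim pipeline.

```python
# Selection rule sketch: required fields plus repository-level exclusions.
REQUIRED_FIELDS = ("author", "title", "link", "identifier")
EXCLUDED_REPOSITORIES = {"UK PubMed Central"}   # e.g. already covered via PubMed

def select_for_authorclaim(records):
    """Yield only the records usable for author claiming."""
    for rec in records:
        if rec.get("repository") in EXCLUDED_REPOSITORIES:
            continue
        if all(rec.get(f) for f in REQUIRED_FIELDS):
            yield rec

# usage with hypothetical data:
records = [
    {"repository": "Example IR", "author": "Doe, J.", "title": "A Paper",
     "link": "https://ir.example.org/1", "identifier": "oai:ir.example.org:1"},
    {"repository": "Example IR", "title": "No author, so this one is dropped"},
]
print(list(select_for_authorclaim(records)))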
So far there are 1930 repositories and about 12 million records. About 534 records have been claimed in the system. Documentation is at: http://wotan.liu.edu/base/ – beware that this needs a little debugging. The collection is not yet announced because it is still being read through – some more time is needed.
For more information contact myself (krichel@openlib.org) or Wolfram (whorstma@uni-bielefeld.de).
Q&A
Q1 – Peter Murray-Rust) I congratulate you on what you've done. The key thing for repositories is to create this bibliographic overlay. It's impossible to search repositories in the UK at present. Have all 1900 repositories been done by you – the analysis of the API etc. – or have you farmed this out to volunteers?
A1) I'm not providing search services for repositories. I am working on a search service for authors – a project called Author Profile (I spoke about this in Boston in June) – searching for authors and bringing their work together. I'm not doing searching at this time. We do have Google, but we need elements in repositories to be more available to search engines. PageRank requires a more linked world – we need to bring in more links to items in repositories – an author profile used elsewhere will create in-bound links to the repository. These links will help the document to rise up in the search engine. So I'm not doing search particularly at this time. But we all need to work on different things. I'll probably be doing this until the end of my working life, but others will be working on search. We just all need to work together.
And with that Thomas is off to the (Siberian) beach! Next up…
Mo McRoberts (BBC Data Analyst) – BBC Digital Public Space project
I work on the BBC’s Digital Public Space Project. Three things you should know about the BBC:
- We like to do things big
- We like things where we have no idea what will come out of them
- We like silly names!
We are looking at ways to make the best use of the BBC archive. And we are trying to find out how we should fit into the digital world. Last year we published the "Putting Quality First" BBC Strategy Review (http://bbc.in/strategyreview). That review said we should open the archive to the public and work with the British Library, BFI and Arts Council England to bring other public archives to a wider audience. My job is to see if this is technically possible and then how it could be done. This review went to the Trust last year and the BBC Trust has approved the move to make the archive open to the public. So we have to do it, but we don't know when and we don't know how – hence this project.
The BBC Archive has 2.3m hours of film and video, 300k hours of audio, 4m photographs, 20k rolls of microfilm – it took us 2 years to find out the scale of it! There is also sheet music and ridiculous amounts of other material. A bit of it is digitised – 206k digitised radio programmes, 23k digitised TV programmes – and there is an ongoing project to digitise it all – effectively a digital tape library. The underpinning mantra of this project is: how do we maximise the value of this stuff?
A lot of what we need to do here is important not only for the BBC but also for other archives of cultural heritage. Is YouTube part of the cultural heritage? That skateboarding cat might be a really important moment – but for now we are focusing on the well known institutions like the BFI, Kew, NLS, LLGC NLW, the National Archives, the National Maritime Museum, the British Library and the Royal Opera House. So we thought: why don't we link these collections together? We have been looking at how to make those journeys between materials work well. We don't have long term internal funding, but we are working in partnership, and if we can demonstrate the potential of working together then it could become something big and cool and useful. Right now it's a tiny little thing that we hope will become big in the future.
Right, now the technical bit!
All institutions contain catalogues of stuff best suited to archivists. Some point to physical assets, some to digital assets, some do not point to assets at all. We all deal with our data in very different ways. If we could express what we do in a common way – a way which allowed links between things and the assets, using a well known grammar – then we could probably do something quite interesting with that. So we are taking dumps from the participating institutions – there was no particular selection process, by the way: the ones who gave us data fast are in – and we are publishing RDF/XML on a private server for each institution. That data is pushed into a central aggregator. We make use of a single golden rule: "give everything a single permanent URI, and make the data about that thing accessible at that URI" (or rather, you give your assertions about things a URI).
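As a rough illustration of that golden rule, here is a minimal "follow your nose" sketch using rdflib: dereference a thing's URI, parse whatever RDF comes back, and read off the assertions about it. The URI is hypothetical, not one of the project's actual identifiers.

```python
from rdflib import Graph, URIRef

# Hypothetical permanent URI for a thing; dereferencing it should return RDF about it.
thing = URIRef("http://example.org/things/george-orwell")

g = Graph()
g.parse(str(thing))   # fetch and parse whatever RDF the server returns for this URI

# The assertions about the thing are simply the triples with it as subject.
for _, predicate, obj in g.triples((thing, None, None)):
    print(predicate, obj)
```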
The aggregation is evaluated via a straightforward logical process – are two things the same? – but there is also some heuristic stuff: we build a full text index to mine and evaluate new material against, and we use scoring of that evaluation to decide what is and is not the same. We also match the things to external sources – DBpedia Lite, GeoNames, Freebase etc. We create a stub object. We are opening the archive to normal people, so we rearrange the catalogues a bit as they come in: we break items into thing, person, place, event or collection. The stub object has a type (e.g. Person) and relationships to the things it's matched to (e.g. George Orwell). We deal with real world things rather than individual entries in the catalogue. We express relationships between stubs and source entities as skos:exactMatch (or non-exact matches). We also take any references and reflect them. We call these stub objects because they are just a reflection of the evaluation process. It's a hard design constraint that whatever data goes in, you should be able to get it out again verbatim. We don't need lots of data attached to the stub – just enough to do top level browsing and indexing – we leave everything else in the source objects, and then you can follow your nose, which is why we have cached this data. If internet connectivity was better we wouldn't even need to cache it.
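A toy sketch of the "are two things the same?" step: compare an incoming entity's label against candidate labels pulled from the index and keep matches that score above a threshold. The real aggregator's scoring is more sophisticated; the similarity measure, threshold and example data here are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(label, candidates, threshold=0.9):
    """Return candidate URIs whose label scores above the threshold."""
    return [uri for uri, cand_label in candidates
            if similarity(label, cand_label) >= threshold]

# hypothetical candidates from the full text index
candidates = [("http://example.org/stub/1", "George Orwell"),
              ("http://example.org/stub/2", "George Harrison")]
print(find_matches("George Orwel", candidates))   # a near-miss spelling still matches stub/1
```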
Exciting! An actual stub object for the Republic of Brazil. The key things are that it's a place, it has taken on the types of the source data, and it has some references to DBpedia Lite and some source data (BBC News On This Day – which I cheekily scraped!). And from that data we can build some interesting interfaces. Building a user interface on top of that data is a doddle. In order to get people to build stuff you need to be able to get them to browse that data, so we are building this for all resources. We are also building something called "Forage" – a search-driven debugging tool to see the raw data and the relationships. And then we have the Digital Public Space interface that we commissioned a firm to produce for us – we asked them to produce something a bit left of field. They have a lot of experience of video aggregation. You'd think at the BBC we'd have lots of AV material for all our entries. We will, but it's not that easy: getting anything internally is far harder than getting it externally from project partners. This will change over time, but things don't move quickly. So this interface combines our data with this company's existing video aggregation data.
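To make the stub idea concrete, here is a rough rdflib sketch along the lines of the Brazil example: a typed stub that carries very little data of its own and points, via skos:exactMatch, at the external and source records it was matched to. The namespace, type URI and match URIs are invented for illustration, not the project's real identifiers.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, SKOS

STUB = Namespace("http://example.org/stub/")   # hypothetical stub namespace
g = Graph()

brazil = STUB["republic-of-brazil"]
g.add((brazil, RDF.type, URIRef("http://example.org/types/Place")))   # type taken on from the source data
g.add((brazil, RDFS.label, Literal("Republic of Brazil")))
# references to the things it was matched to (URIs are hypothetical)
g.add((brazil, SKOS.exactMatch, URIRef("http://dbpedialite.org/things/12345")))
g.add((brazil, SKOS.exactMatch, URIRef("http://example.org/source/bbc-on-this-day/brazil")))

print(g.serialize(format="turtle"))
```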
There are a few hard constraints that we are trying to keep to. We want to maintain the provenance of everything, so that if the data is preserved but technology has changed massively, you will still be able to do useful things with it. So we are looking at things like digitally signing the source data as it comes in – challenging in RDF – and we want it to be open to all comers as a read and write database. Ultimately we want all partners to provide their own data and just link it together, but that's a way off for now.
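A very rough sketch of one way to fingerprint incoming source RDF so its provenance can be checked later: serialise to N-Triples, sort the lines for a stable ordering, and hash the result. This deliberately sidesteps the hard part the speaker alludes to (blank nodes make true RDF canonicalisation difficult), and a real implementation would use a proper canonicalisation algorithm plus an actual key-based signature rather than a bare digest.

```python
import hashlib
from rdflib import Graph

def fingerprint(graph: Graph) -> str:
    """Hash a naive canonical form of the graph (assumes no blank nodes)."""
    nt = graph.serialize(format="nt")
    canonical = "\n".join(sorted(line for line in nt.splitlines() if line.strip()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```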
http://bbc.in/dpsblog – a blog post here by me gives further information on the project
Q&A
Q1) This is huge and awesome. Is there any chance of open sourcing the code?
A1) Yes, we will be open sourcing the code but we need to get to the end of this project, and we have some paperwork to do. We would like to open it up to the academic community within about 18 months – an actual running version. All of the metadata should be fine but how many of the assets will be open we are not quite sure. We are trying to find the right frameworks. The code should be open source in a fairly short space of time. As the author of it I have to say it’s not about to set the world alight.
Q2) Perhaps an unfair question. You've brought to our attention that the BBC defines a phenomenon of "The Public Space" and the "national interest". This is a political move. In the sense that we are engaged in the same sort of activity, and in a public space rather than a private and owned space, how do our activities relate, and how do we start to recognise each other and work with each other…?
A2) It's a difficult one. The edges are always fuzzy. We are getting better at it as well. We have been talking to JISC and the OU in this project, also with the University of Westminster amongst others. We are not trying to draw a line in the sand about this being only arts and culture. We want this aimed at, and available to, academics for research purposes. The BBC as an institution – I work in the Archive (part of BBC Vision), I also work with R&D, and we like our research. We are very open to working with others. Perhaps the whole organisation doesn't share that view now, but it's getting there. There is no choice but to engage with as many different interests as possible – for good and for bad. The academic community is a big and significant part of that, though, and that will only get bigger over time.
Ben Ryan (University of Leeds) – Timescapes Project
Timescapes is an ESRC funded project, running for 5 years, looking at how family relationships change over time. I am the technical officer for Timescapes and I'll be talking about the Timescapes Next Generation Archive – but we don't have a second tranche of funding, so we will deliver a proof of concept by the end of the project in early 2012.
We have been working with a product called Digital which is hosted by Leeds University. This platform sees all files as digital objects and does not allow modelling of complex structures of information and their inter-relationships. You can't easily display connections and context around materials. We want to publish, archive and allow secondary research on data, and that has huge challenges. We have been looking for solutions for social science longitudinal data storage and delivery.
We chose Fedora as it has a Content Model Architecture, allowing the researcher to see connections in meaningful terms, and it allows multiple views onto the data. It allows the creation of content models – say we have an interview: is it anonymised? partly redacted? We have different levels of access to data, so we need a flexible model that enables that. We also need to link data objects. Fedora allows us to link concepts and to set up our own relationships. It is all based on RDF triplestores and that is hugely powerful.
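A small sketch of why the RDF triplestore matters here: once objects and their relationships are triples, views can be expressed as SPARQL queries rather than hard-coded joins. The predicates, URIs and access-level modelling below are invented for illustration, not Timescapes' actual relationship ontology.

```python
from rdflib import Graph, Namespace, URIRef

REL = Namespace("http://example.org/timescapes/relations#")   # hypothetical relationship vocabulary
g = Graph()

interview = URIRef("http://example.org/objects/interview-17")
case = URIRef("http://example.org/objects/case-fathers-03")
g.add((interview, REL.partOfCase, case))
g.add((interview, REL.hasAccessLevel, URIRef("http://example.org/access/anonymised")))

# "all anonymised interviews in this case", expressed as a query rather than code
results = g.query("""
    PREFIX rel: <http://example.org/timescapes/relations#>
    SELECT ?obj WHERE {
        ?obj rel:partOfCase <http://example.org/objects/case-fathers-03> ;
             rel:hasAccessLevel <http://example.org/access/anonymised> .
    }
""")
for row in results:
    print(row.obj)
```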
So our current archive shows the relationships between data on men as fathers (a particular study in the project), we can group material by interviewee, by waves of research, etc.
The services mentioned earlier are responsible for producing the views of relationships within the archive – these are built to suit the needs of the researcher. You can access whole groups of material or perhaps just go case by case – both depending on who the viewer/reader is. We have flexibility there that allows us to differentiate between "types" of social science data such as DDI or QuDEx. You can't just look at one object; we want to link internally, conceptually and thematically within the system.
SOLR is being used for searching and browsing – it's off the shelf and easy to set up. It will look for data objects that have any of the search terms in pre-configured DisMax metadata fields. We can set up custom searches really quickly for our researchers. We can also do advanced searches and get these up and running fast. I am the only resource on this project, so this has been a very fast way to build a nice system. We also use jQuery here. We have been using the MIT SIMILE tools for faceted browsing and searching as well.
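As a minimal sketch of the kind of DisMax query being described, the search terms are matched against a pre-configured list of metadata fields. The Solr URL, core name and field names are placeholders, not the project's actual configuration.

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/timescapes/select"   # hypothetical core

def search(terms, rows=10):
    """Run a DisMax query against a configured set of metadata fields."""
    params = {
        "q": terms,
        "defType": "dismax",
        "qf": "title^2 description interviewee keywords",   # hypothetical field list with a title boost
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get(SOLR_SELECT, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in search("fatherhood employment"):
    print(doc.get("title"))
```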
Another reason for Fedora is that it has XACML. It is crucial that we keep this data well protected, especially in its raw form. XACML lets us bring the policies from the repository right down to specific data objects. Fedora manages this, and that means we have good reliability and an audit trail around authentication and authorisation.
So the system is based on three sources: DDI, QuDEx and Timescapes' own schema. This is ingested, via an XSLT transform, into Fedora as METS. We then connect up multiple search and functionality elements, and a PHP web app sits on top.
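A bare-bones sketch of that ingest step: source XML (DDI, QuDEx or the Timescapes schema) is run through an XSLT stylesheet to produce METS for Fedora to ingest. The stylesheet and file names are placeholders; the actual stylesheets belong to the project.

```python
from lxml import etree

# Hypothetical stylesheet and source record names.
transform = etree.XSLT(etree.parse("ddi-to-mets.xsl"))
source = etree.parse("interview-17-ddi.xml")

# Apply the transform and write out the METS document ready for ingest.
mets_doc = transform(source)
with open("interview-17-mets.xml", "wb") as out:
    out.write(etree.tostring(mets_doc, pretty_print=True,
                             xml_declaration=True, encoding="UTF-8"))
```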
Q&A
Q1) Can you explain a bit about the benefits that you’ve seen – you described the subject, predicate, object model. Often people only find that useful when you combine data with lots of other systems. Presumably for your work you could have had a relational database instead – could you outline why this was useful? Is there an intention that the ESRC’s other projects might benefit from this?
A1) It was mainly because it was in Fedora. We could define our own topologies. We use the flexibility of the RDF to do our structural stuff. We could move into combining that with other data but we haven’t yet. We are working closely with UKDA about the use of these technologies, there are very close relationships and connections there.
Yvonne Howard (Southampton Univ.) – Campus ROAR
I work with Pat McSweeney and Andy Day at ECS at the University of Southampton. We were looking at learning materials and we looked at EdShare, HumBox, the Language Box etc. But we started thinking beyond these repositories, about scholarly discourse. Where does scholarly discourse take place? It was once about scholars in a big room where everyone knew what was new; it was easy to follow the discourse. That 19th century form didn't change much until the mid twentieth century, perhaps until the internet.
What is scholarly discourse now? It's websites, online journals, social media locations. It's not just a small group in a room but conferences all over the world. And yes, you get the article, but a lot of what happens is ephemeral. When people talk about their research at a conference, it's gone. When you see those slides, tweets and blogs, it disappears. It's not connected anymore, it's not all in that one room. One thing we know is that there is a lifecycle going on: us researchers get inspired, it's a dynamic process, and so is the research at the heart of that discourse. So how can we start to support that within a scholarly discourse idea?
Mostly we think of repositories as being about archiving, storing and keeping material safe and permanent. But what if we had a research repository that captured some of that discourse? We would want to archive not only the material itself but also the data, the discourse, the scholar and their presence. How do you showcase interesting research? Well, we can syndicate new research, we can showcase researchers. We want to make things engaging. So we host content as well as metadata, capturing discourse and commentary about it. And you have a community that highlights awareness. And we want to reuse what's going on in the Web 2.0 world. We have new formats in place here – iPads and iPhones etc. We are extending the concept of the web/RSS feed and providing engaging magazine-style products, based on content syndicated through RSS, Twitter etc.
But how do real repository users respond to the idea of using RSS? People see it as geeky. Take-up of RSS from teaching and learning repositories was poor, so we asked people why – it scares users or seems unmanageable to them. But people seemed to like Twitter – what's the difference? Well, it's easy to use and understand.
So Campus ROAR is an editorially mediated institutional publication – how do you make content available, capture that scholarly discourse, and provide tools for using that data? Cue a demo from Andy and Pat. We've been looking at making more digestible content from what we have in our repository. We have made an EPrints plugin (available in the Bazaar for EPrints 3.3) that makes content customisable and digestible. You can build a custom feed for academic news in your area. A web spider crawls the university webspace and identifies keywords, the user can input their own keywords, and it outputs a custom feed – it filters content for you. See: http://panfeed.ecs.soton.ac.uk/
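A toy sketch of the filtering idea behind those custom feeds: take items the crawler has already turned into a feed and keep only those matching the user's keywords. The feed URL is a placeholder, not an actual panfeed endpoint, and the matching logic is a deliberately crude stand-in for the plugin's own.

```python
import feedparser

FEED_URL = "http://panfeed.ecs.soton.ac.uk/example.rss"   # hypothetical feed of crawled campus news
USER_KEYWORDS = {"repositories", "linked data"}           # keywords supplied by the user

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Match the user's keywords against the title and summary of each item.
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    if any(kw in text for kw in USER_KEYWORDS):
        print(entry.title, entry.link)
```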
At the moment we have the Edinburgh, Glasgow and Southampton campuses already crawled for today, but we're happy to add the campus of anyone here!
The feed is designed to look great in FlipPad on the iPad and in similar apps. We'll be doing a Pecha Kucha going into more detail as well! Go check out the website. The other part of Campus ROAR is the EPrints Publisher plugin.
Q1) Are you planning to crawl more widely?
A1) At present we have 3 institutions included and we try to keep track of where the news is from. It's brand new, but we hope to be able to filter it down to specific campuses if you want to – for use by your comms team, say. Worth noting that it takes time to crawl new universities, so it would take time to broaden out.