Library Data

Preprocessed/analysed some library data, to build an app on later:

Results:

  • Recommendations/related materials, based on a similarity measure similar to cosine similarity, but with a slightly different weighting for more relevant results.
  • Eigenvector analysis (truncated SVD) of materials/loans, which can be used as a fast “relatedness” distance between materials (and will also be interesting to explore, to see if it can be used for a computational/empirical approach to “genres”)
  • Easy-to-use data model (triples a la LOD) from the different sources, loaded into CouchDB
  • Compressed local copy/backup of data

Tools:

  • python, leveldb, gensim, couchdb

Starting point:

  • Previously created a similar recommendation calculation
  • Have taught eigenvector analysis etc., but never applied it to problems this large before

Takeaways:

  • It is easy to run SVD on large datasets (unlike the last time I looked at it, 10 years ago). Gensim crunches through large data surprisingly fast: 200 main eigenvalues on a ca. 1Mx500K sparse matrix in <16GB and less than a night. (See the sketch after this list.)
  • Mahout/Spark is overkill + overhead when you do not need it and do not have a cluster. (Installed Cloudera/CDH, Spark etc. and got started, but then realised that it was easier just to implement it in Python.)
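
A minimal sketch of the gensim step, assuming the loan data has already been dumped as a sparse matrix in Matrix Market format; the filename "loans.mm" and the row layout (one row per material, one (borrower, count) pair per non-zero) are placeholders, not the actual pipeline:

    from gensim import corpora, models

    # Stream the sparse loan matrix from disk; it is never held fully in RAM.
    corpus = corpora.MmCorpus("loans.mm")

    # Truncated SVD keeping the 200 largest singular values.
    lsi = models.LsiModel(corpus, num_topics=200)

    # Project one material (a bag of (borrower_id, count) pairs) into the
    # 200-dimensional space, rounding a la float("%.3g" % num).
    first_doc = next(iter(corpus))
    doc_vector = [(dim, float("%.3g" % weight)) for dim, weight in lsi[first_doc]]
    print(doc_vector[:5])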

Notes

  • Just a quick hack, code not really reusable except for rerunning data extraction
  • Data model: object-property-value triples with multi-value properties, represented as JSON: {"_id": "object-id", "prop1": ["val1"], "prop2": ["val2", "val3"], ...} and loaded into CouchDB. For details, see the actual data. (See the data-model sketch after these notes.)
  • Running time (expected):
    • scraping/downloading data: 10 days
    • analysis + load into database: <1 day
  • Covers are copyrighted, so the app cannot copy/cache them, but it should be OK to link to them
    • we can get cover URLs for some materials via the vejlebib API; they need requerying as they time out
    • there are covers for many ISBNs on bogpriser.dk in the form http://www.bogpriser.dk/Covers/611/9788759517611.jpg, where 978…611 is the ISBN and 611 is just the last part of the ISBN (see the cover-URL sketch after these notes)
    • we should be able to cache the dimensions/ratio and major/average colors, for incremental loading.
    • Idea, not implemented this sprint: autogenerated covers
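
A sketch of the data-model conversion and bulk load mentioned above; couchdb-python is assumed, and the server URL and the database name "library" are placeholders for whatever the real setup uses:

    from collections import defaultdict
    import couchdb

    def triples_to_docs(triples):
        # Group (object-id, property, value) triples into one JSON document per
        # object, with every property holding a list of values (multi-value).
        grouped = defaultdict(lambda: defaultdict(list))
        for obj_id, prop, value in triples:
            grouped[obj_id][prop].append(value)
        docs = []
        for obj_id, props in grouped.items():
            doc = {"_id": obj_id}
            doc.update(props)
            docs.append(doc)
        return docs

    triples = [["ting-id", "prop1", "val"],
               ["ting-id", "prop2", "val1"],
               ["ting-id", "prop2", "val2"]]
    docs = triples_to_docs(triples)
    # -> [{"_id": "ting-id", "prop1": ["val"], "prop2": ["val1", "val2"]}]

    server = couchdb.Server("http://localhost:5984/")  # placeholder URL
    db = server["library"]                             # placeholder database name
    db.update(docs)                                    # bulk insert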
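
And a tiny sketch of building the bogpriser.dk cover URL from an ISBN-13, assuming that "the last part" means the last three digits, as in the example above:

    def bogpriser_cover_url(isbn13):
        isbn13 = isbn13.replace("-", "").strip()
        return "http://www.bogpriser.dk/Covers/%s/%s.jpg" % (isbn13[-3:], isbn13)

    print(bogpriser_cover_url("9788759517611"))
    # http://www.bogpriser.dk/Covers/611/9788759517611.jpg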

Progress

Done:

  • load into couchdb
  • if time permits: SVD of loan data to find “genre?”-dimensions.
  • svd -> doc-vector, rounded, i.e. float("%.3g" % (num,))
  • svd started
  • conversion of triples into the JSON encoding for CouchDB, a la {"_id": "ting-id", "prop1": ["val"], "prop2": ["val1", "val2"]} for [["ting-id","prop1","val"], ["ting-id","prop2","val1"], ["ting-id","prop2","val2"]] (see the data-model sketch above)
  • download hack4dk data from DBC + recompress as xz
  • script for scraping metadata from the vejlebib dev api (takes ca. 10 days to run)
    • includes info on whether a cover image is available through the API
    • includes ISBN etc.
  • Looked into + installed Mahout (Cloudera/CDH), but concluded after some experimentation/study that it is overkill when I only have a single machine and not a cluster
  • Implemented a script that finds recommendations/related materials, using my weighting from previous experiments (takes one night to run).
    • The relatedness of a to b is given by: number of cooccurrences of a with b / sqrt(10 + total count of a). This is essentially how often a cooccurs with b, with popular materials weighted down so that they do not become overrepresented. In the sqrt(10 + …) weighting factor, sqrt gives a monotonically increasing weight somewhere between the identity (too steep) and log (not steep enough), and the 10 + makes sure that small counts do not affect the result disproportionately. (See the sketch below.)
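
A sketch of that relatedness calculation; for illustration the cooccurrence and loan counts are assumed to fit in plain in-memory dicts, whereas the real run streamed the loan data and took a night:

    from math import sqrt

    def relatedness(a, b, cooccur, count):
        # How related candidate material a is to material b: cooccurrences of
        # a with b, with popular a's weighted down by sqrt(10 + count of a).
        return cooccur.get((a, b), 0) / sqrt(10 + count.get(a, 0))

    def recommendations(b, materials, cooccur, count, n=10):
        # Rank the other materials by their relatedness to b.
        scored = [(relatedness(a, b, cooccur, count), a) for a in materials if a != b]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [a for score, a in scored[:n] if score > 0]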