Re: [las_users] how to frequently update thredds dataset in las
Roland Schweitzer wrote:
Hi Robert,
You are struggling with issues that have been on our radar screen for a
while. I appreciate having the details of your use case so we can
make sure the solutions we're working on meet your needs as best we
can. Below are some questions, some ideas and a few solutions.
Roland, thank YOU and your team for your dedication to this project, which has opened up new opportunities.
Robert Fuller wrote:
Hi,
We're setting up a configuration where a THREDDS dataset is accessible through LAS. The dataset is a regularly updated THREDDS aggregation of NcML files, an oceanographic forecast generated with ROMS. The THREDDS dataset has the same URL before and after the update.
At the moment we are using addXML to regenerate the relevant section of las.xml, then using the Tomcat manager to reload the LAS servlet to pick up these changes (in normal operation only the time arange element will change in the LAS dataset).
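Concretely, the cycle looks something like this (a sketch only; the manager URL and credentials are placeholders, and I've elided the exact addXML arguments):

# 1. regenerate the dataset section of las.xml from the THREDDS catalog
addXML ...

# 2. reload the LAS servlet so it picks up the new configuration
wget -q -O /dev/null --http-user=admin --http-password=secret \
    'http://localhost:8080/manager/reload?path=/las'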
This method of regenerating the LAS dataset is not ideal for a couple of reasons:
1. Ideally we would use a natural dataset id rather than the one generated from addXML (the id is based on a hash of the THREDDS URL, so it is persistent, which is good, but it is not humanly interesting)
Can you explain to me why the ID should be "humanly interesting"?
I've gotten some push back in the past from other LAS developers about
using a hash for the ID but I haven't been convinced there's a better
solution. Of course, internally LAS uses the ID to reference datasets
and if you were to create, save or type a LAS URL by hand you'd have
to know the ID, but the user interaction via the standard LAS user
interface should not directly require a human to know the ID. There's
probably a use case for having a human-readable ID so I'd be interested to hear about how it would help you.
If you're talking about the data set name that shows up in the LAS UI,
you can control that with a switch on the addXML command line to
either pass in the name you want or to tell addXML which global
attribute to use for the name.
Yes, in the LAS GUI the IDs are irrelevant.
One such case where it would be handy to use humanly interesting IDs is a deployment where we represent the same datasets on dev, test and production servers. I want to provide some simple script fragments (las xml and wml) to some users/administrators who are not familiar with this domain. It would be 'nice' if, other than the host name, these scripts worked out-of-the-box without too much thinking on the part of these users.
2. Ideally we would use natural variable names rather than those generated by addXML (persistent but not humanly interesting)
AddXML does the best job it can extracting the variable name from the
actual metadata in the file, but it could do a better job. As I type
this I realize it probably doesn't look for the CF standard name.
Right now it looks for a "long_name" attribute and uses that. If no
long_name is available it uses the actual netCDF variable name. Would it help if it looked at the CF standard name (standard names are pretty ugly, what with all those underscores and no capitalization), or if you could pass in the name of an attribute to use for the variable name?
The key to being successful with this issue is to work on addXML until
it can produce LAS configuration you like, because addXML will be the
basis of some automatic update facilities we're building right now.
Actually, what shows up in the GUI is good. I'm happy with that.
My one-line Perl script rewrites the addXML output into something that works for me. For the NE Atlantic output in foo.xml I use this line:
perl -p -i -e 's/([\w\d-]*?)(-){0,1}([xyzt]-){0,1}(id-[a-z0-9]+\b)/ne_atl_rect$2$1/g' foo.xml
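Spelled out with the /x modifier for readability, the same substitution reads (ne_atl_rect is simply the base name I want; the id-<hex> tail is the SHA1-derived part addXML generates):

s{
    ([\w\d-]*?)         # $1: variable or axis name, if any
    (-){0,1}            # $2: the dash joining name and hash, if any
    ([xyzt]-){0,1}      # $3: axis letter, dropped from the result
    (id-[a-z0-9]+\b)    # $4: the hash-based id itself
}{ne_atl_rect$2$1}gx;

so, for example, temp-t-id-3f2a... becomes ne_atl_rect-temp, and a bare id-3f2a... becomes ne_atl_rect.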
3. After a number of reloads of the LAS servlet, the Tomcat process runs short on PermGen space, which requires restarting the Tomcat server to resolve.
I have looked into a couple of ways of improving the situation:
i. Use a script to generate the LAS dataset XML, setting the arange to the new values, then reload the LAS servlet. This would address most of the issues noted above, but not number 3.
ii. Modify the las.xml dataset in such a way that it will still work after the THREDDS dataset has been updated. I've tried two options:
a. Remove the time arange from the LAS dataset. This breaks the LAS GUI (the date control vanishes) and also the WMS service (including GetCapabilities).
b. Set the time arange to an early start date with a large number of steps, as in the sketch below. This works in the LAS GUI provided the user knows the correct date to pick, but still breaks WMS.
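For concreteness, option b amounts to something like the following in las.xml (a sketch from memory; the exact axis markup and units in a real config may differ):

<axis id="ne_atl_rect-t" type="t" units="day">
    <arange start="2000-01-01" step="1" size="99999"/>
</axis>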
My suspicion is that if I do not reload the LAS servlet we will also
encounter problems with LAS caching older views.
Not sure what you mean by this. If the only thing that happens is that the time range of the data set extends, then a cache hit on a plot for Monday should still be valid on Tuesday even though the data set only extended to Monday when the plot was made.
You never know, our forecast could change ;-). The image cached yesterday may have shown sunshine tomorrow, but today the model is forecasting snow for tomorrow. When the dataset has been updated we need to clear any cached results. That is to say, the dataset does not simply extend (rather than extending, it slides), and some of the data within the set may be replaced.
I welcome suggestions on other options for updating the LAS dataset after the THREDDS dataset has been updated.
We have some help for this problem coming out in the next release
(which will be sometime in a couple of weeks -- I promise :-}). The
next release will include a beta version of the LAS manager's interface.
One feature of the interface is a reinit process. Using this
interface you can go to a particular URL (which is access-controlled
by the same mechanisms that control access to the THREDDS Data Server
installed with LAS) and ask LAS to reload its configuration. Since
this is an internal process to the running LAS servlet, it should not
have the same problems with the container reloading the servlet.
kewl.
You probably want to hit the reinit URL from a process instead of a
browser and I'm not sure exactly how to accomplish that right now, but
some other folks we're working with want to do this also so we'll
figure out a way.
We use wget.
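For example, once the reinit URL is known, a cron entry along these lines would do (the URL and credentials here are placeholders, not a real LAS endpoint):

# hit the reinit URL after each forecast run has been published
15 4 * * * wget -q -O /dev/null --http-user=lasadmin --http-password=secret \
    'http://ourserver:8080/las/manage?action=reinit'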
The manager's interface also includes a cache manager's interface
where you can empty the entire cache or clear the cache of only those
files associated with a particular dataset.
perfect.
Finally, in a future release -- not the one coming up this month, but relatively soon after -- we will have some new configuration options in LAS. This will allow you to configure an LAS directly from the
THREDDS URL and specify how often you want LAS to re-initialize its
configuration from that catalog. The new config will look something
like this:
<dataset
    src="http://pcmdi3.llnl.gov/thredds/esgcet/catalog.xml"
    src_type="THREDDS" update_time="23:00" update_interval="24 hours">
    <properties>
        <addXML>
            <esg>true</esg>
            <units_format>yyyy-M-d</units_format>
            <categories>true</categories>
            <global_title_attribute>experiment_id</global_title_attribute>
        </addXML>
    </properties>
</dataset>
OK. For us the on-demand feature mentioned will be more useful than the schedule, but I can appreciate this use case also.
In this case, LAS will add the data sets and variables it finds in the
THREDDS catalog using the addXML parameters in the properties
section. This means the dates will be parsed using "yyyy-M-d", the
categories matching the THREDDS hierarchy will be included and the
data sets will be named using the value of the "experiment_id" global
attribute in the file. (The "esg" parameter means the catalog includes ESG metadata, and LAS will use that metadata when building its configuration.)
Using the update_time and update_interval, LAS will mark each dataset with an "expires" attribute. Once the update_interval has passed (24 hours in this case), LAS will re-initialize itself the next time the update_time (11pm here) rolls around. During that process, the configuration for this THREDDS catalog will be regenerated because its expires date will have passed.
If a catalog is marked with an expires date that has not yet passed, then the configuration for that catalog will be read from the cache. If the original configuration did not include an update_time and update_interval, the configuration will always be read from the cache and the catalog will never be re-read, unless the configuration for that catalog is missing from the cache for some reason.
Finally, LAS will compute the next time it should reinitialize based
on the minimum time to the next "expires" time.
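In code terms, the scheduling rule amounts to something like this (a sketch only; the function name is made up for illustration):

# the next reinit is the soonest "expires" time across all catalogs
use List::Util qw(min);

sub next_reinit_time {
    my @expires = @_;      # epoch seconds, one entry per catalog
    return min(@expires);  # sleep until the earliest expiry, then re-read
}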
Yes, that sounds interesting and useful alright.
This means the addXML logic will now be internal to LAS, which is why it's critical for us to figure out how to get a configuration that meets your needs directly from addXML. Help me out with suggestions for how addXML could do a better job reading the catalogs you're interested in using.
My idea is to add a "-idbase" command line option which will be used rather than the SHA1 hash in generating the IDs. This way I can call addXML with -idbase ne_atl_rect, which will have the same effect as my Perl one-liner!
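Something like this, hypothetically (the flag does not exist yet, and I have elided the other arguments we pass today):

addXML -idbase ne_atl_rect ...
# ids would come out as ne_atl_rect-temp and so on, instead of temp-id-<sha1>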
Thanks again, Roland, yer doing great work!
Roland
Thanks,
Robert.
--
Robert Fuller, Applepie Solutions,
5 Woodlands Avenue, Renmore, Galway, Ireland.
+353.86.0507760 http://www.aplpi.com
Registered in Ireland, no. 289353