
Re: [las_users] how to frequently update thredds dataset in las



Hi Robert,

You are struggling with issues that have been on our radar screen for a while. I appreciate having the details of your use case so we can make sure the solutions we're working on meet your needs as well as possible. Below are some questions, some ideas, and a few solutions.

Robert Fuller wrote:
Hi,

We're setting up a configuration where a THREDDS dataset is accessible through LAS. The dataset is a regularly updated THREDDS aggregation of NcML files, an oceanographic forecast generated with ROMS. The THREDDS dataset has the same URL before and after the update.

At the moment we are using addXML to regenerate the relevant section of las.xml, then using the Tomcat manager to reload the LAS servlet to pick up these changes (in normal operation only the time arange element will change in the LAS dataset).

This method of regenerating the LAS dataset is not ideal for a few reasons:

1. Ideally we would use a natural dataset ID rather than the one generated by addXML (the ID is based on a hash of the THREDDS URL, so it is persistent, which is good, but it is not humanly interesting).
Can you explain to me why the ID should be "humanly interesting"? I've gotten some pushback in the past from other LAS developers about using a hash for the ID, but I haven't been convinced there's a better solution. Internally, of course, LAS uses the ID to reference datasets, and if you were to create, save, or type a LAS URL by hand you'd have to know the ID, but user interaction via the standard LAS user interface should not directly require a human to know the ID. There's probably a use case for a human-readable ID, so I'd be interested to hear how it would help you.

If you're talking about the dataset name that shows up in the LAS UI, you can control that with a switch on the addXML command line, either to pass in the name you want or to tell addXML which global attribute to use for the name.
2. Ideally we would use natural variable names rather than those generated by addXML (persistent but not humanly interesting).
AddXML does the best job it can extracting the variable name from the actual metadata in the file, but it could do better. As I type this, I realize it probably doesn't look for the CF standard name. Right now it looks for a "long_name" attribute and uses that; if no long_name is available, it uses the actual netCDF variable name. Would it help if it looked at the CF standard name (standard names are pretty ugly, what with all the underscores and no capitalization), or if you could pass in the name of an attribute to use for the variable name?
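To make that lookup order concrete, here's a rough Python sketch (addXML itself is Java, so this is illustrative only, not its actual source; the file name is made up):

    # Sketch of how addXML might choose a display name for a variable.
    # Current behavior: long_name, else the raw netCDF variable name.
    # Proposed addition: try the CF standard_name in between.
    from netCDF4 import Dataset

    def display_name(var):
        name = getattr(var, "long_name", None)
        if name:
            return name
        name = getattr(var, "standard_name", None)
        if name:
            # e.g. "sea_water_potential_temperature" -> "Sea water potential temperature"
            return name.replace("_", " ").capitalize()
        return var.name

    ds = Dataset("roms_forecast.nc")  # hypothetical file name
    for v in ds.variables.values():
        print(v.name, "->", display_name(v))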

The key to success with this issue is to work on addXML until it can produce a LAS configuration you like, because addXML will be the basis of some automatic update facilities we're building right now.
3. After a number of reloads of the LAS servlet, the Tomcat process runs short on PermGen space, which requires restarting the Tomcat server to resolve.

I have looked into a couple of ways of improving the situation:

i. Use a script to generate the LAS dataset XML, setting the arange to the new values, then reload the LAS servlet (roughly the sketch shown after option ii below). This would address most of the issues noted above, but not number 3.

ii. Modify the las.xml dataset in such a way that it will still work after the THREDDS dataset has been updated. I've tried two options:

a. Remove the time arange from the LAS dataset. This breaks the LAS GUI (the date control vanishes) and also the WMS service (including GetCapabilities).

b. Set the time arange to an early start date with a large number of steps. This works in the LAS GUI provided the user knows the correct date to pick, but still breaks WMS.
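For option i, the script I have in mind is roughly the following (a sketch only; the element and attribute names match what I see in our las.xml and may differ elsewhere):

    # Sketch: rewrite the time arange in las.xml after the forecast updates.
    # Assumes the time axis is an <axis type="t"> with an <arange> child
    # whose "start" and "size" attributes describe the time steps.
    import xml.etree.ElementTree as ET

    def update_time_arange(las_xml, new_start, new_size):
        tree = ET.parse(las_xml)
        for axis in tree.getroot().iter("axis"):
            if axis.get("type") == "t":
                arange = axis.find("arange")
                if arange is not None:
                    arange.set("start", new_start)     # e.g. "2008-04-01"
                    arange.set("size", str(new_size))  # number of time steps
        tree.write(las_xml)

    update_time_arange("las.xml", "2008-04-01", 72)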

My suspicion is that if I do not reload the LAS servlet, we will also encounter problems with LAS caching older views.
I'm not sure what you mean by this. If the only thing that happens is that the time range of the data set extends, then a cache hit on a plot for Monday should still be valid on Tuesday, even though the data set only extended to Monday when the plot was made.


I welcome suggestions on other options for updating the LAS dataset after the THREDDS dataset has been updated.
We have some help for this problem coming out in the next release (which will be sometime in the next couple of weeks -- I promise :-}). The next release will include a beta version of the LAS manager's interface. One feature of the interface is a reinit process: using it, you can go to a particular URL (which is access-controlled by the same mechanisms that control access to the THREDDS Data Server installed with LAS) and ask LAS to reload its configuration. Since this is a process internal to the running LAS servlet, it should not have the same problems as having the container reload the servlet.

You probably want to hit the reinit URL from a process instead of a browser, and I'm not sure exactly how to accomplish that right now, but some other folks we're working with want to do this too, so we'll figure out a way.
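Off the top of my head, something like the following should do it once the real URL and credentials are known (the URL and login below are placeholders, not the actual ones):

    # Sketch: hit the LAS reinit URL from a script instead of a browser.
    import urllib.request

    url = "http://your.server/las/reinit"  # hypothetical reinit URL

    # If the URL is protected with HTTP basic auth:
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, "admin", "secret")  # placeholder credentials
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))

    with opener.open(url) as response:
        print(response.status, response.read().decode())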

The manager's interface also includes a cache manager's interface, where you can empty the entire cache or clear only those files associated with a particular dataset.

Finally, in a future release -- not the one coming up this month, but relatively soon after -- we will have some new configuration options in LAS. These will allow you to configure LAS directly from a THREDDS URL and specify how often you want LAS to re-initialize its configuration from that catalog. The new config will look something like this:

<dataset src="http://pcmdi3.llnl.gov/thredds/esgcet/catalog.xml" src_type="THREDDS" update_time="23:00" update_interval="24 hours">
    <properties>
        <addXML>
            <esg>true</esg>
            <units_format>yyyy-M-d</units_format>
            <categories>true</categories>
            <global_title_attribute>experiment_id</global_title_attribute>
        </addXML>
    </properties>
</dataset>


In this case, LAS will add the data sets and variables it finds in the THREDDS catalog using the addXML parameters in the properties section. This means the dates will be parsed using "yyyy-M-d", categories matching the THREDDS hierarchy will be included, and the data sets will be named using the value of the "experiment_id" global attribute in the file. (The "esg" parameter means the catalog includes ESG metadata, and LAS will use that metadata when building its configuration.)

Using the update_time and update_interval, LAS will mark each dataset with an "expires" attribute. Once the update_interval has passed (24 hours in this case), LAS will re-initialize itself the next time 11pm rolls around. During that process, the configuration for this THREDDS catalog will be regenerated because its expires date will have passed.

If a catalog is marked with an expires date but that date has not yet passed, the configuration for that catalog will be read from the cache. If the original configuration did not include an update_time and update_interval, the configuration will always be read from the cache, and the catalog will never be re-read unless its configuration is missing from the cache for some reason.

Finally, LAS will compute the next time it should re-initialize as the minimum of all the "expires" times.
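In pseudocode, the bookkeeping described above looks roughly like this (a Python sketch with illustrative names, not the actual LAS implementation):

    # Sketch of the expires/reinit bookkeeping.
    from datetime import datetime, time, timedelta

    def next_expires(last_init, update_time, update_interval):
        # First occurrence of update_time that is at least
        # update_interval after the last initialization.
        earliest = last_init + update_interval
        candidate = datetime.combine(earliest.date(), update_time)
        if candidate < earliest:
            candidate += timedelta(days=1)
        return candidate

    # Each dataset read from a THREDDS catalog carries an expires stamp.
    expires = {
        "roms_forecast": next_expires(datetime(2008, 4, 1, 9, 30),
                                      time(23, 0), timedelta(hours=24)),
        "esg_catalog":   next_expires(datetime(2008, 4, 2, 14, 0),
                                      time(23, 0), timedelta(hours=24)),
    }

    # LAS re-initializes at the minimum of the expires times; at that point
    # any catalog whose expires date has passed is re-read from THREDDS and
    # the rest are served from the cache.
    print(min(expires.values()))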

All of this means the addXML logic will be internal to LAS, which is why it's critical for us to figure out how to get a configuration that meets your needs directly from addXML. Help me out with suggestions for how addXML could do a better job reading the catalogs you're interested in using.

Roland


Thanks,
Robert.



